WEBVTT
Kind: captions
Language: en

00:00:00.030 --> 00:00:05.549
Hey everyone, we're coming back with Data
Exposed, and for the first episode here

00:00:05.549 --> 00:00:11.280
we're going to talk about Machine Learning
Services in Azure HDInsight.

00:00:11.280 --> 00:00:15.839
Katherine is gonna tell us more about what we've introduced in this new release and what

00:00:15.839 --> 00:00:20.910
are the various components that you can
start leveraging right away with your

00:00:20.910 --> 00:00:25.800
HDInsight clusters. So join us for this
video, and we'll have a lot of fun talking

00:00:25.800 --> 00:00:34.530
and demoing the capabilities. Welcome,
everybody!

00:00:34.530 --> 00:00:40.530
We're here reviving the 'Data Exposed' show and the first topic that we have today

00:00:40.530 --> 00:00:44.770
on the Data Exposed show here is ML
Services in Azure HDInsight.

00:00:44.770 --> 00:00:49.140
 

00:00:49.140 --> 00:00:54.360
We're doing a new announcement with Machine Learning
Services 9.3. I'm Nishant Thacker. I'm a

00:00:54.360 --> 00:01:00.090
Technical Product Manager for Analytics
and AI at Microsoft and today as a guest

00:01:00.090 --> 00:01:04.710
we have Katherine Kampf. Katherine, why
don't you introduce yourself? Hello, as

00:01:04.710 --> 00:01:08.939
Nishant said my name is Katherine, I'm a
Program Manager on the Azure HDInsight

00:01:08.939 --> 00:01:12.750
team and I'm excited to be here
introducing our latest release of ML

00:01:12.750 --> 00:01:16.530
Services. Awesome! Thanks Katherine for
joining us today.

00:01:16.530 --> 00:01:23.070
So, we're talking about ML Services
in HDInsight. Let's do a quick refresher

00:01:23.070 --> 00:01:30.030
first on what HDInsight is for audiences.
Yeah, so HDInsight as some of you

00:01:30.030 --> 00:01:34.860
may know is a fully managed analytics
service hosted on Azure, so we

00:01:34.860 --> 00:01:39.270
let you spin up clusters in just a few
minutes and then it's got management

00:01:39.270 --> 00:01:44.850
capabilities on our end and a 99.9
percent availability SLA and a bunch of

00:01:44.850 --> 00:01:49.320
tooling built around it, whether you want
enhancements in monitoring or

00:01:49.320 --> 00:01:53.850
different development environments. We
want to make it as easy as possible. We

00:01:53.850 --> 00:01:56.640
have a bunch of different cluster types
whether you want to use Spark,

00:01:56.640 --> 00:02:02.310
Hadoop, Hive, HBase, or what we're talking about today, which is the previously named R

00:02:02.310 --> 00:02:07.320
Server, and now our new introduction of the
ML Services cluster type in HDI. Perfect!

00:02:07.320 --> 00:02:12.060
Thanks, Katherine. So, just to
understand HDInsight better and move on

00:02:12.060 --> 00:02:17.050
to the ML Services offering, HDInsight is our, kind of, open

00:02:17.050 --> 00:02:20.800
source offering for all the various
cluster types. If you want to do

00:02:20.800 --> 00:02:24.790
streaming with Kafka and Storm, you want
to do NoSQL with HBase, you want to

00:02:24.790 --> 00:02:30.730
bring in Hadoop or Spark, all of that as a
hosted cluster environment within the

00:02:30.730 --> 00:02:35.470
Azure infrastructure over here and you
can leverage any of these open source

00:02:35.470 --> 00:02:37.960
toolkits and use them to your
advantage.

00:02:37.960 --> 00:02:44.920
Now, all this while with HDInsight we've
had something called R Server, and now

00:02:44.920 --> 00:02:51.370
we're expanding this R Server capability to
be Machine Learning Services which is

00:02:51.370 --> 00:02:55.180
really exciting.
So now we're bringing Python in addition

00:02:55.180 --> 00:02:57.280
to R... Exactly!

00:02:57.280 --> 00:02:59.500
Tell me a little more about this,
Katherine. Yeah, so with R Server we started

00:02:59.500 --> 00:03:03.430
getting a lot of popularity in our
community. R users really loved it but

00:03:03.430 --> 00:03:07.810
we know data scientists don't just use
R anymore. Data scientists love Python

00:03:07.810 --> 00:03:12.880
as well, and we didn't want to alienate
that audience. We wanted to expand as

00:03:12.880 --> 00:03:15.700
many of our capabilities as possible
to tailor to the Python community

00:03:15.700 --> 00:03:20.350
as well, and let you use R or
Python, whatever you desire, or if you want

00:03:20.350 --> 00:03:24.400
to use both, all within the same cluster
type on HDInsight. Now, this is really

00:03:24.400 --> 00:03:29.470
important. As Katherine mentioned, Machine Learning Services brings the best of R

00:03:29.470 --> 00:03:35.770
and Python over here trying to bring it
closer to what your comfort zone is as a

00:03:35.770 --> 00:03:40.239
data scientist and you can play around
with any of the tools, any of the

00:03:40.239 --> 00:03:45.190
frameworks, any of the libraries in these
languages and bring them to distribute

00:03:45.190 --> 00:03:50.680
it across the Spark cluster in HDInsight.
So tell me a little more about what the

00:03:50.680 --> 00:03:56.350
differences between ML Services and
R Server in HDInsight are.

00:03:56.350 --> 00:04:00.010
Yes, of course. Spark is a very
exciting technology. We see a lot of

00:04:00.010 --> 00:04:04.060
usage for it in the open source
community and we as Microsoft wanted to

00:04:04.060 --> 00:04:07.959
be able to combine the power of that
open source and the community around it

00:04:07.959 --> 00:04:12.610
with some of our proprietary investments
in the ML and AI space. So with that

00:04:12.610 --> 00:04:16.660
we've got a bunch of strong functions
for parallelization and pleasingly

00:04:16.660 --> 00:04:20.919
parallel workloads as well as some of
our pre-trained models that we've built

00:04:20.919 --> 00:04:24.700
out that you can now use and take
advantage of within the ML Services cluster.

00:04:26.920 --> 00:04:30.130
This is great. So in
addition to just the Revolution R that

00:04:30.130 --> 00:04:35.530
is kind of imbibed so very natively
inside of all of our assets now we're

00:04:35.530 --> 00:04:41.260
taking Python and bringing that closer over
here as well with the help of like

00:04:41.260 --> 00:04:45.520
creating some of these algorithms and
pre-training them so that you don't have

00:04:45.520 --> 00:04:50.380
to start from scratch. Exactly. Right,
let's dig deeper into ML Services and

00:04:50.380 --> 00:04:54.220
understand the individual components
there, Katherine. Yeah, so these are a few of

00:04:54.220 --> 00:04:58.390
our major features. Of course, we talked
about the Python support which is very

00:04:58.390 --> 00:05:03.460
exciting and as I was mentioning with
this being a Microsoft investment we

00:05:03.460 --> 00:05:08.350
have really easy, simple
operationalization, so if you want

00:05:08.350 --> 00:05:11.290
to deploy in SQL Server if that's
where your data lives we want to make

00:05:11.290 --> 00:05:15.820
that easy for you to do, or a RESTful API.
We want to enhance all of those and

00:05:15.820 --> 00:05:19.390
communicate with a bunch of different
Azure data sources or on-premises data

00:05:19.390 --> 00:05:23.350
sources to make sure that we can
communicate with wherever your data

00:05:23.350 --> 00:05:28.150
lives and we have a bunch of different
parallel algorithms or if you want to

00:05:28.150 --> 00:05:32.680
write your own. You know, we know
that data scientists want to use the

00:05:32.680 --> 00:05:36.010
best algorithm for their current
situation so if you want to try out

00:05:36.010 --> 00:05:41.229
H2O, if you want to use pure Spark with
sparklyr or PySpark, we want to let you

00:05:41.229 --> 00:05:44.530
easily interoperate, so you can try out a
bunch of different things as a data

00:05:44.530 --> 00:05:48.370
scientist. We know it's a difficult process;
you want to try as many things as

00:05:48.370 --> 00:05:50.669
possible and we want to make it easy for
you to do that.

00:05:50.669 --> 00:05:54.640
Perfect. So you want to start with
SQL Server, you have Machine Learning

00:05:54.640 --> 00:05:59.169
Services in SQL Server, you want to
move on to big data, you have Machine

00:05:59.169 --> 00:06:04.030
Learning Services with HDInsight, and
then if you want to use some third-party

00:06:04.030 --> 00:06:08.620
libraries you can go ahead and bring
about sparklyr and H2O, and all of that

00:06:08.620 --> 00:06:12.580
goodness of it inside of this
environment over here. Yeah, so we want to

00:06:12.580 --> 00:06:16.210
make it as flexible as possible for you
to use your preferred libraries,

00:06:16.210 --> 00:06:18.345
languages, frameworks etc.

00:06:18.345 --> 00:06:22.780
Awesome, enough
of talking. Let's actually dive into some

00:06:22.780 --> 00:06:27.610
showing, and tell you about that. So let's
see a demo on Machine Learning Services

00:06:27.610 --> 00:06:33.610
in HDInsight. So if we go... so I have an
HDInsight ML Services 9.3 cluster type

00:06:33.610 --> 00:06:37.330
and this is my JupyterHub. So I'm a big
fan of Jupyter, and if you're an R

00:06:37.330 --> 00:06:40.810
user, of course we ship RStudio as
well. So RStudio,

00:06:40.810 --> 00:06:45.520
in addition, is available on the HDI edge
node, so it's easy for you to take advantage

00:06:45.520 --> 00:06:51.639
of that as an R user, but I like Jupyter.
So here's a simple example of training a

00:06:51.639 --> 00:06:56.830
model. So we can start off and get our
rxSparkConnect, so this will start off

00:06:56.830 --> 00:07:02.320
our Spark session and we can immediately
start taking advantage of some of the

00:07:02.320 --> 00:07:06.520
pure Spark capabilities. So we can use
Spark's read functionality and pull some

00:07:06.520 --> 00:07:10.690
data from a CSV. So this is a standard
flight and weather data set. I'm sure

00:07:10.690 --> 00:07:14.860
you've seen it before but we're going to
be predicting airline delays whether or

00:07:14.860 --> 00:07:19.690
not a flight will be delayed by 15
minutes. So you can see here we go

00:07:19.690 --> 00:07:23.860
through some standard Spark data
transformations and split into our test

00:07:23.860 --> 00:07:27.700
and training data sets. And so this is
where we get into some of the ML

00:07:27.700 --> 00:07:31.870
Services-specific functionality with our
rxLogit function, and with this we can

00:07:31.870 --> 00:07:36.610
train a logistic regression on those
data frames we just built out so right

00:07:36.610 --> 00:07:42.040
now we have a test data frame and a
training data frame, built together from

00:07:42.040 --> 00:07:45.250
that flights data set and the weather
data set to pull together some

00:07:45.250 --> 00:07:49.930
predictions on delays. Now, this is
important to notice here because ideally

00:07:49.930 --> 00:07:54.820
you would think that all of this is just
part of Spark, but this is actually not part

00:07:54.820 --> 00:07:59.229
of Spark. This is in addition to what the
Spark functionality offers as part of

00:07:59.229 --> 00:08:04.060
its Python capabilities inside of Spark
and we are extending that with the

00:08:04.060 --> 00:08:08.710
Machine Learning capabilities to take it
even a step further by distributing

00:08:08.710 --> 00:08:14.140
algorithms that Spark natively doesn't
understand. We pre-built them, we brought

00:08:14.140 --> 00:08:18.280
them to a stage where Spark can now
distribute them natively inside, and we're

00:08:18.280 --> 00:08:22.330
using specialized functions like rxLogit to go ahead and distribute them.

00:08:22.330 --> 00:08:26.350
Exactly. Yeah, so we can train this model
and pull out some of those key features

00:08:26.350 --> 00:08:29.979
we want to use and then here we can see
what our model looks like and then we

00:08:29.979 --> 00:08:34.990
can pull in the help of another rx
function, rxPredict, to see how our model's

00:08:34.990 --> 00:08:38.650
performing. So we just trained
on our training data set; now we want to

00:08:38.650 --> 00:08:42.849
see what our testing looks like so we
use our predict function to put together

00:08:42.849 --> 00:08:47.070
our predictions based on our test data
frame and then we can even pull in

00:08:47.070 --> 00:08:51.250
scikit-learn
to do some accuracy analysis so we can

00:08:51.250 --> 00:08:54.610
look at the area under the curve and see
where our model's performing,

00:08:54.610 --> 00:08:59.079
and it's about 64% so it's a solid
starting point and from there we could

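NOTE
A minimal editorial sketch, not the notebook shown in the video: the area under the ROC curve mentioned here can be read as the chance that a randomly chosen delayed flight gets a higher predicted score than a randomly chosen on-time one. The function name roc_auc and the toy labels and scores are assumptions made up for illustration. The demo uses scikit-learn, which offers this metric as sklearn.metrics.roc_auc_score; the pure-Python version below only shows the idea.
```python
# Illustrative only: pure-Python area under the ROC curve (AUC).
# The labels and scores are made-up toy values, not data from the demo.
def roc_auc(labels, scores):
    """Return the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
toy_labels = [0, 0, 1, 1]          # 1 = delayed, 0 = on time (hypothetical)
toy_scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities
toy_auc = roc_auc(toy_labels, toy_scores)
```
An AUC of 0.5 is chance level and 1.0 is a perfect ranking, which is why a value of about 0.64 is described as a starting point.
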
00:08:59.079 --> 00:09:02.350
play around with different algorithms or
different features to try to build that

00:09:02.350 --> 00:09:08.740
up. Another good function that ML
Services comes with is rxExecBy, and

00:09:08.740 --> 00:09:12.220
what this lets you do is say you only
care about certain carriers in this

00:09:12.220 --> 00:09:15.430
flight scenario. Maybe you're looking for
a new credit card and want to see who has

00:09:15.430 --> 00:09:20.260
the most delays where you should invest
in getting your miles and so here you

00:09:20.260 --> 00:09:23.920
can actually use that same logistic
regression function you just set up

00:09:23.920 --> 00:09:29.110
previously and split this up. You can use
this keys equals carrier, and what this

00:09:29.110 --> 00:09:34.089
will do is divide up your data and build
a model for each of those individual

00:09:34.089 --> 00:09:38.380
airlines, which is really powerful and
can be applicable in a lot of situations.

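NOTE
A minimal editorial sketch of the split-by-key idea described here, assuming nothing about the real function's signature: the data is partitioned on a key column and the same function runs once per partition. The names exec_by_key, delay_rate, and the toy flight rows are hypothetical. A simple delay rate stands in for the logistic regression, and this version runs in a single process rather than across a Spark cluster.
```python
# Illustrative only: apply one function per key, in plain Python.
from collections import defaultdict
def exec_by_key(rows, key, func):
    """Partition rows (dicts) by rows[key] and apply func to each partition."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {k: func(part) for k, part in groups.items()}
def delay_rate(part):
    """Stand-in for model training: fraction of delayed flights in the partition."""
    return sum(r["delayed"] for r in part) / len(part)
toy_flights = [  # hypothetical rows, not the demo data set
    {"carrier": "AA", "delayed": 1},
    {"carrier": "AA", "delayed": 0},
    {"carrier": "DL", "delayed": 0},
    {"carrier": "DL", "delayed": 0},
    {"carrier": "UA", "delayed": 1},
]
per_carrier = exec_by_key(toy_flights, "carrier", delay_rate)
```
Each key gets its own result, which mirrors getting one model per carrier in the demo.
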
00:09:38.380 --> 00:09:42.399
So here, obviously, we get a bit more
variability in what accuracies we're

00:09:42.399 --> 00:09:46.750
seeing. So we see some a little lower, at
62 but you can see we get some as high

00:09:46.750 --> 00:09:52.600
up as 68%, which is really great for
individual carriers. Yeah, so that's an

00:09:52.600 --> 00:09:57.339
introduction of some of the new exciting
functions we're bringing to Python and

00:09:57.339 --> 00:10:01.329
of course all of these are using the
power of Spark, so they parallelize well and

00:10:01.329 --> 00:10:06.190
they run incredibly fast.  All right, so
this also comes with all the goodness of

00:10:06.190 --> 00:10:10.480
the open source engine that Spark is
itself, so you don't have to learn a

00:10:10.480 --> 00:10:14.500
new engine altogether, you can bring all
of your knowledge from the Spark

00:10:14.500 --> 00:10:19.329
perspective and extend it with newer
algorithms, newer models, newer

00:10:19.329 --> 00:10:22.959
capabilities inside of Spark. Now
Katherine, tell us a little more about

00:10:22.959 --> 00:10:26.890
what's the difference between these two? Is
it like Spark has some Python and R

00:10:26.890 --> 00:10:31.540
capabilities with SparkR and native
Python support inside of Spark with PySpark?

00:10:31.540 --> 00:10:36.010
Yeah. And what's the difference
between what Machine Learning Services

00:10:36.010 --> 00:10:40.120
offers and what Spark has natively? Yeah,
so it's a lot of what I said: the

00:10:40.120 --> 00:10:44.080
integration with the Microsoft ecosystem
and our investments in

00:10:44.080 --> 00:10:48.670
ML. So, especially, one, I think,
important thing to notice is the

00:10:48.670 --> 00:10:52.150
pre-trained models so if you want to do
image featurization or you want to do

00:10:52.150 --> 00:10:55.840
sentiment analysis but you're a smaller
company or startup and you don't have

00:10:55.840 --> 00:11:00.100
access to those massive amounts of data to
train those models, we ship them with ML

00:11:00.100 --> 00:11:03.760
Services. You can easily start either
using those directly or doing some

00:11:03.760 --> 00:11:07.450
transfer learning. So that's something...
that's an exciting functionality, I

00:11:07.450 --> 00:11:12.220
think, it brings. That's awesome.
All right, let's dig a little deeper

00:11:12.220 --> 00:11:17.010
into what are the capabilities that
Machine Learning Services actually has.

00:11:17.010 --> 00:11:22.930
So we know it has a JupyterHub, we
know we host, like, an RStudio inside

00:11:22.930 --> 00:11:29.050
an edge node, and that's important to
understand like ML Services in HDInsight

00:11:29.050 --> 00:11:32.920
actually comes with the cluster and then
it comes with an edge node which gets

00:11:32.920 --> 00:11:38.020
the client tools installed and made
available for users to tap into it. Yeah..

00:11:38.020 --> 00:11:40.870
So tell us a little more on what the
structure is, and what are the components

00:11:40.870 --> 00:11:45.850
of.. Yeah so ML Services is... actually
standard HDInsight clusters are driven

00:11:45.850 --> 00:11:50.230
by a head node but with ML Services we
utilize the edge node as well to drive

00:11:50.230 --> 00:11:54.400
our Spark workloads and with that it
makes it easier for us to continue to

00:11:54.400 --> 00:11:58.840
ship different developer tools for you,
in particular RStudio. When we were

00:11:58.840 --> 00:12:02.290
first releasing R Server, we said, we
know users like this and we want to make

00:12:02.290 --> 00:12:05.040
it as easy as possible for them. We don't
want them to have to do any additional

00:12:05.040 --> 00:12:09.670
installation, so we ship it right there.
Once your HDI cluster's ready to go, you

00:12:09.670 --> 00:12:13.060
can immediately log into R Server and
start training your models or

00:12:13.060 --> 00:12:16.930
experimenting with your data science
workloads. And as well, of course, we have

00:12:16.930 --> 00:12:20.920
for people who prefer VS Code or Visual
Studio, we're integrated with that

00:12:20.920 --> 00:12:26.800
ecosystem as well. This is awesome.
So with ML Services, now you're able to

00:12:26.800 --> 00:12:32.590
bring in all the goodness of the tools
of your choice, all the goodness of the

00:12:32.590 --> 00:12:37.510
frameworks and platform of your choice
and still leverage the native

00:12:37.510 --> 00:12:43.870
integration and the work that Microsoft
has put in to extend those capabilities

00:12:43.870 --> 00:12:50.170
even further and integrate it natively
within the Azure ecosystem. I think that

00:12:50.170 --> 00:12:54.520
was a wonderful overview, Katherine. I
invite all our users to actually go

00:12:54.520 --> 00:12:57.410
ahead and try this out. You can go to Azure,

00:12:57.410 --> 00:13:02.720
sign up for a free trial, spin up an
HDInsight cluster, and choose ML Services,

00:13:02.720 --> 00:13:08.629
and with the 9.3 version you get both
R Services as well as Python services

00:13:08.629 --> 00:13:14.089
built natively inside the cluster
capabilities itself. Please let us know

00:13:14.089 --> 00:13:19.999
by tweeting @AzureHDInsight what
you think of this new release, and we'd

00:13:19.999 --> 00:13:24.649
be happy to take all the feedback that
you may have to share with us. Thank you

00:13:24.649 --> 00:13:29.937
Katherine.. Thanks..

00:13:29.937 --> 00:13:32.257
and thank you everybody!!
Thanks!

00:13:32.257 --> 00:13:38.749
 

