WEBVTT

00:00:00.000 --> 00:00:10.700
[MUSIC].

00:00:10.700 --> 00:00:12.770
>> Hi. My name is Umachandar,

00:00:12.770 --> 00:00:15.510
I'm a Program Manager
in the SQL Server team.

00:00:15.510 --> 00:00:19.065
Today I'm going to show
you how you can query

00:00:19.065 --> 00:00:24.120
HDFS data from SQL Server
in a big data cluster.

00:00:24.120 --> 00:00:28.010
So what I have here
is Azure Data Studio.

00:00:28.010 --> 00:00:32.900
First, I'm going to connect to
the SQL Server Master instance,

00:00:32.900 --> 00:00:36.360
which is running inside
the big data cluster.

00:00:38.180 --> 00:00:41.360
Now from here, I can work with

00:00:41.360 --> 00:00:46.850
both MySQL databases
and the data in HDFS.

00:00:46.850 --> 00:00:48.640
So as you can see here,

00:00:48.640 --> 00:00:52.070
I have my regular
SQL Server databases

00:00:52.070 --> 00:00:57.875
and I can also go and
browse the data in HDFS.

00:00:57.875 --> 00:01:02.570
So let's first look at what
data I'm going to query.

00:01:02.570 --> 00:01:06.970
So I have a file here
called Web Click Streams.

00:01:06.970 --> 00:01:10.985
So this file contains
data about users,

00:01:10.985 --> 00:01:14.180
like what items they've purchased,

00:01:14.180 --> 00:01:19.160
how many times did they click on
a particular category, and so on.

00:01:19.160 --> 00:01:24.120
So let's look at how you
can query this data.

00:01:25.330 --> 00:01:28.205
So I have a notebook here.

00:01:28.205 --> 00:01:30.829
First, I'm going to set the context

00:01:30.829 --> 00:01:34.000
of the notebook to the database.

00:01:34.000 --> 00:01:38.535
Which is sales, as
shown in this example.

00:01:38.535 --> 00:01:43.354
So next, in order to
query the data in HDFS,

00:01:43.354 --> 00:01:48.880
we need to create something
called an External Data Source.

00:01:48.880 --> 00:01:52.790
So it's an object in the
database which tells

00:01:52.790 --> 00:01:57.140
SQL Server where the
HDFS data resides.

00:01:57.140 --> 00:01:59.940
So in this example,

00:01:59.940 --> 00:02:04.190
this string here which
starts with SQL HDFS,

00:02:04.190 --> 00:02:10.130
represents the HDFS data
in the big data cluster.

00:02:10.130 --> 00:02:14.450
So let's first go and
create this data source.

00:02:14.450 --> 00:02:17.870
So after you've created
the data source,

00:02:17.870 --> 00:02:21.905
you need to specify what file format

00:02:21.905 --> 00:02:27.410
the data source supports and
what you're going to query.

00:02:27.410 --> 00:02:29.375
So in this case,

00:02:29.375 --> 00:02:34.910
I am going to query data which
is residing in Parquet files.

00:02:34.910 --> 00:02:40.485
So the format of the parquet
file is specified here.

00:02:40.485 --> 00:02:45.960
So let's first go ahead and
create this file format object.

00:02:46.090 --> 00:02:52.640
So now the next step is to create
an external table in SQL Server.

00:02:52.640 --> 00:02:57.020
So this represents the columns
which are going to query

00:02:57.020 --> 00:03:02.165
from the files and also
the location of the files.

00:03:02.165 --> 00:03:03.710
So in this case,

00:03:03.710 --> 00:03:08.180
I'm going to query all the
data from this Directory.

00:03:08.180 --> 00:03:12.520
So let's quickly see
what that contains.

00:03:12.520 --> 00:03:16.670
So I will expand the directory here.

00:03:16.670 --> 00:03:20.780
As you can see, there are
three Parquet files here.

00:03:20.780 --> 00:03:23.660
So now we're going to see how

00:03:23.660 --> 00:03:26.510
you can query those
files from SQL Server.

00:03:26.510 --> 00:03:31.310
So I've now created an external
table object in SQL Server,

00:03:31.310 --> 00:03:37.920
which points to the Directory
which contains the Parquet files.

00:03:38.440 --> 00:03:43.680
First, let's look at
some rows from the file.

00:03:43.970 --> 00:03:48.675
So I run a select query
with the top hint,

00:03:48.675 --> 00:03:51.465
and I see 10 rows.

00:03:51.465 --> 00:03:57.470
This basically previews the rows
which are read from those files.

00:03:57.470 --> 00:04:01.775
So you can also run slightly
more complex queries

00:04:01.775 --> 00:04:06.920
like just do a count on the
number of rows in the table.

00:04:06.920 --> 00:04:13.380
In this case, I get a result
back and it's about six million.

00:04:14.210 --> 00:04:17.705
You can also run more
advanced queries.

00:04:17.705 --> 00:04:20.720
I have a query here which is

00:04:20.720 --> 00:04:25.505
trying to find the top 10
users with the most clicks.

00:04:25.505 --> 00:04:27.850
So let's run that query.

00:04:27.850 --> 00:04:29.535
As you can see here,

00:04:29.535 --> 00:04:31.325
now you are able to run

00:04:31.325 --> 00:04:35.375
regular SQL queries
against the data in HDFS.

00:04:35.375 --> 00:04:37.910
So you can basically
treat the data in

00:04:37.910 --> 00:04:42.665
HDFS as if they are local
tables in your database.

00:04:42.665 --> 00:04:46.900
In this case, I can use the
features of SQL Server,

00:04:46.900 --> 00:04:50.060
and the syntax, and the
functions available in

00:04:50.060 --> 00:04:53.990
SQL Server to write the query.

00:04:53.990 --> 00:04:56.060
So as you can see,

00:04:56.060 --> 00:05:03.200
now I can basically get the
results from the files in HDFS.

00:05:03.200 --> 00:05:07.360
You can also perform
other joint operations.

00:05:07.360 --> 00:05:10.275
For example, I want to

00:05:10.275 --> 00:05:13.815
combine the ClickStream data
with some Inventory data,

00:05:13.815 --> 00:05:17.000
which is also residing
in the Parquet files

00:05:17.000 --> 00:05:21.800
in HDFS which is actually shown here.

00:05:21.800 --> 00:05:24.140
So let's run this query.

00:05:24.140 --> 00:05:28.400
In this case, what we're doing
in SQL server is actually

00:05:28.400 --> 00:05:34.170
joining two different datasets
which are residing in HDFS.

00:05:34.270 --> 00:05:40.790
You are able to specify that in
a SQL query using SQL syntax,

00:05:40.790 --> 00:05:44.275
and you can get the
results from that query.

00:05:44.275 --> 00:05:48.705
In this example, we are
actually going to SQL Server,

00:05:48.705 --> 00:05:51.180
reading the files from HDFS,

00:05:51.180 --> 00:05:53.845
passing the data in the files,

00:05:53.845 --> 00:05:56.195
performing the join operations

00:05:56.195 --> 00:06:00.425
and doing additional
aggregations in this case,

00:06:00.425 --> 00:06:03.785
like counting the number of clicks,

00:06:03.785 --> 00:06:08.790
adding the inventory
values, and so on.

00:06:09.500 --> 00:06:11.640
Last but not the least,

00:06:11.640 --> 00:06:14.250
you can also query

00:06:14.250 --> 00:06:17.480
this HDFS data and join it

00:06:17.480 --> 00:06:20.405
with other data sitting
in your database.

00:06:20.405 --> 00:06:22.175
So in this example,

00:06:22.175 --> 00:06:25.415
you can actually query
the Parquet files

00:06:25.415 --> 00:06:30.120
and join with table in SQL Server.

00:06:36.560 --> 00:06:39.270
As you can see now,

00:06:39.270 --> 00:06:44.170
the integration in SQL Server
Big Data Cluster makes it very

00:06:44.170 --> 00:06:49.215
easy for you to join the data
from HDFS to other data,

00:06:49.215 --> 00:06:52.214
either in HDFS or other tables,

00:06:52.214 --> 00:06:55.180
and get the information very easily.

00:06:55.180 --> 00:06:58.840
So now you can actually build
applications which query data

00:06:58.840 --> 00:07:02.790
from different data sources
very easily. Thank you.

00:07:02.790 --> 00:07:17.500
[MUSIC]

