WEBVTT

00:00:00.000 --> 00:00:03.070
>> SQL Server 2019 big
data clusters provide

00:00:03.070 --> 00:00:04.950
Data Pools to optimize

00:00:04.950 --> 00:00:07.905
query performance when reaching
out to multiple sources.

00:00:07.905 --> 00:00:10.080
Umachandar is here with
us to tell us all

00:00:10.080 --> 00:00:12.450
about it today on Data Exposed.

00:00:12.450 --> 00:00:23.550
[MUSIC].

00:00:23.550 --> 00:00:26.250
>> Hi and welcome to another
episode of Data Exposed.

00:00:26.250 --> 00:00:29.100
I'm your host Jeroen and
today we have Umachandar with us

00:00:29.100 --> 00:00:32.085
to talk about Data Pools
in SQL big data clusters.

00:00:32.085 --> 00:00:33.675
Hi Umachandar, welcome to the show.

00:00:33.675 --> 00:00:35.460
>> Hi, I'm happy to be here.

00:00:35.460 --> 00:00:38.160
>> So Data Pools in
big data clusters.

00:00:38.160 --> 00:00:40.505
What is it? What do I need to know?

00:00:40.505 --> 00:00:42.440
>> Sure yeah, let me quickly start

00:00:42.440 --> 00:00:45.110
by showing what a
Big Data Cluster is.

00:00:45.110 --> 00:00:48.090
So this slide shows the deployment of

00:00:48.090 --> 00:00:52.110
SQL Server 2019 Big Data
Cluster, and today we're

00:00:52.110 --> 00:00:56.360
going to talk about the Data
Pools here which are basically

00:00:56.360 --> 00:00:59.300
two SQL server instances which can be

00:00:59.300 --> 00:01:02.960
used to offload some
query [inaudible].

00:01:02.960 --> 00:01:04.810
So let's see what a Data Pool is.

00:01:04.810 --> 00:01:08.390
So it's basically SQL Server
storage and compute which is

00:01:08.390 --> 00:01:13.550
provided to write the queries
and store data in the Data Pool.

00:01:13.550 --> 00:01:16.310
It can be manipulated by running

00:01:16.310 --> 00:01:20.900
some external table operations
on it and you populate data

00:01:20.900 --> 00:01:24.440
into the Data Pool and
later you can actually

00:01:24.440 --> 00:01:28.970
offload some of your analytic
queries over to this Data Pool.

00:01:28.970 --> 00:01:32.400
So that is the primary
purpose of the Data Pool.

00:01:32.400 --> 00:01:37.140
So now let's maybe quickly take
a look at some of the scenarios.

00:01:37.140 --> 00:01:39.980
So the common scenario is you're

00:01:39.980 --> 00:01:42.710
getting data from
different data sources,

00:01:42.710 --> 00:01:44.570
some of them are fast,

00:01:44.570 --> 00:01:47.345
others are slower in nature.

00:01:47.345 --> 00:01:51.050
So now you want to basically optimize

00:01:51.050 --> 00:01:55.535
this query so you could
actually take that query

00:01:55.535 --> 00:02:00.350
and store the results in the
Data Pool and this allows you to

00:02:00.350 --> 00:02:02.900
basically use that like

00:02:02.900 --> 00:02:07.300
a staging area where you
can run all your analytics.

00:02:07.300 --> 00:02:10.830
The other example is if you're
doing machine learning,

00:02:10.830 --> 00:02:12.830
there are a lot of
scenarios where you

00:02:12.830 --> 00:02:14.960
would generate hundreds and

00:02:14.960 --> 00:02:18.290
thousands of features and
you want to basically store

00:02:18.290 --> 00:02:23.040
the dataset in a place where
you can do repeated analysis.

00:02:23.040 --> 00:02:26.145
That's another place where
the Data Pool helps.

00:02:26.145 --> 00:02:29.540
>> So in that case it's not
only for query optimization

00:02:29.540 --> 00:02:31.190
but also for making sure
you don't have to go

00:02:31.190 --> 00:02:32.975
back to the data source
all the time, right?

00:02:32.975 --> 00:02:33.230
>> Yes.

00:02:33.230 --> 00:02:34.385
>> If you want to reuse the data.

00:02:34.385 --> 00:02:34.625
>> Yeah.

00:02:34.625 --> 00:02:35.660
>> Cool, that sounds very cool.

00:02:35.660 --> 00:02:37.355
So can you actually
show us how this works?

00:02:37.355 --> 00:02:40.550
>> Yeah, let's quickly go
to Azure Data Studio here.

00:02:40.550 --> 00:02:42.230
So what I have here is

00:02:42.230 --> 00:02:45.530
a connection to the SQL
Server big data cluster.

00:02:45.530 --> 00:02:48.890
I am connected to the SQL
Server master instance.

00:02:48.890 --> 00:02:51.905
So the way to create a Data Pool

00:02:51.905 --> 00:02:55.625
is first by creating
this data source object.

00:02:55.625 --> 00:02:58.100
So here as this name indicates

00:02:58.100 --> 00:03:01.325
this basically says what
type of data source it is.

00:03:01.325 --> 00:03:04.400
Then with this I'm going to
connect to a Data Pool.

00:03:04.400 --> 00:03:05.530
>> Okay.

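[A sketch of what that data source statement might look like, based on the documented big data cluster syntax; SqlDataPool and the controller location are the standard names in that syntax:]

```sql
-- Run from the SQL Server master instance of the big data cluster.
-- SqlDataPool points at the cluster's Data Pool via the controller service.
IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'SqlDataPool')
    CREATE EXTERNAL DATA SOURCE SqlDataPool
    WITH (LOCATION = 'sqldatapool://controller-svc/default');
```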
00:03:05.530 --> 00:03:08.780
>> That's how you
create it. After that

00:03:08.780 --> 00:03:11.360
you define the table
which you want to

00:03:11.360 --> 00:03:13.940
store in the Data
Pool and that's done

00:03:13.940 --> 00:03:16.670
by running this external
table statement.

00:03:16.670 --> 00:03:17.845
>> Okay.

00:03:17.845 --> 00:03:21.090
>> This looks like your regular
create table statement,

00:03:21.090 --> 00:03:24.110
there are columns which
you need to define and in

00:03:24.110 --> 00:03:27.725
addition, since there are many SQL
Server instances on the backend.

00:03:27.725 --> 00:03:28.115
>> Right.

00:03:28.115 --> 00:03:30.740
>> You just need to say how you
want to distribute the data.

00:03:30.740 --> 00:03:32.630
>> So with this you can tell, okay,

00:03:32.630 --> 00:03:36.155
so we got an even or approximately
even distribution across.

00:03:36.155 --> 00:03:36.515
>> Yeah.

00:03:36.515 --> 00:03:37.400
>> Across the service.

00:03:37.400 --> 00:03:37.685
>> Yes.

00:03:37.685 --> 00:03:38.000
>> Okay.

00:03:38.000 --> 00:03:41.330
So what the round robin option
does is basically distribute

00:03:41.330 --> 00:03:44.825
the data across however many
Data Pool instances you have.

00:03:44.825 --> 00:03:45.865
>> Okay.

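[A sketch of the external table statement described here, following the documented syntax; the table and column names are hypothetical:]

```sql
-- Looks like a regular CREATE TABLE, plus a data source and a
-- distribution policy. ROUND_ROBIN spreads rows roughly evenly
-- across the Data Pool's SQL Server instances.
CREATE EXTERNAL TABLE web_clickstream_clicks_data_pool
(
    wcs_user_sk   BIGINT,
    i_category_id BIGINT,
    clicks        BIGINT
)
WITH
(
    DATA_SOURCE  = SqlDataPool,
    DISTRIBUTION = ROUND_ROBIN
);
```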
00:03:45.865 --> 00:03:49.370
>> The next thing you can
do is like query the data.

00:03:49.370 --> 00:03:50.780
So let's first look at some of

00:03:50.780 --> 00:03:54.155
the sample data which in
this example I have in

00:03:54.155 --> 00:03:57.680
an HDFS directory, and basically

00:03:57.680 --> 00:04:00.155
I want to take this
clickstream data,

00:04:00.155 --> 00:04:04.490
do some querying and populate
it into the Data Pool.

00:04:04.490 --> 00:04:08.915
The way you would do it is by
running just an insert statement.

00:04:08.915 --> 00:04:13.490
So the source of the insert
statement is just a SQL query,

00:04:13.490 --> 00:04:18.140
and you can simply run the
query and populate the data.

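[A sketch of the INSERT described here; the source of the insert is just a SQL query over the HDFS data. The names are hypothetical, assuming an external table already defined over the clickstream files in HDFS:]

```sql
-- Populate the Data Pool table from a query over the HDFS data.
INSERT INTO web_clickstream_clicks_data_pool
SELECT wcs_user_sk, i_category_id, COUNT_BIG(*) AS clicks
FROM   web_clickstreams_hdfs        -- external table over the HDFS files
GROUP BY wcs_user_sk, i_category_id;
```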
00:04:18.140 --> 00:04:21.320
I won't run through the
inserts but let's quickly

00:04:21.320 --> 00:04:25.205
see the data which is already
populated in the Data Pool.

00:04:25.205 --> 00:04:28.445
So this query should return
about 2,000 rows, right?

00:04:28.445 --> 00:04:32.930
So now just like any other
table you can actually join

00:04:32.930 --> 00:04:37.460
this table with other tables
in your SQL Server and get

00:04:37.460 --> 00:04:39.680
the results and transparently

00:04:39.680 --> 00:04:42.200
this SQL Server instance
is going to talk to

00:04:42.200 --> 00:04:47.555
the Data Pool instances to run
the queries and get the results.

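[A sketch of the kind of queries described here; the master instance transparently fans the work out to the Data Pool instances. The joined table is hypothetical:]

```sql
-- Query the Data Pool table like any other table.
SELECT COUNT(*) FROM web_clickstream_clicks_data_pool;

-- Join it with a regular SQL Server table on the master instance.
SELECT TOP 10 i.i_category, SUM(c.clicks) AS total_clicks
FROM   web_clickstream_clicks_data_pool AS c
JOIN   dbo.item AS i
       ON c.i_category_id = i.i_category_id
GROUP BY i.i_category
ORDER BY total_clicks DESC;
```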
00:04:47.555 --> 00:04:50.330
So that's, at a high level,
what a Data Pool is.

00:04:50.330 --> 00:04:51.530
>> Cool awesome.

00:04:51.530 --> 00:04:54.390
That's very impressive.
Thanks a lot man.

00:04:54.390 --> 00:04:55.530
>> Cool yeah, thanks.

00:04:55.530 --> 00:04:58.050
>> Okay. So thanks for watching.

00:04:58.050 --> 00:04:59.590
Please like and subscribe,

00:04:59.590 --> 00:05:02.150
comment on the video
if you want us to

00:05:02.150 --> 00:05:06.360
record something else and hope
to see you next time. Thanks.

00:05:06.360 --> 00:05:20.860
[MUSIC]

