WEBVTT

00:00:00.000 --> 00:00:10.500
[MUSIC].

00:00:10.500 --> 00:00:11.910
>> Hi and welcome back.

00:00:11.910 --> 00:00:14.970
My name is JRJ, and I'm here
to tell you about one of

00:00:14.970 --> 00:00:18.915
the most eagerly anticipated
features in SQL Server 2019,

00:00:18.915 --> 00:00:21.285
and that's data virtualization.

00:00:21.285 --> 00:00:23.175
What is data virtualization,

00:00:23.175 --> 00:00:25.440
and why is it so eagerly anticipated?

00:00:25.440 --> 00:00:27.510
Well, simply put,

00:00:27.510 --> 00:00:29.510
data virtualization enables you to

00:00:29.510 --> 00:00:31.670
bring all of your data together at

00:00:31.670 --> 00:00:35.780
query time rather than having to
build complex ETL pipelines in

00:00:35.780 --> 00:00:40.535
order to be able to unify
the data in a single query.

00:00:40.535 --> 00:00:44.150
So what I'm going to
do is rather than go

00:00:44.150 --> 00:00:47.540
through the details of data
virtualization at a conceptual level,

00:00:47.540 --> 00:00:49.730
I'm just going to show you
the differences between

00:00:49.730 --> 00:00:52.430
a local query and a
virtualized query,

00:00:52.430 --> 00:00:55.085
both partially and fully virtualized.

00:00:55.085 --> 00:00:56.280
So to do that,

00:00:56.280 --> 00:00:58.010
what we're going to do is
we're going to switch over

00:00:58.010 --> 00:01:00.270
now to Azure Data Studio,

00:01:00.270 --> 00:01:03.035
and you can see here I
have a workbook open,

00:01:03.035 --> 00:01:08.990
and let's go in and
start evaluating that.

00:01:08.990 --> 00:01:13.625
So here you can see I
have a very simple query.

00:01:13.625 --> 00:01:17.030
I have two local
tables in my database,

00:01:17.030 --> 00:01:19.160
and if I run that query,

00:01:19.160 --> 00:01:23.405
you could imagine the result
comes back nice and quickly.

00:01:23.405 --> 00:01:25.190
I have about a second,

00:01:25.190 --> 00:01:28.045
and I get my dataset
back in the notebook.

00:01:28.045 --> 00:01:31.630
However, what if all of that
data wasn't in the SQL Server?

00:01:31.630 --> 00:01:36.200
What if that data was actually
available in remote SQL Servers,

00:01:36.200 --> 00:01:40.145
and we wanted to access that
data all at the same time?

00:01:40.145 --> 00:01:43.700
Well, you can use data virtualization
to solve that problem.

00:01:43.700 --> 00:01:45.050
But in order to do it,

00:01:45.050 --> 00:01:47.030
we need to set up some metadata.

00:01:47.030 --> 00:01:50.510
So the first thing we need to
do is to create a master key,

00:01:50.510 --> 00:01:53.720
and a master key is a key inside

00:01:53.720 --> 00:01:55.910
the database that we use to protect

00:01:55.910 --> 00:01:58.660
all of the other metadata inside it.

00:01:58.660 --> 00:02:03.380
You can see from the metadata
here which algorithm we're using,

00:02:03.380 --> 00:02:06.110
when it's created, and
interesting things like that.

00:02:06.110 --> 00:02:10.745
Now I need to enable the PolyBase
feature in order to be able to

00:02:10.745 --> 00:02:16.310
access remote sources
and remote databases,

00:02:16.310 --> 00:02:19.220
and create a database credential to

00:02:19.220 --> 00:02:23.495
be able to authenticate
against those remote sources,

00:02:23.495 --> 00:02:28.835
and you can see here that I've
created a few in the past,

00:02:28.835 --> 00:02:30.200
as a couple of Oracle,

00:02:30.200 --> 00:02:32.225
and a couple of SQL
ones in there as well.

00:02:32.225 --> 00:02:36.680
But today, we're going to go
against a SQL Data Source,

00:02:36.680 --> 00:02:39.650
and you can see here that
in order to do that,

00:02:39.650 --> 00:02:41.730
I need to create an
external data source.

00:02:41.730 --> 00:02:45.390
Here, I specify my
location, in this case,

00:02:45.390 --> 00:02:49.160
a SQL Server address
somewhere in Azure,

00:02:49.160 --> 00:02:51.874
and I pass in that credential

00:02:51.874 --> 00:02:54.425
to enable that authentication
to take place.

00:02:54.425 --> 00:02:56.590
So let's go ahead and create that,

00:02:56.590 --> 00:03:00.880
and you can see again, there's
the metadata inside the database.

00:03:00.880 --> 00:03:03.040
Now, as a general rule,

00:03:03.040 --> 00:03:06.290
I like to keep the external
tables which define

00:03:06.290 --> 00:03:08.465
those external datasource objects

00:03:08.465 --> 00:03:11.210
separate from my internal tables,

00:03:11.210 --> 00:03:12.890
and I do that using a schema.

00:03:12.890 --> 00:03:16.500
So let's go ahead and
create an external schema,

00:03:17.660 --> 00:03:23.350
and now we can come down here and
create our first external table.

00:03:23.350 --> 00:03:25.730
The first external table
we're going to create is

00:03:25.730 --> 00:03:28.240
web click streams which
is the first table,

00:03:28.240 --> 00:03:31.315
and in this case it's
more like a fact table,

00:03:31.315 --> 00:03:34.755
and we're going to store that.

00:03:34.755 --> 00:03:36.490
So in that external database,

00:03:36.490 --> 00:03:38.375
we have exactly the same database.

00:03:38.375 --> 00:03:44.200
We're just using it again to
illustrate this scenario.

00:03:44.200 --> 00:03:50.515
Now, we can get onto the process
of virtualizing a clickstream,

00:03:50.515 --> 00:03:52.900
the web clickstreams table.

00:03:52.900 --> 00:03:56.500
Here you can see I have the
same table web clickstreams,

00:03:56.500 --> 00:03:58.660
but now I'm using the EXT schema.

00:03:58.660 --> 00:04:01.060
So I'm accessing the external table,

00:04:01.060 --> 00:04:02.440
but to all intents and purposes,

00:04:02.440 --> 00:04:05.630
the rest of the query
is exactly the same.

00:04:05.630 --> 00:04:08.225
If I run that query now,

00:04:08.225 --> 00:04:10.120
let's say it takes a
little bit longer because

00:04:10.120 --> 00:04:12.100
we're going to go and
get that data remotely,

00:04:12.100 --> 00:04:14.905
and you could say it's
about 3.5 seconds.

00:04:14.905 --> 00:04:17.260
But we can see that we've got

00:04:17.260 --> 00:04:20.785
that data here and
it's exactly the same.

00:04:20.785 --> 00:04:23.710
So everything underneath the hood

00:04:23.710 --> 00:04:27.065
is completely transparent
to me as a user.

00:04:27.065 --> 00:04:29.920
Now what if I actually go ahead and

00:04:29.920 --> 00:04:33.250
virtualize the second
external table in this query?

00:04:33.250 --> 00:04:35.680
You remember that the first
one was web clipstreams,

00:04:35.680 --> 00:04:38.905
that the second one
is the item table.

00:04:38.905 --> 00:04:41.090
So let's go ahead and do that,

00:04:41.090 --> 00:04:45.650
and now I have both
tables virtualized.

00:04:47.290 --> 00:04:52.290
So now, what happens when
I run this last query?

00:04:52.290 --> 00:04:57.565
This last query is going to
run exactly the same query,

00:04:57.565 --> 00:05:01.670
but both external
tables are virtualized,

00:05:01.940 --> 00:05:05.275
and you can see that actually
the query is almost as

00:05:05.275 --> 00:05:09.375
fast as the first
version, the local query.

00:05:09.375 --> 00:05:12.530
Now why is that? Why do we get
this difference in performance?

00:05:12.530 --> 00:05:14.780
Well, the reason is
that if you look at

00:05:14.780 --> 00:05:17.000
the SQL Servers
intelligent enough using

00:05:17.000 --> 00:05:20.600
its cost-based optimizer
to understand that

00:05:20.600 --> 00:05:24.725
both the tables are external and
they come from the same source,

00:05:24.725 --> 00:05:28.400
and that it can see that
it can push this join and

00:05:28.400 --> 00:05:32.030
the aggregation down
against that remote source.

00:05:32.030 --> 00:05:34.190
So we're leveraging the compute of

00:05:34.190 --> 00:05:37.445
that remote source to resolve
this query in real time.

00:05:37.445 --> 00:05:41.030
But that gives you a quick overview
of the capabilities that you

00:05:41.030 --> 00:05:44.750
get out of using data
virtualization technology

00:05:44.750 --> 00:05:48.470
and how you can actually
transparently present that data

00:05:48.470 --> 00:05:50.390
back to an end-user without having to

00:05:50.390 --> 00:05:52.520
make physical copies of that data,

00:05:52.520 --> 00:05:54.410
without having to move it or build

00:05:54.410 --> 00:05:56.420
a complex ETL pipeline in

00:05:56.420 --> 00:05:58.910
order to be able to
query data in real time.

00:05:58.910 --> 00:06:00.510
Thanks very much for joining,

00:06:00.510 --> 00:06:02.960
and I look forward to catching
up with you again soon.

00:06:02.960 --> 00:06:17.560
[MUSIC]

