WEBVTT

00:00:00.000 --> 00:00:03.345
>> SQL Server 2019 introduces
big data clusters.

00:00:03.345 --> 00:00:04.860
It has Spark integrated.

00:00:04.860 --> 00:00:09.300
Shiv is here to tell us all about
that today on Data Exposed.

00:00:09.300 --> 00:00:20.220
[MUSIC]

00:00:20.220 --> 00:00:23.085
>> Hi and welcome to another
episode of Data Exposed.

00:00:23.085 --> 00:00:25.890
I'm your host, Jeroen and
today we have Shiv here with

00:00:25.890 --> 00:00:28.485
us to talk about Spark
on big data clusters.

00:00:28.485 --> 00:00:29.780
So welcome to the show, Shiv.

00:00:29.780 --> 00:00:30.600
>> Thank you, Jeroen.

00:00:30.600 --> 00:00:34.705
>> So spark, let's start at
the basics. What is Spark?

00:00:34.705 --> 00:00:38.790
>> Spark is a unified big
data processing engine

00:00:38.790 --> 00:00:41.640
that can work across
your analytic workloads.

00:00:41.640 --> 00:00:45.555
>> That doesn't sound very simple.

00:00:45.555 --> 00:00:47.340
>> So let's break it down.

00:00:47.340 --> 00:00:47.730
>> Okay.

00:00:47.730 --> 00:00:49.575
>> So first of all, let's talk about

00:00:49.575 --> 00:00:52.850
big data processing,
distributed big data.

00:00:52.850 --> 00:00:54.140
So last few years,

00:00:54.140 --> 00:00:55.460
what we've been seeing is a trend of

00:00:55.460 --> 00:00:57.800
enterprises collecting
lots and lots of data.

00:00:57.800 --> 00:00:58.265
>> Sure.

00:00:58.265 --> 00:01:00.170
>> From going from GBs of data,

00:01:00.170 --> 00:01:01.640
today we see an enterprise dealing

00:01:01.640 --> 00:01:04.100
with terabytes and petabytes of data.

00:01:04.100 --> 00:01:05.810
Now there's a problem there.

00:01:05.810 --> 00:01:09.005
The problem is that when you
have such a large scale of data,

00:01:09.005 --> 00:01:11.375
how do you really store that
data, first of all, right?

00:01:11.375 --> 00:01:11.960
>> Okay.

00:01:11.960 --> 00:01:13.945
>> So at the start,

00:01:13.945 --> 00:01:16.580
we started with a single
machine and scaling it

00:01:16.580 --> 00:01:20.300
vertically, having terabytes
of hard disk per machine.

00:01:20.300 --> 00:01:22.180
That vertical scaling was not really

00:01:22.180 --> 00:01:24.295
the answer for storing
such large-scale data.

00:01:24.295 --> 00:01:27.020
A more feasible,

00:01:27.020 --> 00:01:29.270
more resilient solution
was distributed data,

00:01:29.270 --> 00:01:30.830
where we don't keep scaling

00:01:30.830 --> 00:01:34.265
a single machine to take
on more and more data.

00:01:34.265 --> 00:01:37.055
What we do is we
distribute the data across

00:01:37.055 --> 00:01:40.340
n number of smaller machines and
that's how we store big data.

00:01:40.340 --> 00:01:41.870
>> So basically divide and conquer,

00:01:41.870 --> 00:01:43.050
right? We divide the work.

00:01:43.050 --> 00:01:43.620
>> Exactly.

00:01:43.620 --> 00:01:44.025
>> Okay.

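The divide-and-conquer idea described here can be sketched in plain Python. This is only an illustration of the concept, not Spark code; the round-robin partitioning scheme and the machine count are made up for the example:

```python
# Toy illustration of "divide and conquer" storage: a large dataset is
# split across several smaller "machines" (here, just Python lists),
# and each machine holds and processes only its own partition.

def partition(data, n_machines):
    """Split data round-robin across n_machines partitions."""
    machines = [[] for _ in range(n_machines)]
    for i, record in enumerate(data):
        machines[i % n_machines].append(record)
    return machines

records = list(range(100))          # pretend this is "big data"
machines = partition(records, 4)    # scale OUT across 4 machines, not UP

# Each machine works on its own slice; partial results are combined at the end.
partial_sums = [sum(m) for m in machines]
total = sum(partial_sums)
print(total)                        # same answer as sum(records)
```

No single "machine" ever sees the whole dataset, yet the combined result matches a single-machine computation; that is the property that makes scaling out work.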
00:01:44.025 --> 00:01:46.040
>> So now, we have taken care of

00:01:46.040 --> 00:01:48.935
the problem of storing the
data but that's not all.

00:01:48.935 --> 00:01:51.275
The main problem is
not storage of data,

00:01:51.275 --> 00:01:54.680
the main problem is that I need
to gain insights out of this data.

00:01:54.680 --> 00:01:56.105
>> When you process it,

00:01:56.105 --> 00:01:57.590
that's where the value is, right?

00:01:57.590 --> 00:01:59.180
>> Exactly. So processing of

00:01:59.180 --> 00:02:02.340
this distributed data
requires different engines.

00:02:02.340 --> 00:02:07.100
Spark is a big data compute
engine which can work across

00:02:07.100 --> 00:02:12.200
distributed data and compute
and do your workloads on that.

00:02:12.200 --> 00:02:16.715
It's not just a distributed
big data compute engine,

00:02:16.715 --> 00:02:19.010
it's also something
that abstracts away

00:02:19.010 --> 00:02:21.590
the details of distribution from you.

00:02:21.590 --> 00:02:22.880
As a user of Spark,

00:02:22.880 --> 00:02:25.190
you won't have to bother about
all the details of distribution,

00:02:25.190 --> 00:02:27.650
that's the whole beauty around Spark.

00:02:27.650 --> 00:02:31.130
>> So you just give it an
assignment and it'll figure out

00:02:31.130 --> 00:02:33.035
how to distribute the work and

00:02:33.035 --> 00:02:35.420
be done as quickly as
possible, hopefully.

00:02:35.420 --> 00:02:38.840
>> Right. When you are dealing
with such kind of data,

00:02:38.840 --> 00:02:42.320
you don't want to be tied down
to a particular language.

00:02:42.320 --> 00:02:45.470
You don't want to hear: here's a big
data compute engine, and now

00:02:45.470 --> 00:02:48.290
you have to learn the fancy language
I've invented to process with it.

00:02:48.290 --> 00:02:50.480
So Spark does something
very beautiful there.

00:02:50.480 --> 00:02:52.235
Spark gives you a choice of language.

00:02:52.235 --> 00:02:54.680
If you're a Python programmer,

00:02:54.680 --> 00:02:57.350
you can program in
Python, Scala, Java,

00:02:57.350 --> 00:03:01.190
R. R is very popular among
data scientists, and

00:03:01.190 --> 00:03:04.760
Spark gives you the option
to use R for your workloads.

00:03:04.760 --> 00:03:09.050
So that's what Spark is as a
distributed compute engine.

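The abstraction described here, where you chain transformations and the engine handles how work is split up, can be mimicked with a toy class in plain Python. The `map`/`filter`/`reduce` names echo Spark's RDD API, but this `ToyRDD` class is invented for illustration and is not real Spark:

```python
# Toy sketch of a distributed compute abstraction: the user chains
# transformations; the "engine" decides how data is split into partitions,
# so the user never deals with distribution details directly.
from functools import reduce

class ToyRDD:
    def __init__(self, data, partitions=4):
        size = max(1, len(data) // partitions)
        self.parts = [data[i:i + size] for i in range(0, len(data), size)]

    def map(self, f):
        return ToyRDD([f(x) for p in self.parts for x in p], len(self.parts))

    def filter(self, f):
        return ToyRDD([x for p in self.parts for x in p if f(x)], len(self.parts))

    def reduce(self, f):
        # Reduce each partition locally first, then combine the partial
        # results, mimicking how a distributed engine limits data movement.
        partials = [reduce(f, p) for p in self.parts if p]
        return reduce(f, partials)

rdd = ToyRDD(list(range(1, 11)))
result = (rdd.map(lambda x: x * x)
             .filter(lambda x: x % 2 == 0)
             .reduce(lambda a, b: a + b))
print(result)  # 4 + 16 + 36 + 64 + 100 = 220
```

The caller's code reads like ordinary single-machine logic; only the engine internals know the data lives in partitions, which is the "whole beauty" being described.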
00:03:09.050 --> 00:03:11.105
>> So basically you said

00:03:11.105 --> 00:03:13.850
we store it differently, and
that's something we did for

00:03:13.850 --> 00:03:16.920
big data because of
the problem of scaling

00:03:16.920 --> 00:03:18.450
up all the time, and

00:03:18.450 --> 00:03:20.415
now we're scaling
out as well. Right?

00:03:20.415 --> 00:03:20.790
>> Right.

00:03:20.790 --> 00:03:23.075
>> Then Spark works on
the distributed layer

00:03:23.075 --> 00:03:24.320
and gives you the flexibility of

00:03:24.320 --> 00:03:25.580
choosing the language of choice to.

00:03:25.580 --> 00:03:29.180
>> Yes. There was still a bit
[inaudible] in my sentence.

00:03:29.180 --> 00:03:31.520
I told you something about
a unified Compute Engine.

00:03:31.520 --> 00:03:34.120
So let's slice and dice that a bit.

00:03:34.120 --> 00:03:39.170
As we saw, enterprises started
getting more and more data.

00:03:39.170 --> 00:03:41.420
The traditional
workloads also moved from

00:03:41.420 --> 00:03:45.320
transaction workloads to workloads
which are analytic in nature.

00:03:45.320 --> 00:03:47.730
What do we mean by analytic workload?

00:03:47.730 --> 00:03:51.290
A workload where I'm analyzing
a lot of data to get

00:03:51.290 --> 00:03:53.180
insights out of it and

00:03:53.180 --> 00:03:56.645
then maybe doing machine
learning or deep learning.

00:03:56.645 --> 00:03:59.750
So traditionally, the focus

00:03:59.750 --> 00:04:03.095
shifted from transaction workloads
to analytic workloads.

00:04:03.095 --> 00:04:06.920
Analytic workloads span a variety
of tasks, from deep learning and

00:04:06.920 --> 00:04:10.120
machine learning to analytics
and streaming workloads.

00:04:10.120 --> 00:04:12.990
Now, for each of these
workloads, you don't want

00:04:12.990 --> 00:04:17.120
a separate compute engine to
really build your skills on.

00:04:17.120 --> 00:04:18.530
>> Ideally you would learn one.

00:04:18.530 --> 00:04:20.840
>> Exactly. That's what Spark does.

00:04:20.840 --> 00:04:24.110
Spark is a unified compute
engine that allows you to work

00:04:24.110 --> 00:04:27.690
across all these workloads with
the same set of principles.

00:04:27.690 --> 00:04:29.875
That's what Spark is about.

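The "unified" point, one set of principles across batch and streaming workloads, can be illustrated in plain Python: the same analysis function runs over a full batch and over streaming-style micro-batches. The threshold and chunk size are arbitrary, and this is not Spark's actual streaming API:

```python
# Toy illustration of a "unified" engine: the SAME processing logic
# serves a batch workload and a streaming-style workload that arrives
# in small chunks over time.

def analyze(records):
    """One shared piece of logic: count records above a threshold."""
    return sum(1 for r in records if r > 50)

batch = list(range(100))
batch_result = analyze(batch)       # whole dataset processed at once

# The same data arriving as ten micro-batches, as in a stream.
stream_chunks = [batch[i:i + 10] for i in range(0, 100, 10)]
stream_result = sum(analyze(chunk) for chunk in stream_chunks)

print(batch_result, stream_result)  # identical answers
```

Because the logic is written once, the skills (and code) built for batch analytics carry over directly to the streaming case, which is the value of a unified engine.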
00:04:29.875 --> 00:04:32.795
A distributed compute
engine that abstracts away

00:04:32.795 --> 00:04:35.750
the details of work
distribution from you.

00:04:35.750 --> 00:04:39.545
You don't have to bother
about distribution details.

00:04:39.545 --> 00:04:42.230
Second, a unified compute
engine and above all,

00:04:42.230 --> 00:04:45.245
which I find to be a very
powerful developer feature:

00:04:45.245 --> 00:04:47.990
it offers you the choice of language
where you could use Python,

00:04:47.990 --> 00:04:50.300
Scala, Java, or R,
whichever you prefer.

00:04:50.300 --> 00:04:52.265
So that's what Spark is.

00:04:52.265 --> 00:04:54.725
>> Cool. That's very impressive.

00:04:54.725 --> 00:04:57.785
I mean, Spark that's fine.

00:04:57.785 --> 00:05:00.370
So what are we doing with
Spark on SQL Server?

00:05:00.370 --> 00:05:01.760
Do we have anything?

00:05:01.760 --> 00:05:05.300
>> So Spark is basically Apache Spark.

00:05:05.300 --> 00:05:07.460
It's an open-source compute engine.

00:05:07.460 --> 00:05:09.230
What we have done in
big data clusters,

00:05:09.230 --> 00:05:12.680
we have brought Spark
together with SQL,

00:05:12.680 --> 00:05:16.995
as a single unified
offering: an

00:05:16.995 --> 00:05:21.070
end-to-end solution where you
don't just get the compute engine,

00:05:21.070 --> 00:05:23.150
you get a complete
end-to-end experience

00:05:23.150 --> 00:05:24.785
on using the compute engine.

00:05:24.785 --> 00:05:26.630
>> So that would mean
that since Spark

00:05:26.630 --> 00:05:28.400
has been integrated
in the big data cluster,

00:05:28.400 --> 00:05:31.625
I can query the data in

00:05:31.625 --> 00:05:32.885
the big data cluster

00:05:32.885 --> 00:05:35.060
with all the benefits from
Spark you just described.

00:05:35.060 --> 00:05:36.470
>> Exactly. Using the client

00:05:36.470 --> 00:05:38.270
tooling that you are
all familiar with.

00:05:38.270 --> 00:05:40.330
>> Wow, that's impressive.

00:05:40.330 --> 00:05:43.110
Wow. So how do I learn more?

00:05:43.110 --> 00:05:47.025
I mean, this sounds very new to
me, where do I find something?

00:05:47.025 --> 00:05:49.529
>> Please go to the big
data cluster documentation,

00:05:49.529 --> 00:05:52.910
the SQL Server big data clusters
documentation, and you will find a lot

00:05:52.910 --> 00:05:57.275
of content and articles about Spark.

00:05:57.275 --> 00:06:00.020
You will find big data clusters,

00:06:00.020 --> 00:06:02.855
Spark examples in the SQL
Server samples repository.

00:06:02.855 --> 00:06:04.660
That's where you get started.

00:06:04.660 --> 00:06:06.920
>> Cool. So we'll
make sure to include

00:06:06.920 --> 00:06:09.080
those links in the description,

00:06:09.080 --> 00:06:10.565
so you will find them there.

00:06:10.565 --> 00:06:13.265
Thanks Shiv for coming to the show.

00:06:13.265 --> 00:06:14.420
Thanks for explaining this.

00:06:14.420 --> 00:06:17.430
I learned, finally, what Spark is.

00:06:17.430 --> 00:06:19.205
Thanks for watching.

00:06:19.205 --> 00:06:21.650
Please like and subscribe and
I hope to see you next time.

00:06:21.650 --> 00:06:33.610
[MUSIC]

