WEBVTT

00:00:00.000 --> 00:00:09.223
[MUSIC]

00:00:13.237 --> 00:00:15.891
Hi, everybody, welcome to another
exciting episode of Data Exposed.

00:00:15.891 --> 00:00:16.920
I'm your host Scott Klein and

00:00:16.920 --> 00:00:19.060
back with me by popular
demand [LAUGH] Michael.

00:00:19.060 --> 00:00:19.780
>> [LAUGH]
>> Michael Rys.

00:00:19.780 --> 00:00:20.370
>> Thanks, Scott.

00:00:20.370 --> 00:00:22.000
>> Michael, how are you doing?

00:00:22.000 --> 00:00:23.570
>> Hi guys, yeah so

00:00:23.570 --> 00:00:28.380
I'm back after over half a year
of not talking about U-SQL.

00:00:28.380 --> 00:00:31.100
I'm back to give you an update
of what we have added and

00:00:31.100 --> 00:00:31.660
what we have changed.

00:00:31.660 --> 00:00:34.540
>> Yeah, and we're looking forward
to it cuz U-SQL's a popular topic.

00:00:34.540 --> 00:00:36.400
So before we get started,
we talk about it,

00:00:36.400 --> 00:00:38.490
introduce yourself real quick for
those who may not know you.

00:00:38.490 --> 00:00:39.830
>> Yeah, so my name is Mike Rys,

00:00:39.830 --> 00:00:43.170
I'm a program manager in
the Microsoft Big Data Team.

00:00:43.170 --> 00:00:47.510
I am responsible for U-SQL and
Azure Data Lake Analytics.

00:00:47.510 --> 00:00:48.240
>> Awesome.

00:00:48.240 --> 00:00:50.082
Okay, so you're here to talk about,

00:00:50.082 --> 00:00:54.294
cuz yeah, it's been probably four to
six months since we had you in last.

00:00:54.294 --> 00:00:54.960
>> Yep.

00:00:54.960 --> 00:00:59.070
>> So what's new since
last time we had you in?

00:00:59.070 --> 00:01:02.790
>> Well, so we made quite a bit of
progress in adding some features

00:01:02.790 --> 00:01:04.163
that people have asked for.

00:01:04.163 --> 00:01:04.737
>> Yep.
>> And

00:01:04.737 --> 00:01:09.060
a few features that we thought
might be interesting to add.

00:01:09.060 --> 00:01:11.610
Also added some performance
improvements etc., and

00:01:11.610 --> 00:01:15.380
I'm kind of here to quickly
touch on a few of those.

00:01:15.380 --> 00:01:17.014
>> All right.
>> And show you some code and

00:01:17.014 --> 00:01:17.910
samples, etc.

00:01:17.910 --> 00:01:18.750
>> I always like looking at code.

00:01:18.750 --> 00:01:20.160
All right, let's step right in.

00:01:20.160 --> 00:01:20.890
So what do we got?

00:01:20.890 --> 00:01:25.440
>> Okay, so first in terms of
some new features that we have.

00:01:25.440 --> 00:01:27.160
We added security.

00:01:27.160 --> 00:01:27.690
>> Okay.
>> And

00:01:27.690 --> 00:01:32.178
basically ACLing at the folder and
fidle level, file and

00:01:32.178 --> 00:01:34.925
folder level, sorry about that.

00:01:34.925 --> 00:01:36.500
>> [LAUGH]
>> On the store, and

00:01:36.500 --> 00:01:39.900
also the ability to set
permissions on database level.

00:01:39.900 --> 00:01:42.210
So let me quickly show
you that in the portal,

00:01:42.210 --> 00:01:44.670
which is the only way that
you can currently do that.

00:01:44.670 --> 00:01:45.200
>> Okay.
>> So

00:01:45.200 --> 00:01:46.955
here we have our Azure portal and

00:01:46.955 --> 00:01:49.170
I'm opening my
Data Lake account here.

00:01:50.370 --> 00:01:54.530
And then in the Data Lake account
by using the data explorer

00:01:54.530 --> 00:01:56.340
I can see my databases here.

00:01:56.340 --> 00:01:59.010
So if I zoom in quickly,

00:01:59.010 --> 00:02:00.910
you notice here my catalog
>> Yep.

00:02:00.910 --> 00:02:02.930
>> has all these databases.

00:02:02.930 --> 00:02:07.030
So let's now look at
the JSONBlock database here.

00:02:07.030 --> 00:02:13.160
And now you notice up here,
there is a Manage Access property.

00:02:13.160 --> 00:02:17.510
That now allows me to
basically give permissions for

00:02:17.510 --> 00:02:21.340
people to use the database and
do things with it.

00:02:21.340 --> 00:02:26.690
So you see here, at this point,
it's only me that has access.

00:02:26.690 --> 00:02:29.975
So you have an owner, which is the
person that creates the database.

00:02:29.975 --> 00:02:30.600
>> Mm-hm.

00:02:30.600 --> 00:02:33.780
>> You can set permissions
on a per group basis.

00:02:33.780 --> 00:02:35.460
And for everybody else.

00:02:35.460 --> 00:02:39.244
So for example, you could have
a group of people that can for

00:02:39.244 --> 00:02:41.221
example create objects in it.

00:02:41.221 --> 00:02:45.634
Like register assembly, create
a table, insert data, and others can

00:02:45.634 --> 00:02:49.460
only read from the data
inside the table.

00:02:49.460 --> 00:02:52.737
>> Is this like the other Azure
data store and things like that?

00:02:52.737 --> 00:02:53.727
Does this come off

00:02:53.727 --> 00:02:56.170
the Azure Active Directory
>> Yes.

00:02:56.170 --> 00:02:56.700
>> permissions?

00:02:56.700 --> 00:02:59.280
>> So the security principals
are Azure Active Directory.

00:02:59.280 --> 00:03:02.630
So it's either some login or

00:03:02.630 --> 00:03:06.160
a security group that you can use
as the security principal here.

00:03:06.160 --> 00:03:06.750
>> Okay.
>> And

00:03:06.750 --> 00:03:09.960
the permissions
are database specific.

00:03:09.960 --> 00:03:14.640
So reading, or writing, or
basically enumerating, etc.

00:03:14.640 --> 00:03:16.780
So you see over here kind of
the permissions that they have.

00:03:16.780 --> 00:03:17.340
>> Okay.

00:03:17.340 --> 00:03:18.440
>> On the database level.

00:03:18.440 --> 00:03:19.630
>> Nice, okay.

00:03:19.630 --> 00:03:22.210
And how long has
that been available?

00:03:22.210 --> 00:03:28.280
>> So that has been available for
about a month now.

00:03:28.280 --> 00:03:29.608
>> Okay.
>> So now the next thing is I

00:03:29.608 --> 00:03:31.780
would like to talk about
performance improvements.

00:03:31.780 --> 00:03:35.660
Now these performance improvements
are not yet in the product.

00:03:35.660 --> 00:03:36.360
>> Okay.
>> But

00:03:36.360 --> 00:03:39.315
should be probably by the time
this video goes live.

00:03:39.315 --> 00:03:41.410
>> [LAUGH]
>> Or shortly thereafter, hopefully.

00:03:41.410 --> 00:03:42.020
>> Hopefully.

00:03:42.020 --> 00:03:44.118
>> Not like a month later.

00:03:44.118 --> 00:03:47.380
And what it is, is it's basically we

00:03:47.380 --> 00:03:50.585
have a lot of people that love
to use our file set feature.

00:03:50.585 --> 00:03:53.885
The file set feature basically means
you give a path that contains

00:03:53.885 --> 00:03:58.125
wildcards in the path to pick up
all the files that match the path.

00:03:58.125 --> 00:03:59.145
>> Match the path.

00:03:59.145 --> 00:04:00.755
>> But
we have some performance issues.

00:04:00.755 --> 00:04:03.055
And let me quickly show
you an example here.

00:04:04.175 --> 00:04:09.620
So if I go over here
I have Visual Studio.

00:04:09.620 --> 00:04:11.370
I have a simple script here.

00:04:11.370 --> 00:04:14.020
>> Yep.
>> Now, this script is going over

00:04:14.020 --> 00:04:16.230
some of our telemetry data.

00:04:16.230 --> 00:04:16.949
>> Okay.
>> And

00:04:16.949 --> 00:04:19.639
if you look at what
you have here is,

00:04:19.639 --> 00:04:24.770
it basically has a pretty long path
name with a lot of patterns in it.

00:04:24.770 --> 00:04:29.160
So it basically parameterizes the
cluster in which we are looking at.

00:04:29.160 --> 00:04:30.840
The date, the year, month, day.

00:04:30.840 --> 00:04:31.610
>> Yeah, and the name.

00:04:31.610 --> 00:04:33.350
>> And the name of the file.

00:04:33.350 --> 00:04:36.929
And what I do here is I just
basically extract the data out of

00:04:38.200 --> 00:04:40.950
the lines, so
I'm kind of pretty lazy here.

00:04:40.950 --> 00:04:42.440
>> Yeah.
>> And I promote obviously all

00:04:42.440 --> 00:04:45.395
these, what we call virtual columns,
out of the pattern.
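
NOTE
Editor's sketch, in Python rather than U-SQL: a rough model of how a file set
pattern could select matching files and promote the wildcard parts to virtual
columns. The path layout, the {cluster}/{year}/{month}/{day}/{name} placeholders
and both function names are assumptions made for illustration; they are not the
script shown in the video and not how the U-SQL compiler is implemented.
```python
import re
def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Turn '/logs/{cluster}/{name}.log' into a regex with one named group per placeholder."""
    # re.split with a capture group alternates: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)  # literal path text
        else:
            regex += f"(?P<{part}>[^/]+)"  # placeholder: one path segment (or part of one)
    return re.compile(regex + r"\Z")
def match_file_set(pattern: str, paths: list) -> list:
    """Return one dict of virtual-column values for every path that matches the pattern."""
    compiled = pattern_to_regex(pattern)
    rows = []
    for path in paths:
        found = compiled.match(path)
        if found:
            rows.append({"path": path, **found.groupdict()})
    return rows
```
Calling match_file_set with a pattern and a list of candidate paths returns only
the matching paths, each with its cluster, date parts and name filled in.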

00:04:45.395 --> 00:04:45.950
>> Right.
>> And

00:04:45.950 --> 00:04:49.270
then I do some simple
aggregations down there, and

00:04:49.270 --> 00:04:50.660
output them into a file.

00:04:50.660 --> 00:04:53.170
So I don't have to
write out the data.

00:04:53.170 --> 00:04:58.600
Now I'm not going to run that, but
if you run this against the existing

00:04:58.600 --> 00:05:02.850
run time today,
you will notice a few things here.

00:05:02.850 --> 00:05:06.110
Let me quickly zoom in so
you can see the numbers here.

00:05:06.110 --> 00:05:13.000
So first, we have about 2,150
files that we are operating on.

00:05:13.000 --> 00:05:15.010
You see over there
on the right hand.

00:05:15.010 --> 00:05:16.098
>> 2,150.
>> Let me,

00:05:16.098 --> 00:05:18.700
maybe zoom in into that part here.

00:05:18.700 --> 00:05:20.310
So that's basically the input.

00:05:20.310 --> 00:05:23.290
So we operate on over 2,000 files.

00:05:23.290 --> 00:05:23.940
>> Okay.

00:05:23.940 --> 00:05:27.825
>> Now unfortunately,
compilation takes ten minutes.

00:05:27.825 --> 00:05:28.600
>> [LAUGH]
>> At the moment.

00:05:28.600 --> 00:05:31.230
And actually,
if you reach about 5,000 files or

00:05:31.230 --> 00:05:33.190
so we will time you out.

00:05:33.190 --> 00:05:37.580
Because our compilation time
out limit is 24 minutes.

00:05:37.580 --> 00:05:39.620
>> Okay.
>> And then disregard the queuing, I

00:05:39.620 --> 00:05:42.850
had some other stuff running at the
same time when I was doing this job.

00:05:42.850 --> 00:05:46.411
And then you're
running in 35 minutes.

00:05:46.411 --> 00:05:46.932
>> Yeah, woo.

00:05:46.932 --> 00:05:50.264
>> Or 35 and a half minutes to
actually get your stuff done.

00:05:50.264 --> 00:05:50.889
>> Okay.

00:05:50.889 --> 00:05:53.178
>> Yeah, so not very good.

00:05:53.178 --> 00:05:55.150
>> Over 2,000 files that
seems like a long time.

00:05:56.240 --> 00:05:57.570
>> Now what we did was,

00:05:57.570 --> 00:06:02.120
this is now the job
running on the new bits.

00:06:02.120 --> 00:06:02.660
>> Okay.

00:06:02.660 --> 00:06:07.890
>> And the first thing
you notice is that my

00:06:07.890 --> 00:06:11.530
compilation time has now gone
down by an order of magnitude.

00:06:11.530 --> 00:06:12.830
It's now about a minute or so.

00:06:12.830 --> 00:06:13.640
>> A minute, yep.

00:06:13.640 --> 00:06:17.560
>> And it's most likely going
to scale much better in

00:06:17.560 --> 00:06:18.707
terms of number of files.

00:06:18.707 --> 00:06:19.508
>> Okay.
>> So we expect it to

00:06:19.508 --> 00:06:21.780
be one to two orders of
magnitude more files than previously.

00:06:21.780 --> 00:06:23.550
>> Wow.

00:06:23.550 --> 00:06:25.990
>> And the other thing
you might notice is,

00:06:25.990 --> 00:06:30.890
again disregard the queuing, the
running was less than ten minutes.

00:06:30.890 --> 00:06:35.700
So the processing of these
2,000 plus files was quite

00:06:35.700 --> 00:06:37.180
a bit faster now.

00:06:37.180 --> 00:06:41.010
Because we also produced
better plans that

00:06:41.010 --> 00:06:43.510
know how to deal with such files.

00:06:43.510 --> 00:06:44.420
>> Okay, good.

00:06:44.420 --> 00:06:47.930
>> So this is just kind of a hint at

00:06:47.930 --> 00:06:49.825
what is coming in
terms of performance.

00:06:49.825 --> 00:06:50.440
>> Good.

00:06:50.440 --> 00:06:53.710
>> Another thing that we
improve is reusing containers.

00:06:53.710 --> 00:06:58.160
So today when you schedule a job,
all these nodes that you

00:06:58.160 --> 00:07:02.920
have inside your job graph
basically run in YARN containers.

00:07:02.920 --> 00:07:08.550
And every time a job starts,
it builds up a new container and

00:07:08.550 --> 00:07:13.225
that takes between half a second
to potentially several seconds.

00:07:13.225 --> 00:07:13.970
>> Right.

00:07:13.970 --> 00:07:17.710
>> And so what we are now
doing in the next release,

00:07:17.710 --> 00:07:20.440
is that we are reusing
the containers where possible.

00:07:20.440 --> 00:07:23.780
So that you are basically not having
that startup time all the time.

00:07:23.780 --> 00:07:26.799
And that should also improve
the performance quite a bit.

00:07:27.930 --> 00:07:28.830
>> That's good to know.

00:07:28.830 --> 00:07:29.893
>> So now in terms of functionality,

00:07:29.893 --> 00:07:31.441
obviously since I haven't
been here for so long.

00:07:31.441 --> 00:07:32.469
>> [LAUGH]
>> We have a lot of

00:07:32.469 --> 00:07:33.089
[CROSSTALK] functionality.

00:07:33.089 --> 00:07:35.760
>> A long list.

00:07:35.760 --> 00:07:39.350
>> Maybe I go and show a few of
those, let me quickly go through.

00:07:39.350 --> 00:07:43.000
So we have added sampling, I will
show you a simple example there.

00:07:43.000 --> 00:07:46.840
But we have also ability
to do uniform sampling and

00:07:46.840 --> 00:07:50.480
even sampling across
correlated datasets.

00:07:50.480 --> 00:07:54.310
So that if you do joins, you get
actually a statistically significant

00:07:54.310 --> 00:07:56.765
sample out of the two
join partners.

00:07:56.765 --> 00:07:57.610
>> Right.

00:07:57.610 --> 00:08:02.270
>> We also added the ability
to do PRESORT on REDUCE.

00:08:02.270 --> 00:08:05.330
I will show you quickly the code,
how to call it.

00:08:05.330 --> 00:08:10.190
I have a blog post on my MSDN
blog that actually shows how

00:08:10.190 --> 00:08:13.090
to write the reducer as well.

00:08:13.090 --> 00:08:18.240
We added some additional
variable declaration options

00:08:18.240 --> 00:08:20.845
so that we can deal with
parameterization and

00:08:20.845 --> 00:08:23.705
constant folding,
I will show that in a second.

00:08:23.705 --> 00:08:25.563
And we added IF THEN ELSE.

00:08:25.563 --> 00:08:28.127
>> Yep [LAUGH].

00:08:28.127 --> 00:08:28.985
Yea!

00:08:28.985 --> 00:08:30.465
>> However, it's only compile time.

00:08:30.465 --> 00:08:31.245
>> Okay.
>> It basically gives

00:08:31.245 --> 00:08:33.355
you the ability to
parametrize your script.

00:08:33.355 --> 00:08:33.930
>> Okay.
>> And

00:08:33.930 --> 00:08:37.815
then see if you want to execute,
let's say,

00:08:37.815 --> 00:08:40.933
the debug version, or the production
version of your script, for example.

00:08:40.933 --> 00:08:42.860
>> Okay.
Still useful though.

00:08:42.860 --> 00:08:47.200
>> Yes, you can also use
FILE.EXISTS or PARTITION.EXISTS

00:08:47.200 --> 00:08:51.550
inside the IF THEN ELSE expression,
or in any other Boolean context.

00:08:51.550 --> 00:08:53.492
>> Okay.
>> To check for example,

00:08:53.492 --> 00:08:56.323
does the file or
partition already exist.

00:08:56.323 --> 00:08:59.490
And if it does,
then let's execute something.

00:08:59.490 --> 00:09:01.040
Or if not,
then execute something else.

00:09:01.040 --> 00:09:02.150
>> That's great.
>> So we'll show you that in

00:09:02.150 --> 00:09:02.830
a minute as well.

00:09:02.830 --> 00:09:04.080
>> Okay.

00:09:04.080 --> 00:09:06.680
Skip first n rows, this has been
asked for for a while, I think.

00:09:06.680 --> 00:09:07.920
>> Yes, yes, yes, and

00:09:07.920 --> 00:09:11.252
we actually were hoping that we
would have it out a long time ago.

00:09:11.252 --> 00:09:12.245
>> [LAUGH]
>> But it took us,

00:09:12.245 --> 00:09:13.890
unfortunately, a bit longer.

00:09:13.890 --> 00:09:14.770
And so now it's there.

00:09:14.770 --> 00:09:15.607
>> Okay.
>> So we'll show you that

00:09:15.607 --> 00:09:16.169
in a minute as well.

00:09:16.169 --> 00:09:18.420
>> Woo, okay.

00:09:18.420 --> 00:09:20.650
>> And
the USING statement to shorten C# names.

00:09:20.650 --> 00:09:23.150
>> Okay.
>> I will show you that quickly.

00:09:23.150 --> 00:09:26.980
And then last but not least,
we have ALTER TABLE for adding and

00:09:26.980 --> 00:09:28.090
removing columns.

00:09:28.090 --> 00:09:29.491
>> Okay.
>> That gives you the ability to do

00:09:29.491 --> 00:09:31.490
a little bit of schema
evolution on your tables.

00:09:31.490 --> 00:09:32.560
>> Wonderful.
>> As well.

00:09:32.560 --> 00:09:34.950
>> Good.
>> I won't show you that though.

00:09:34.950 --> 00:09:36.397
>> That's okay.
>> You can look it up in the release

00:09:36.397 --> 00:09:37.598
notes that we've published a bit.

00:09:37.598 --> 00:09:41.017
>> Okay.
>> Okay, so let's go back to my code

00:09:41.017 --> 00:09:45.633
here and let's look at
a second script that I have.

00:09:45.633 --> 00:09:49.318
Now what this script does is it
declares some variables, and

00:09:49.318 --> 00:09:51.275
I will get into that in a second.

00:09:51.275 --> 00:09:52.862
>> Mm-hm.

00:09:52.862 --> 00:09:54.800
>> Because there
are different options and

00:09:54.800 --> 00:09:56.681
I want to quickly show
you what they do.

00:09:56.681 --> 00:09:59.712
And then down here I
have an IF statement.

00:09:59.712 --> 00:10:04.472
And what the IF statement does is it
actually checks whether the file exists

00:10:04.472 --> 00:10:07.617
in the store when
the script gets compiled.

00:10:07.617 --> 00:10:09.279
>> Yeah, this is one of the examples
you gave on the slide, right?

00:10:09.279 --> 00:10:14.260
>> Yes, and if it exists,
then I do an extraction here.

00:10:14.260 --> 00:10:18.899
Now this extraction operates
on some car telemetry data.

00:10:18.899 --> 00:10:23.839
And so, what the data actually
contains, let me show you that

00:10:23.839 --> 00:10:29.049
quickly, is two header rows.

00:10:31.040 --> 00:10:34.630
So you see here, that's some car
telemetry from line three on downwards.

00:10:34.630 --> 00:10:39.132
But the first line just gives
me some recording information.

00:10:39.132 --> 00:10:43.199
And then the second line gives me
the actual header row and so on.

00:10:43.199 --> 00:10:45.929
And so obviously I would
like to skip those two, so

00:10:45.929 --> 00:10:47.361
that my extractor works.

00:10:50.286 --> 00:10:55.948
So I'm using skipFirstNRows:2.

00:10:55.948 --> 00:11:00.877
Then I do some calculations here,
just some grouping

00:11:00.877 --> 00:11:05.940
to see some information, and
then at the end I output.

00:11:05.940 --> 00:11:07.880
If the file does not exist,

00:11:07.880 --> 00:11:13.200
I just create a pseudo row here
that says, file not found.

00:11:13.200 --> 00:11:16.334
And I do the same output here
by cheating on the name of

00:11:16.334 --> 00:11:17.410
the column here.
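
NOTE
Editor's sketch, in Python rather than U-SQL: the control flow just described,
reduced to its essentials. If the input file exists, skip the two header rows
and return the data rows; otherwise return a single "file not found" pseudo row.
The function name read_laps and the CSV layout are hypothetical. One difference
to keep in mind: the U-SQL IF is evaluated when the script is compiled, whereas
this Python check happens at run time.
```python
import csv
import os
from itertools import islice
def read_laps(path: str, skip_first_n_rows: int = 2) -> list:
    """Return the data rows of a CSV file, or one placeholder row if the file is missing."""
    if not os.path.exists(path):
        # Mirrors the ELSE branch: a single pseudo row instead of real data.
        return [["file not found"]]
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        # Skip the recording-information line and the column-header line.
        return list(islice(reader, skip_first_n_rows, None))
```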

00:11:17.410 --> 00:11:19.328
>> [LAUGH]
>> So, I hope that works.

00:11:19.328 --> 00:11:23.576
Okay, so if I execute this now,
well, before I execute,

00:11:23.576 --> 00:11:27.930
let me quickly
explain the declaration up here.

00:11:27.930 --> 00:11:31.700
So, this declare statement
says DECLARE EXTERNAL name.

00:11:31.700 --> 00:11:36.380
Now what this does, is it allows me
to add another declaration statement

00:11:36.380 --> 00:11:41.180
or a parameter to the script.

00:11:41.180 --> 00:11:43.424
But if there's no
parameter provided,

00:11:43.424 --> 00:11:46.332
then the script still runs
with this default value.

00:11:46.332 --> 00:11:49.692
Because today, if I
have two declare statements for

00:11:49.692 --> 00:11:52.090
the same variable,
we basically error.

00:11:52.090 --> 00:11:53.610
And in this case, we basically say,

00:11:53.610 --> 00:11:56.790
well, actually this is kind of
the default for the script.

00:11:56.790 --> 00:11:59.750
And if you want to provide it with
some parameter mechanism, for

00:11:59.750 --> 00:12:02.785
example, Azure Data Factory
has a parameter model,

00:12:02.785 --> 00:12:04.620
where they prepend
DECLARE statements.

00:12:04.620 --> 00:12:08.355
So, that would give me the ability
to default this script and

00:12:08.355 --> 00:12:11.595
then still parametrize it for
example through ADF or

00:12:11.595 --> 00:12:13.805
some other submission tool.
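
NOTE
Editor's sketch, in Python rather than U-SQL: a toy model of the rule described
here. An external declaration only supplies a default, a plain declaration
placed before it wins, and two plain declarations of the same variable are an
error. The function name, the tuple layout, and the choice to reject a plain
declaration that follows an external one are assumptions of this sketch, not
documented U-SQL behavior.
```python
def resolve_declarations(declarations: list) -> dict:
    """Resolve (name, value, is_external) declarations in script order."""
    values = {}
    for name, value, is_external in declarations:
        if is_external:
            # The external value is only a default: keep an earlier value if there is one.
            if name not in values:
                values[name] = value
        else:
            if name in values:
                raise ValueError(f"variable {name} is already declared")
            values[name] = value
    return values
```
A submission tool that prepends its own declaration ahead of the script body
therefore overrides the default, and the script still runs when nothing is prepended.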

00:12:13.805 --> 00:12:17.623
The fifth line here
says DECLARE CONST.

00:12:17.623 --> 00:12:21.483
What this does is, it actually
checks if the expression that I

00:12:21.483 --> 00:12:25.430
provide after it is constant foldable.
Why is that important?

00:12:25.430 --> 00:12:30.200
Well, constant foldable is something
that we can evaluate at compile

00:12:30.200 --> 00:12:31.220
time, and

00:12:31.220 --> 00:12:35.300
we have a few contexts where we
allow you to put in expressions.

00:12:35.300 --> 00:12:38.860
That either have to be
constant foldable, like for

00:12:38.860 --> 00:12:41.810
example the from
clause in an extract.

00:12:41.810 --> 00:12:46.190
Or I have special optimizations
that will happen.

00:12:47.790 --> 00:12:51.580
But it will still execute even
if it's not constant foldable.

00:12:51.580 --> 00:12:54.810
So this gives me the ability
to assert that this expression

00:12:54.810 --> 00:12:55.830
is constant foldable.

00:12:55.830 --> 00:12:57.028
Now, in this case,

00:12:57.028 --> 00:13:01.617
since I'm doing string concatenation
with an anonymous function in there.

00:13:01.617 --> 00:13:03.347
>> [LAUGH]
>> This is not going to be

00:13:03.347 --> 00:13:05.943
constant foldable, so
if I compile this.

00:13:08.872 --> 00:13:13.118
And I get the syntax error here
because I'm running on a not yet

00:13:13.118 --> 00:13:14.840
released runtime here.

00:13:14.840 --> 00:13:18.392
So my local tool is
not up to date yet.

00:13:18.392 --> 00:13:22.376
>> [LAUGH]
>> You will notice that after about six to

00:13:22.376 --> 00:13:29.512
ten seconds the compiler will be
complaining about this, let's see.

00:13:32.613 --> 00:13:34.850
>> So, at this point,
yep, there you go.

00:13:34.850 --> 00:13:37.600
>> Yep, so
let me go through the errors here.

00:13:38.740 --> 00:13:42.850
And if I zoom in,
you will notice that it now says,

00:13:42.850 --> 00:13:46.850
the expression cannot be
evaluated at compile time.

00:13:46.850 --> 00:13:47.447
>> Can't be folded.
>> And

00:13:47.447 --> 00:13:52.330
it's here at the DECLARE CONST
expression where this happens.

00:13:53.800 --> 00:13:58.458
So, what I can do now is,
I can go back of course,

00:13:58.458 --> 00:14:04.490
just fix this, and
fix it with this statement.

00:14:04.490 --> 00:14:08.470
Now, the default that I have here,
that file does not exist.

00:14:08.470 --> 00:14:15.587
So if I execute this job now,
it will go and run and

00:14:15.587 --> 00:14:21.032
create the file just telling me
that the file doesn't exist.

00:14:21.032 --> 00:14:27.234
>> Okay, so
this will be taking a little bit.

00:14:29.174 --> 00:14:32.710
What code were we looking at?

00:14:32.710 --> 00:14:33.910
>> We were looking at this here.

00:14:33.910 --> 00:14:38.030
>> Okay, so the difference,
as I was saying that the line 5

00:14:38.030 --> 00:14:42.020
at this point can't be folded, so
cuz it couldn't find the name?

00:14:42.020 --> 00:14:44.408
>> No, so constant folding means
that it can be evaluated at

00:14:44.408 --> 00:14:45.019
compile time.

00:14:45.019 --> 00:14:48.533
So that we are basically kind
of doing some evaluation of

00:14:48.533 --> 00:14:49.890
the expressions.

00:14:49.890 --> 00:14:52.018
Similar to what C# does for example,

00:14:52.018 --> 00:14:54.288
if you add two constant
values with C#.

00:14:54.288 --> 00:14:57.846
>> And so, in this case obviously
this lambda expression is not

00:14:57.846 --> 00:15:01.826
constant foldable because our
constant folder is not clever enough

00:15:01.826 --> 00:15:02.819
to look into it.

00:15:02.819 --> 00:15:06.251
While a simple string
concatenation is supported,

00:15:06.251 --> 00:15:08.981
actually all the string
operations are,

00:15:08.981 --> 00:15:11.733
that's why I had to like such a-
>> All right, yep,

00:15:11.733 --> 00:15:12.559
that makes sense now.
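
NOTE
Editor's sketch, in Python rather than U-SQL: a toy constant-foldability check
in the spirit of what was just explained. It accepts expressions built only from
literals joined with "+", and rejects anything containing a lambda or a call.
The function name and the rule itself are assumptions for illustration; the
real U-SQL constant folder supports more operations than this.
```python
import ast
def is_constant_foldable(expression: str) -> bool:
    """Toy check: True if the expression uses nothing but literals and the + operator."""
    tree = ast.parse(expression, mode="eval")
    # Any other node type (Lambda, Call, Name, ...) cannot be folded by this toy rule.
    allowed = (ast.Expression, ast.Constant, ast.BinOp, ast.Add)
    return all(isinstance(node, allowed) for node in ast.walk(tree))
```
A plain concatenation of two string literals passes the check, while the same
concatenation routed through an immediately invoked lambda does not.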

00:15:12.559 --> 00:15:15.719
>> All right, cool, so let's go
back and see what the job does.

00:15:17.110 --> 00:15:17.631
So it's running now.

00:15:20.490 --> 00:15:22.140
Probably already finished, yeah.

00:15:23.910 --> 00:15:27.015
So now,
if I open my header file here,

00:15:27.015 --> 00:15:31.592
download it, you will

00:15:31.592 --> 00:15:36.562
notice that I'm getting the-
>> Not found.

00:15:36.562 --> 00:15:37.620
>> File not found.

00:15:38.680 --> 00:15:43.920
And if I go back now and
I change this

00:15:43.920 --> 00:15:48.450
and add this additional DECLARE

00:15:48.450 --> 00:15:51.640
statement before the external
declaration for the same variable.

00:15:52.770 --> 00:15:54.420
And I submit this now as well,

00:15:55.450 --> 00:15:57.770
at that point in time
it will actually work.

00:15:57.770 --> 00:16:00.210
>> I see.
>> So, maybe while it's running,

00:16:00.210 --> 00:16:01.850
let me quickly show some
of the other stuff.

00:16:04.520 --> 00:16:10.210
So, one thing that I wanted to
show was the range, the PRESORT.

00:16:10.210 --> 00:16:14.540
So, in this case here I
have a U-SQL script that

00:16:14.540 --> 00:16:16.780
takes some data of ranges.

00:16:16.780 --> 00:16:22.340
And I need to kind of collapse
the ranges that are overlapping.

00:16:22.340 --> 00:16:26.940
And so, in order to be able to do
that, I write a custom aggregator.

00:16:26.940 --> 00:16:31.492
But the custom aggregator, the best
way to do that is, if I make it,

00:16:31.492 --> 00:16:34.400
basically, if I have the
input data sorted.

00:16:34.400 --> 00:16:38.594
So then I can just look at the next
row to see if I have to include it

00:16:38.594 --> 00:16:42.174
in the previous interval or
if I start a new interval.

00:16:42.174 --> 00:16:46.647
And so, PRESORT basically now
gives me the ability to create

00:16:46.647 --> 00:16:49.650
user-defined ordered aggregations.
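
NOTE
Editor's sketch, in Python rather than U-SQL: the ordered aggregation idea
described here. When the ranges arrive sorted by their begin value, a single
pass only has to compare each range with the interval currently being built.
The function name and the tuple layout are assumptions; this is not the reducer
from the blog post. Ranges that merely touch are treated as overlapping.
```python
def collapse_sorted_ranges(ranges: list) -> list:
    """Merge overlapping (begin, end) pairs; the input must be sorted by begin."""
    merged = []
    for begin, end in ranges:
        if merged and begin <= merged[-1][1]:
            # Overlaps (or touches) the current interval: extend it if needed.
            last_begin, last_end = merged[-1]
            merged[-1] = (last_begin, max(last_end, end))
        else:
            # Gap before this range: start a new interval.
            merged.append((begin, end))
    return merged
```
Without the sort guarantee, the same function would need to sort or buffer all
rows first, which is the work that presorting moves out of the user code.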

00:16:49.650 --> 00:16:55.320
So, this is just an example, we
actually have this on my MSDN blog.

00:16:55.320 --> 00:16:59.260
Available if you want to also
look at what the code looks like, and

00:16:59.260 --> 00:17:01.270
it's downloadable from
our GitHub site as well.

00:17:02.400 --> 00:17:05.800
So, then the second
thing I wanted to show

00:17:05.800 --> 00:17:07.799
quickly is the USING statement.

00:17:09.230 --> 00:17:13.230
Again, this is code that we
can download from our GitHub.

00:17:14.670 --> 00:17:19.570
So, what I do here is I reference
a SQL Server spatial assembly

00:17:19.570 --> 00:17:22.010
that I downloaded and installed.

00:17:22.010 --> 00:17:24.880
And also already predefined and

00:17:24.880 --> 00:17:29.220
preloaded in our runtime is
the System.Data assembly.

00:17:29.220 --> 00:17:31.788
So now I, instead of having to
write Microsoft.SqlServer.Types.

00:17:31.788 --> 00:17:35.734
>> [LAUGH]
>> SqlGeometry or SqlGeography,

00:17:35.734 --> 00:17:39.181
I just use the USING statement
to shorten that, and

00:17:39.181 --> 00:17:42.243
then I can use that
inside my C# expression.

00:17:42.243 --> 00:17:44.380
>> So it makes the experience
much better in this case?

00:17:44.380 --> 00:17:46.322
>> Yes,
similar to the using clause in C#,

00:17:46.322 --> 00:17:48.773
except that it's upper case
instead of lower case.

00:17:48.773 --> 00:17:51.057
>> Okay, yep, wonderful.

00:17:51.057 --> 00:17:56.300
>> Now, let's go back to our job
that obviously has now completed.

00:17:56.300 --> 00:17:59.779
As you can see we are writing
a little bit more data here, so

00:17:59.779 --> 00:18:01.202
if I download the file.

00:18:06.711 --> 00:18:11.142
Now you see that I did-
>> Okay, did you get some data?

00:18:11.142 --> 00:18:16.447
>> Five laps, this is the duration
of each lap in milliseconds or

00:18:16.447 --> 00:18:19.890
so, this is the highest
RPM that I got.

00:18:19.890 --> 00:18:21.666
This is also the highest
speed that I got, so

00:18:21.666 --> 00:18:24.078
at one point in kilometers per hour,
not in miles per hour.

00:18:24.078 --> 00:18:25.125
>> [LAUGH].

00:18:25.125 --> 00:18:30.708
>> And so the fastest I was, was like
200 kilometers an hour and 196.

00:18:30.708 --> 00:18:32.534
And then there was probably,

00:18:32.534 --> 00:18:36.660
I had to drive behind some other
guy that was only like 150 or so.

00:18:36.660 --> 00:18:39.654
>> Okay, well 200 kilometers
an hour, still about 100 and-

00:18:39.654 --> 00:18:40.271
>> 25 miles.

00:18:40.271 --> 00:18:41.903
>> 25 miles, okay.

00:18:41.903 --> 00:18:43.551
That's still clipping
along pretty good.

00:18:43.551 --> 00:18:44.287
>> And that was on a race track.

00:18:44.287 --> 00:18:45.263
>> Okay.
[LAUGH]

00:18:45.263 --> 00:18:46.279
>> And not on public roads.

00:18:46.279 --> 00:18:47.303
>> That was on 405. [LAUGH]

00:18:47.303 --> 00:18:48.575
>> No, no, no, no, no.

00:18:48.575 --> 00:18:50.702
I don't condone that, drive normal.

00:18:50.702 --> 00:18:51.472
>> Very good.

00:18:51.472 --> 00:18:54.991
[LAUGH]
>> Okay, so this just shows you how

00:18:54.991 --> 00:19:00.010
we can now basically use the IF THEN
ELSE statement and skip header rows.

00:19:00.010 --> 00:19:04.925
Because now obviously I skipped over
those two headers without having

00:19:04.925 --> 00:19:08.756
to write a custom extractor or
say silent equals true or

00:19:08.756 --> 00:19:16.240
something. Okay, so going back now,
here I think I've shown most of it.

00:19:16.240 --> 00:19:20.130
Now, supportability is also something
where we added new capabilities.

00:19:20.130 --> 00:19:21.320
In Visual Studio now,

00:19:21.320 --> 00:19:24.790
if you have a runtime error,
there is a bar appearing on top,

00:19:24.790 --> 00:19:29.690
that allows you to download the
failed vertex, and locally debug it.

00:19:29.690 --> 00:19:33.685
So, if you have user code, let's say
you write your own extractor, or

00:19:33.685 --> 00:19:35.758
you write your complicated C# UDF.

00:19:35.758 --> 00:19:39.881
And you run into an issue
because of, I don't know,

00:19:39.881 --> 00:19:45.710
memory overflow, invalid number
of columns, cast errors or so.

00:19:45.710 --> 00:19:48.230
You can now download the vertex and

00:19:48.230 --> 00:19:51.850
look at it locally in your local
debugger, inside Visual Studio.

00:19:51.850 --> 00:19:52.670
>> And see where it-
>> And debug it and

00:19:52.670 --> 00:19:53.265
see what happens.

00:19:53.265 --> 00:19:54.913
So that's another very cool feature.

00:19:54.913 --> 00:19:55.594
>> Very nice.

00:19:55.594 --> 00:19:59.084
>> And I hope that we can invite one
of the guys from the Visual Studio

00:19:59.084 --> 00:20:01.724
team to do a video at some point.

00:20:01.724 --> 00:20:02.220
>> That'd be interesting.

00:20:02.220 --> 00:20:03.620
Okay, yep, we'll plan on doing that.

00:20:03.620 --> 00:20:05.692
>> Okay,
now a little bit more serious.

00:20:05.692 --> 00:20:06.910
So, we had the good news.

00:20:06.910 --> 00:20:08.470
Now the bad news [LAUGH].

00:20:08.470 --> 00:20:10.523
>> It's still good news long term,
but

00:20:10.523 --> 00:20:14.102
it means that people that actually
are using U-SQL right now may

00:20:14.102 --> 00:20:15.971
have to go and fix their scripts.

00:20:15.971 --> 00:20:16.670
>> Okay.

00:20:16.670 --> 00:20:19.909
>> So over the course of
our feature development,

00:20:19.909 --> 00:20:23.146
we have noticed a couple of
things that need to

00:20:23.146 --> 00:20:26.481
be improved because they
were hard to understand,

00:20:26.481 --> 00:20:30.732
because they were misleading,
somewhat incorrect, etc.

00:20:30.732 --> 00:20:35.044
The first one is that people got
confused with our partition syntax

00:20:35.044 --> 00:20:37.050
on the table creation statement.

00:20:37.050 --> 00:20:39.175
So we were saying
PARTITIONED BY HASH and

00:20:39.175 --> 00:20:42.220
then we also said
PARTITIONED BY BUCKET.

00:20:42.220 --> 00:20:43.845
And it wasn't clear how they relate.

00:20:43.845 --> 00:20:44.676
>> Okay.
>> And so

00:20:44.676 --> 00:20:48.742
we are now making the terminology
much, much more precise.

00:20:48.742 --> 00:20:49.414
>> Okay.

00:20:49.414 --> 00:20:51.369
>> So now if we say partition,

00:20:51.369 --> 00:20:54.687
partitions are things
that are addressable.

00:20:54.687 --> 00:20:59.310
That used to be called
PARTITIONED BY BUCKET.

00:20:59.310 --> 00:21:02.818
In addition, the thing that we used
to call hash partition is now hash

00:21:02.818 --> 00:21:06.624
distribution or range distribution,
which is basically the distribution

00:21:06.624 --> 00:21:10.240
of the data within a table or
within an individual partition.

00:21:10.240 --> 00:21:14.190
And so that is now very
clearly denoted in the syntax.
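
NOTE
A conceptual sketch of the distinction described above, written in Python rather than U-SQL: rows are first grouped into partitions that can be addressed by a key value, then spread by hash inside each partition. The function name layout, the column names, and the integer-modulo "hash" are illustrative assumptions, not a description of how U-SQL stores data.
```python
from collections import defaultdict
def layout(rows, partition_key, distribution_key, buckets):
    """Group rows into addressable partitions, then hash-distribute each one."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # A partition is addressable by the value of its key column.
        partition = row[partition_key]
        # Toy stand-in for a hash function: integer modulo the bucket count.
        bucket = row[distribution_key] % buckets
        table[partition][bucket].append(row)
    return table
# Made-up sample rows for illustration only.
rows = [
    {"day": "2016-08-01", "id": 1},
    {"day": "2016-08-01", "id": 2},
    {"day": "2016-08-02", "id": 3},
]
table = layout(rows, "day", "id", buckets=2)
print(sorted(table))
print(sorted(table["2016-08-01"]))
```
Here each day value is its own addressable partition, and id % buckets plays the role of the hash distribution inside it.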

00:21:14.190 --> 00:21:17.237
So if you have been using
PARTITIONED BY, please, please,

00:21:17.237 --> 00:21:19.524
go and change your code
into DISTRIBUTED BY.

00:21:19.524 --> 00:21:20.400
>> Okay.
>> Also,

00:21:20.400 --> 00:21:25.144
if you use PARTITIONED BY BUCKET,
then go and use PARTITIONED BY,

00:21:25.144 --> 00:21:27.194
DISTRIBUTED BY, instead.

00:21:27.194 --> 00:21:30.194
That is really,
really important, because soon,

00:21:30.194 --> 00:21:32.660
we will switch off the old syntax,
right?

00:21:32.660 --> 00:21:37.688
Now we support both,
but by late September,

00:21:37.688 --> 00:21:42.454
early October, most likely,
we will switch the old syntax off.

00:21:42.454 --> 00:21:45.121
And you will have to go and
change your script.

00:21:45.121 --> 00:21:46.442
>> Okay, yeah.

00:21:46.442 --> 00:21:49.839
>> And note it does not affect
anything on the data, so

00:21:49.839 --> 00:21:53.947
you don't have to go and
rerun your script. The metadata,

00:21:53.947 --> 00:21:56.724
the data underneath,
is exactly the same.

00:21:56.724 --> 00:21:58.921
It's just that the syntax
is changing, so

00:21:58.921 --> 00:22:01.523
you don't have to do anything to the data.

00:22:01.523 --> 00:22:07.061
And similarly, we are now going
to require the official

00:22:07.061 --> 00:22:11.954
notation for
24 hours in the file set patterns.

00:22:11.954 --> 00:22:18.448
So we currently support both
lower case h and upper case H.

00:22:18.448 --> 00:22:23.524
But in C#, lower case h in
pattern language means it is for

00:22:23.524 --> 00:22:25.606
12 hour clocks only.

00:22:25.606 --> 00:22:29.282
And so we are going to now basically
deprecate the lower case h and

00:22:29.282 --> 00:22:31.268
will require only upper-case H.
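
NOTE
A small Python sketch of why a 12-hour hour token is a problem for hourly file paths. Python's %I (12-hour) and %H (24-hour) are used here only as stand-ins for lower case h and upper case H in the C# pattern language; the helper name hour_folder and the sample timestamps are made up for illustration.
```python
from datetime import datetime
def hour_folder(ts, fmt):
    """Render the hour part of a hypothetical hourly folder name."""
    return ts.strftime(fmt)
# Two timestamps twelve hours apart on the same day.
morning = datetime(2016, 8, 1, 3, 0)
afternoon = datetime(2016, 8, 1, 15, 0)
# 12-hour rendering: both timestamps map to the same folder name.
collides_12h = hour_folder(morning, "%I") == hour_folder(afternoon, "%I")
# 24-hour rendering: the two folder names stay distinct.
distinct_24h = hour_folder(morning, "%H") != hour_folder(afternoon, "%H")
print(collides_12h, distinct_24h)
```
With a 12-hour token the morning and afternoon hours cannot be told apart, which is the ambiguity the 24-hour form avoids.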

00:22:31.268 --> 00:22:34.050
>> Upper case H, okay.

00:22:34.050 --> 00:22:39.490
>> Another thing on the file
set side: we had two ways of

00:22:39.490 --> 00:22:43.150
basically giving a wildcard
on a name and

00:22:43.150 --> 00:22:45.654
that is of string type or
in text or so.

00:22:45.654 --> 00:22:49.326
One was the col:*.

00:22:49.326 --> 00:22:49.904
>> Yep.
>> And

00:22:49.904 --> 00:22:51.293
the other one was just the col name.

00:22:51.293 --> 00:22:51.981
>> Okay.

00:22:51.981 --> 00:22:53.279
>> And that was confusing.

00:22:53.279 --> 00:22:55.576
They had slightly
different semantics, but

00:22:55.576 --> 00:22:58.909
it was very confusing for people
to understand when to use which.

00:22:58.909 --> 00:23:03.898
So we are going to basically
remove the col:* form,

00:23:03.898 --> 00:23:08.780
and we'll make the semantics
of the one which just has the

00:23:08.780 --> 00:23:13.900
col name equivalent to
the one with the star before.

00:23:13.900 --> 00:23:16.717
So please go and
change it again because [INAUDIBLE]-

00:23:16.717 --> 00:23:19.427
>> [CROSSTALK]

00:23:19.427 --> 00:23:22.676
>> Scripts will stop working if you

00:23:22.676 --> 00:23:24.189
do not do that.

00:23:24.189 --> 00:23:25.997
Okay, that's it.
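
NOTE
A rough Python sketch of a checker for the constructs called out as going away in this episode. The function name find_deprecated and the regular expressions are assumptions based only on the spellings used in the talk (for example, that the wildcard form is written {col:*}); this is not an official migration tool.
```python
import re
# Hypothetical patterns for the forms the talk says will be switched off.
DEPRECATED = [
    (re.compile(r"PARTITIONED\s+BY\s+BUCKET", re.I), "use PARTITIONED BY"),
    (re.compile(r"PARTITIONED\s+BY\s+HASH", re.I), "use DISTRIBUTED BY"),
    (re.compile(r"\{\w+:h+\}"), "use upper-case H for 24-hour patterns"),
    (re.compile(r"\{\w+:\*\}"), "use the plain {col} form"),
]
def find_deprecated(script):
    """Return (line_number, advice) pairs for each deprecated form found."""
    findings = []
    for number, line in enumerate(script.splitlines(), start=1):
        for pattern, advice in DEPRECATED:
            if pattern.search(line):
                findings.append((number, advice))
    return findings
# Made-up script fragment for illustration only.
sample = "\n".join([
    "PARTITIONED BY HASH (id)",
    'FROM "/logs/{date:yyyy}/{date:hh}/{name:*}.csv"',
    "DISTRIBUTED BY HASH (id)",
])
findings = find_deprecated(sample)
for number, advice in findings:
    print(number, advice)
```
The checker only reports line numbers and advice; it does not rewrite anything, so a script can be reviewed by hand before the old syntax is switched off.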

00:23:25.997 --> 00:23:26.580
>> Okay, awesome.

00:23:26.580 --> 00:23:28.741
So good information, very nice.

00:23:28.741 --> 00:23:30.389
Michael, thanks for coming.

00:23:30.389 --> 00:23:34.115
If they have any questions,
if the viewers have any questions,

00:23:34.115 --> 00:23:37.360
what is the best way to get
a hold of you, ask questions?

00:23:37.360 --> 00:23:40.872
>> The best way is if you're
external, you can tweet at me and

00:23:40.872 --> 00:23:42.458
I should be able to see it.

00:23:42.458 --> 00:23:43.626
>> Okay.

00:23:43.626 --> 00:23:47.462
>> Otherwise leave a comment
below the video here.

00:23:47.462 --> 00:23:49.510
>> Yep, in the blog,
yep, leave a comment.

00:23:49.510 --> 00:23:51.766
>> Or
contact me through my blog or so.

00:23:51.766 --> 00:23:52.734
>> Okay, perfect.

00:23:52.734 --> 00:23:54.830
All right, cuz I'm sure
there'll be some questions,

00:23:54.830 --> 00:23:57.298
especially around the late
September, early October time frame.

00:23:57.298 --> 00:24:00.628
Or over the next, let's say, 30 days
as people start migrating over.

00:24:00.628 --> 00:24:04.122
All right, so
hit him up at his Twitter account,

00:24:04.122 --> 00:24:06.773
his Twitter handle, or on his blog.

00:24:06.773 --> 00:24:08.920
And I'm sure Mike will be
happy to help you out.

00:24:08.920 --> 00:24:12.241
Everybody, thanks for watching,
and we will see you next time.

00:24:12.241 --> 00:24:22.241
[MUSIC]

