WEBVTT
00:00:04.551 --> 00:00:09.251
Okay, so let's start with
a quick recap of what we were
00:00:09.252 --> 00:00:12.224
doing last time and
then carry on.
00:00:12.225 --> 00:00:14.660
I'll tell you what
we'll do today.
00:00:14.661 --> 00:00:16.860
We proved certain things.
00:00:16.861 --> 00:00:21.600
We proved that the volume of the
d-dimensional sphere of points
00:00:21.601 --> 00:00:23.179
of length less than or
00:00:23.180 --> 00:00:26.350
equal to 1 goes down
as d goes to infinity.
00:00:26.351 --> 00:00:28.671
And in fact, it's asymptotic:
00:00:28.672 --> 00:00:32.747
There's some constant to the d
divided by d to the d over 2.
00:00:32.748 --> 00:00:34.200
And so it goes to 0.
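As a quick numeric sanity check of that recap claim, here is a sketch using the standard closed form V_d = pi^(d/2) / Gamma(d/2 + 1) for the unit ball (the formula is standard, not from the slides):

```python
import math

def unit_ball_volume(d):
    # Volume of the d-dimensional ball {x : |x| <= 1}:
    # V_d = pi^(d/2) / Gamma(d/2 + 1).
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

vols = {d: unit_ball_volume(d) for d in (2, 5, 10, 20, 50)}
# The volume peaks around d = 5 and then dies off
# super-exponentially, matching the c^d / d^(d/2) asymptotic.
```
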
00:00:34.201 --> 00:00:36.305
And to do this,
we use the Gaussian integral.
00:00:36.306 --> 00:00:41.340
And I won't go through
the proof again, but
00:00:41.341 --> 00:00:44.240
we needed two properties of
the Gaussian or the integral.
00:00:44.241 --> 00:00:46.247
So one was radial symmetry.
00:00:46.248 --> 00:00:49.240
So it depended only on r,
the distance from the origin.
00:00:49.241 --> 00:00:52.920
And two was it splits
into the coordinates.
00:00:52.921 --> 00:00:55.527
I mean, it's only the
Gaussian that can do both.
00:00:55.528 --> 00:00:59.031
So even though it looked like
I pulled this function out of
00:00:59.032 --> 00:01:01.230
a hat, it was for
a reason, right?
00:01:01.231 --> 00:01:03.367
And we proved also that for
00:01:03.368 --> 00:01:07.167
a uniform random point
x from the unit sphere.
00:01:07.168 --> 00:01:12.055
We have that the probability
that the length of x1, the first
00:01:12.056 --> 00:01:17.037
component, this is the distance
from the equator if you will,
00:01:17.038 --> 00:01:19.387
is greater than c over root d.
00:01:19.388 --> 00:01:22.131
Goes down as e to
the -c squared.
00:01:22.132 --> 00:01:23.520
Okay, so we proved that.
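As a sketch of that concentration statement, one can estimate the tail by simulation; the sampler below normalizes independent Gaussians (the method recapped later in this lecture), and the constants d, c, and the trial count are my own choices:

```python
import math
import random

random.seed(0)

def random_unit_vector(d):
    # Uniform point on the unit sphere in R^d: normalize
    # a vector of independent standard Gaussians.
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

d, c, trials = 200, 2.0, 2000
threshold = c / math.sqrt(d)
hits = sum(abs(random_unit_vector(d)[0]) > threshold
           for _ in range(trials))
fraction = hits / trials
# fraction should be small, on the order of e^{-c^2}:
# most of the sphere hugs the slab |x_1| <= c/sqrt(d).
```
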
00:01:23.521 --> 00:01:27.840
So here's a little sort of
general thing I didn't state
00:01:27.841 --> 00:01:28.741
last time.
00:01:28.742 --> 00:01:33.588
But many quantities, like
the absolute value of x1, right,
00:01:33.589 --> 00:01:36.972
the first coordinate,
is a random variable
00:01:36.973 --> 00:01:41.011
where x is picked uniformly
at random from the unit sphere.
00:01:41.012 --> 00:01:46.108
Such random variables
have very well-behaved tails.
00:01:46.109 --> 00:01:49.362
They're exponentially
dropping tails.
00:01:49.363 --> 00:01:52.920
That is to say they drop off
exponentially as we move
00:01:52.921 --> 00:01:55.500
to multiples of
the expected value.
00:01:55.501 --> 00:01:59.476
So as soon as you take twice,
thrice or c times the expected
00:01:59.477 --> 00:02:03.873
value, it goes down as e to the
-c or e to the -c squared, okay?
00:02:03.874 --> 00:02:08.221
Often e to the minus c squared.
00:02:08.222 --> 00:02:11.004
So the e to the minus x squared
is an important function,
00:02:11.005 --> 00:02:12.790
because it's a Gaussian, right?
00:02:12.791 --> 00:02:14.880
And we'll see that
several times.
00:02:14.881 --> 00:02:17.797
That'll come up.
00:02:17.798 --> 00:02:21.530
Okay, so I sometimes call this
the law of large dimensions.
00:02:21.531 --> 00:02:22.666
It's like the law
of large numbers.
00:02:22.667 --> 00:02:26.561
But basically, having many
dimensions is sort of
00:02:26.562 --> 00:02:30.743
like having many independent
variables sometimes.
00:02:30.744 --> 00:02:33.439
So it behaves like that.
00:02:33.440 --> 00:02:35.090
Now, 2 gives another proof of 1.
00:02:35.091 --> 00:02:38.660
And also leads to some puzzles
that we discussed a little bit,
00:02:38.661 --> 00:02:39.786
but it's worth understanding.
00:02:39.787 --> 00:02:44.350
So by union bound,
we saw that at least half
00:02:44.351 --> 00:02:47.672
the volume of the sphere
lies in a cube of this side.
00:02:47.673 --> 00:02:52.160
And we draw a picture and the
picture looks sort of confusing
00:02:52.161 --> 00:02:57.220
in a way, because all I am
saying is here is a cube and
00:02:57.221 --> 00:02:59.530
half the volume of
the cube is here.
00:02:59.531 --> 00:03:02.650
But I already proved that
half the volume of the cube
00:03:02.651 --> 00:03:04.650
is also there, right?
00:03:04.651 --> 00:03:07.463
So apparently, they don't seem
to intersect but they do.
00:03:07.464 --> 00:03:08.871
Of course, they intersect.
00:03:08.872 --> 00:03:10.196
Both facts are true.
00:03:10.197 --> 00:03:14.344
And you should just write down,
algebraically,
00:03:14.345 --> 00:03:19.687
what happens and you'll see that
that's not a problem, okay?
00:03:19.688 --> 00:03:21.368
Okay, so
2, the equator statement,
00:03:21.369 --> 00:03:23.539
also implied that samples x and
y from the sphere
00:03:23.540 --> 00:03:26.250
are nearly orthogonal
if they're independent.
00:03:26.251 --> 00:03:28.830
And we prove that for
many samples, okay?
00:03:30.190 --> 00:03:33.000
And we also saw how
to generate uniform
00:03:33.001 --> 00:03:35.686
random points from
the hypersphere using Gaussians.
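A minimal sketch of why normalizing a Gaussian works: the Gaussian density is radially symmetric, so the direction of the normalized vector is uniform. Here is a 2-d spot check (the quadrant-bin test and sample sizes are my own):

```python
import math
import random

random.seed(1)

def random_direction_2d():
    # Normalize a 2-d standard Gaussian; by radial symmetry
    # the resulting angle is uniform on [0, 2*pi).
    x, y = random.gauss(0, 1), random.gauss(0, 1)
    r = math.hypot(x, y)
    return x / r, y / r

n = 4000
quadrants = [0, 0, 0, 0]
for _ in range(n):
    x, y = random_direction_2d()
    quadrants[(x < 0) * 1 + (y < 0) * 2] += 1
fractions = [q / n for q in quadrants]
# Each quadrant should receive roughly a quarter of the samples.
```
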
00:03:35.687 --> 00:03:39.860
The upshot of that was that
rejection sampling would not be
00:03:39.861 --> 00:03:43.750
good, right, because the cube
is much larger than the sphere.
00:03:43.751 --> 00:03:46.270
In fact, that's a general
phenomenon in high dimensions.
00:03:46.271 --> 00:03:48.960
If you want to do uniform
random samples from a lot
00:03:48.961 --> 00:03:52.882
of other things,
which you got here, right,
00:03:52.883 --> 00:03:54.540
you cannot do
rejection sampling.
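To make that concrete, here is a sketch of the rejection-sampling acceptance rate for the unit ball inside its bounding cube [-1, 1]^d, using the standard volume formula (my own calculation, not one from the lecture):

```python
import math

def acceptance_rate(d):
    # Probability that a uniform point of the cube [-1,1]^d
    # lands inside the unit ball: V_d / 2^d.
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return ball / 2 ** d

rates = {d: acceptance_rate(d) for d in (2, 5, 10, 20)}
# d=2 gives pi/4, already by d=20 the rate is around 2.5e-8,
# so rejection sampling from the cube is hopeless in high dimension.
```
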
00:03:55.540 --> 00:04:00.030
In general, in d dimensions, you
cannot enclose even a sphere,
00:04:00.031 --> 00:04:02.640
let alone other objects,
in some enclosing thing
00:04:02.641 --> 00:04:06.840
which is nice and
has not much greater volume, okay?
00:04:06.841 --> 00:04:09.820
In two or three dimensions,
you can put in a square or cube.
00:04:09.821 --> 00:04:12.370
And from which you know how
to draw samples because of
00:04:12.371 --> 00:04:15.500
independent coordinates, but
you cannot do that in higher dimensions.
00:04:15.501 --> 00:04:17.800
Okay, so that's the recap.
00:04:17.801 --> 00:04:21.060
Now, today,
I'm gonna start today's lecture.
00:04:21.061 --> 00:04:24.782
Today, basically,
I'm going to prove
00:04:24.783 --> 00:04:29.934
a concentration result for
sums of random variables.
00:04:29.935 --> 00:04:31.064
It's partly pedagogical.
00:04:31.065 --> 00:04:33.778
I thought a lot about what's
the right result to prove,
00:04:33.779 --> 00:04:36.318
from which I can derive all
the results that we need,
00:04:36.319 --> 00:04:37.496
like Chernoff bounds.
00:04:37.497 --> 00:04:41.084
I'll tell you what they are,
if you're not familiar with Chernoff
00:04:41.085 --> 00:04:43.460
bounds and
other concentration things.
00:04:43.461 --> 00:04:44.653
So I'm going to tell you that.
00:04:44.654 --> 00:04:45.919
But first, let's just start with
00:04:45.920 --> 00:04:47.650
the central limit theorem,
right?
00:04:47.651 --> 00:04:48.840
I'm not gonna prove this.
00:04:48.841 --> 00:04:50.867
I mean, you've probably
all seen a proof.
00:04:50.868 --> 00:04:53.819
But at least,
you should know the statement.
00:04:53.820 --> 00:04:56.164
So if you have independent,
identically distributed random
00:04:56.165 --> 00:04:59.360
variables; well,
they don't have to be identical.
00:04:59.361 --> 00:05:01.030
That's not that important.
00:05:01.031 --> 00:05:03.020
Independent is important for
this.
00:05:03.021 --> 00:05:04.760
The expected value is 0 for
each of them.
00:05:04.761 --> 00:05:06.740
The variance is sigma squared.
00:05:08.670 --> 00:05:10.553
Then the quantity, right?
00:05:10.554 --> 00:05:14.440
1 over root n is
the right normalization.
00:05:14.441 --> 00:05:19.268
That is because if I sum these
n things, the variances add up.
00:05:19.269 --> 00:05:22.211
So the variance of this
is n times sigma squared.
00:05:22.212 --> 00:05:23.620
If we divide it by root n,
00:05:23.621 --> 00:05:26.550
the variance becomes
sigma squared, right?
00:05:26.551 --> 00:05:28.909
So it converges in distribution.
00:05:28.910 --> 00:05:36.039
It converges to the normal with
mean 0 and variance sigma squared, okay?
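The normalization can be checked by simulation; a sketch with uniform(-1, 1) summands, which have mean 0 and variance sigma^2 = 1/3 (my example distribution, not one from the lecture):

```python
import random

random.seed(2)

def normalized_sum(n):
    # (x_1 + ... + x_n) / sqrt(n) for iid uniform(-1, 1),
    # each with mean 0 and variance sigma^2 = 1/3.
    return sum(random.uniform(-1, 1) for _ in range(n)) / n ** 0.5

samples = [normalized_sum(100) for _ in range(3000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# var stays close to sigma^2 = 1/3, independent of n:
# that's why 1/sqrt(n) is the right normalization.
```
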
00:05:36.040 --> 00:05:37.372
This is only a limit statement.
00:05:37.373 --> 00:05:42.180
What we're going to try now is
to get statements that hold for
00:05:42.181 --> 00:05:47.320
every n, because to
use them, we need that.
00:05:47.321 --> 00:05:49.699
Further, this limit
statement has some problems,
00:05:49.700 --> 00:05:50.951
it's not a uniform limit.
00:05:50.952 --> 00:05:55.650
So it's generally not
immediately useful.
00:05:55.651 --> 00:05:57.970
But that's going to be the guide
of what we want to get, right?
00:05:57.971 --> 00:06:01.382
Central limit theorem.
00:06:01.383 --> 00:06:05.905
So one remark, you cannot get
this if only the second moment
00:06:05.906 --> 00:06:07.814
is assumed to be finite.
00:06:07.815 --> 00:06:11.842
Central limit theorem needs only
to assume the second moment is
00:06:11.843 --> 00:06:12.990
finite, right?
00:06:12.991 --> 00:06:16.198
So the third moment
need not be finite.
00:06:16.199 --> 00:06:21.334
So for instance,
if you have the density,
00:06:21.335 --> 00:06:28.191
probability (random variable
x = x) = c over x cubed.
00:06:28.192 --> 00:06:31.328
Sorry, x cubed plus a little bit.
00:06:31.329 --> 00:06:32.845
Okay, that has a finite
00:06:32.846 --> 00:06:34.523
second moment, that's fine.
00:06:34.524 --> 00:06:38.624
Doesn't have a finite
third moment, but
00:06:38.625 --> 00:06:43.555
that's true for
x to the 2.1, let's say.
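Numerically one can watch that dichotomy; in the sketch below I take the "little bit" to be 0.5, i.e. an unnormalized density proportional to x^(-3.5) on the positive integers (my choice of epsilon), so the second-moment sum settles down while the third-moment sum keeps growing:

```python
def partial_moment(k, alpha, upper):
    # Partial sum  sum_{x=1}^{upper} x^k * x^(-alpha)  for an
    # (unnormalized) power-law distribution p(x) ~ x^(-alpha).
    return sum(x ** (k - alpha) for x in range(1, upper + 1))

alpha = 3.5          # "cubed plus a little bit", with epsilon = 0.5 here
m2_a = partial_moment(2, alpha, 10**3)
m2_b = partial_moment(2, alpha, 10**6)   # barely moves: second moment converges
m3_a = partial_moment(3, alpha, 10**3)
m3_b = partial_moment(3, alpha, 10**6)   # keeps growing: third moment diverges
```
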
00:06:47.448 --> 00:06:49.380
Okay, that's good enough for
the central limit theorem.
00:06:49.381 --> 00:06:50.422
It's not good enough for us.
00:06:50.423 --> 00:06:52.261
So to get this going,
00:06:52.262 --> 00:06:56.735
it is not sufficient to
assume only a finite second moment.
00:06:56.736 --> 00:06:59.997
For that particular density,
it's a power law.
00:06:59.998 --> 00:07:02.355
You can see what happens.
00:07:02.356 --> 00:07:06.051
I won't do it here, but you can
do a calculation to see that.
00:07:06.052 --> 00:07:09.695
In fact, you don't get
very nice tail bounds.
00:07:09.696 --> 00:07:12.965
You wouldn't get,
that's the problem.
00:07:12.966 --> 00:07:13.480
I mean, the tail is
there in fact, right?
00:07:13.481 --> 00:07:16.793
So you wouldn't get very
good tail bounds.
00:07:16.794 --> 00:07:21.553
Okay, so we're going to look at,
so most of the lecture today,
00:07:21.554 --> 00:07:24.310
I'll spend in tail bounds.
00:07:24.311 --> 00:07:27.674
And we'll come back to geometry
either at the end of today or
00:07:27.675 --> 00:07:30.919
the next lecture and start
looking at Gaussian annulus.
00:07:30.920 --> 00:07:33.288
But this tail bound is going
to be quite useful, right?
00:07:33.289 --> 00:07:34.780
First, Markov.
00:07:34.781 --> 00:07:35.416
Just a recap,
00:07:35.417 --> 00:07:37.960
you have a non-negative
real-valued random variable.
00:07:37.961 --> 00:07:43.076
Then the probability that X is
greater than t is bounded
00:07:43.077 --> 00:07:49.224
by E(X) over t, which goes down
to 0 as 1 over t, as t grows.
00:07:49.225 --> 00:07:51.632
And a proof is just that
there are no cancellations.
00:07:51.633 --> 00:07:55.004
So the expected value of X is at
least what you get from things
00:07:55.005 --> 00:07:56.831
that are greater than t, okay?
00:07:56.832 --> 00:08:01.003
It would fail if the random
variable could take negative
00:08:01.004 --> 00:08:05.730
values that cancel things out,
but here that cannot happen, right?
00:08:05.731 --> 00:08:07.156
So it holds only because X is non-negative.
00:08:07.157 --> 00:08:10.208
And then Chebyshev now
is just saying that,
00:08:10.209 --> 00:08:13.515
that the deviation from
the mean is bounded by that.
00:08:13.516 --> 00:08:15.380
And the proof is just applying Markov.
00:08:15.381 --> 00:08:18.902
Take the square of this,
which is non-negative, and
00:08:18.903 --> 00:08:19.808
apply Markov.
00:08:19.809 --> 00:08:24.423
And here, the tail bound
goes to 0 as 1 over t
00:08:24.424 --> 00:08:27.780
squared as t goes to infinity. Okay?
00:08:27.781 --> 00:08:29.870
The tail is just 1 over t squared.
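Both recapped inequalities can be spot-checked on samples; a sketch with exponential(1) draws, which have mean 1 and variance 1 (my choice of distribution):

```python
import random

random.seed(3)

xs = [random.expovariate(1.0) for _ in range(20000)]  # E[X] = 1, Var(X) = 1
sample_mean = sum(xs) / len(xs)

t = 4.0
empirical_tail = sum(x >= t for x in xs) / len(xs)  # P(X >= t), estimated
markov_bound = sample_mean / t                      # Markov: E[X] / t
chebyshev_bound = 1.0 / (t - 1.0) ** 2              # Chebyshev: Var / (t - E[X])^2
# The true tail e^{-4} ~ 0.018 sits below both bounds, and
# Chebyshev (1/t^2 decay) beats Markov (1/t decay) here.
```
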
00:08:29.871 --> 00:08:32.140
Now, can you do better?
00:08:32.141 --> 00:08:34.170
Can we get a tail
probability that falls off
00:08:34.171 --> 00:08:36.750
more than quadratically in t?
00:08:36.751 --> 00:08:38.739
Falling off
00:08:39.890 --> 00:08:42.530
I should've said here more
than a quadratic, right?
00:08:42.531 --> 00:08:43.580
Faster than quadratic.
00:08:45.050 --> 00:08:45.960
And in fact,
00:08:45.961 --> 00:08:48.740
it's simple [INAUDIBLE]
higher moments would help.
00:08:48.741 --> 00:08:51.800
So moments are going to
play a key role in this.
00:08:51.801 --> 00:08:54.989
It's often these tail
bounds are not always,
00:08:54.990 --> 00:08:57.530
of course traditionally
they were.
00:08:57.531 --> 00:09:00.180
But not always proved
by taking moments, but
00:09:00.181 --> 00:09:01.100
I'm going to do that here.
00:09:01.101 --> 00:09:04.730
So, higher central moments.
00:09:04.731 --> 00:09:08.210
So if you take the rth moment.
00:09:08.211 --> 00:09:10.470
Think of r as higher than two,
right?
00:09:10.471 --> 00:09:11.380
Take the rth moment.
00:09:11.381 --> 00:09:14.020
So you get 1 over t to the r.
00:09:15.430 --> 00:09:18.950
That is just step one,
Markov's inequality, applied
00:09:18.951 --> 00:09:22.130
to the random variable X - E(X),
absolute value
00:09:22.131 --> 00:09:26.630
to the r [INAUDIBLE] Usually we
want to apply it with r even, so
00:09:26.631 --> 00:09:29.100
we don't have to worry
about the absolute values.
00:09:29.101 --> 00:09:30.670
I'm sorry,
we don't have to worry
00:09:32.170 --> 00:09:35.559
about the absolute value
signs there, right.
00:09:35.560 --> 00:09:36.581
If r is even, that's all you
need to put there.
00:09:36.582 --> 00:09:40.070
But think of it being applied
when r is an even integer.
00:09:40.071 --> 00:09:42.270
Of course,
r doesn't have to be an integer.
00:09:42.271 --> 00:09:45.292
It's true for
any real-valued r.
00:09:45.293 --> 00:09:50.150
So the higher the r is the
greater the fall, rate of fall.
00:09:50.151 --> 00:09:51.100
You get 1 over t to the r.
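For a standard Gaussian this is easy to see with exact moments: E[Z^2] = 1 and E[Z^4] = 3, so the r = 4 Markov bound beats Chebyshev at, say, t = 3. A sketch (the choice of t and of the Gaussian example is mine):

```python
import math

t = 3.0
chebyshev = 1.0 / t ** 2          # r = 2 bound: E[Z^2] / t^2 = 1/9
fourth_moment = 3.0 / t ** 4      # r = 4 bound: E[Z^4] / t^4 = 3/81
true_tail = math.erfc(t / math.sqrt(2))  # exact P(|Z| >= 3) ~ 0.0027
# 1/9 > 3/81 > true tail: higher moments give
# strictly sharper polynomial tail bounds.
```
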
00:09:52.620 --> 00:09:55.320
So, that quantity
doesn't depend on t.
00:09:55.321 --> 00:09:56.540
1 over t to the r does.
00:09:56.541 --> 00:10:00.390
And if r is very high then
you get a very steep fall.
00:10:00.391 --> 00:10:04.230
But for that you need to
bound higher moments, okay.
00:10:04.231 --> 00:10:05.854
You need to bound higher moments.
00:10:05.855 --> 00:10:08.676
So we'll see how to do that.
00:10:08.677 --> 00:10:12.256
So I'm going to state now
what we call in the book,
00:10:12.257 --> 00:10:15.290
the Master Tail Bounds Theorem.
00:10:15.291 --> 00:10:19.730
Only because it's going to
imply tail bounds like Chernoff
00:10:19.731 --> 00:10:23.480
bounds, [INAUDIBLE] exponential and
all the other things.
00:10:23.481 --> 00:10:26.902
Traditionally, there is
a particular proof of Chernoff
00:10:26.903 --> 00:10:27.494
bounds.
00:10:27.495 --> 00:10:29.300
I'll go over that in a minute.
00:10:29.301 --> 00:10:31.970
Some of you are familiar.
00:10:31.971 --> 00:10:34.700
But then that doesn't work for
Gaussians.
00:10:34.701 --> 00:10:36.830
It only works for
bounded random variables.
00:10:36.831 --> 00:10:40.870
So traditionally, every one of
these was proved separately, but
00:10:40.871 --> 00:10:42.650
this is going to
give us all of them.
00:10:42.651 --> 00:10:44.478
It's the reason for
calling it Master.
00:10:44.479 --> 00:10:48.407
So, you have independent
random variables with mean
00:10:48.408 --> 00:10:51.223
0 and variance at most
sigma squared.
00:10:51.224 --> 00:10:52.927
Not identical, right?
00:10:56.997 --> 00:10:59.063
For those that are interested,
00:10:59.064 --> 00:11:02.892
this actually holds even
without assuming independence,
00:11:02.893 --> 00:11:06.680
it holds in a more general
context, it holds for.
00:11:06.681 --> 00:11:08.220
But right now,
we'll only look at,
00:11:08.221 --> 00:11:10.167
here we'll look only
at independent things.
00:11:10.168 --> 00:11:13.904
So I'm going to take some
integer s, which is less than n
00:11:13.905 --> 00:11:17.690
sigma squared over 2,
for whatever reason.
00:11:17.691 --> 00:11:19.280
And here,
the crucial assumption.
00:11:19.281 --> 00:11:23.310
I'm going to assume the rth
moment, if r is odd
00:11:23.311 --> 00:11:25.340
this could be negative so
I put absolute value signs.
00:11:26.700 --> 00:11:29.886
Does not grow faster
than r factorial and
00:11:29.887 --> 00:11:33.460
this factorial is crucial and
I'll tell you in a minute why.
00:11:33.461 --> 00:11:38.069
So this part of the lecture is going to
be fairly technical because I'm
00:11:38.070 --> 00:11:40.730
gonna have to prove this, right?
00:11:40.731 --> 00:11:44.942
But the point is moments only
grow as at most r factorial
00:11:44.943 --> 00:11:47.760
times sigma squared, number one.
00:11:47.761 --> 00:11:52.500
Number two, I don't assume all
moments exist, only certain ones up
00:11:52.501 --> 00:11:53.170
to a limit.
00:11:54.410 --> 00:11:58.471
So it's possible that beyond that,
the moments don't exist
00:11:58.472 --> 00:12:01.948
either and certainly not
necessarily bounded.
00:12:01.949 --> 00:12:06.388
So the rth moment grows
as at most r factorial, and
00:12:06.389 --> 00:12:09.230
the theorem,
this is the theorem.
00:12:09.231 --> 00:12:12.352
And I will prove this, which will
take us a little bit of time.
00:12:12.353 --> 00:12:13.770
We will prove this.
00:12:13.771 --> 00:12:17.320
The tail bound: the probability
that x is greater than any number
00:12:17.321 --> 00:12:22.130
a, provided a is not very big,
very large deviations,
00:12:22.131 --> 00:12:23.620
this is not valid.
00:12:23.621 --> 00:12:27.701
But as long as a is at
most [INAUDIBLE], it's e to
00:12:27.702 --> 00:12:32.410
the minus a squared
times something, okay.
00:12:32.411 --> 00:12:35.170
Let's look at that something
before we prove this.
00:12:35.171 --> 00:12:38.651
So each of these random
variables has variance at most
00:12:38.652 --> 00:12:39.715
sigma squared.
00:12:39.716 --> 00:12:43.120
If I add up n of them,
the variance is n sigma squared,
00:12:43.121 --> 00:12:43.740
at most.
00:12:43.741 --> 00:12:47.130
So an upper bound on the
variance or that should be X1.
00:12:49.400 --> 00:12:51.150
Upper bound on
the variance of X1 + ... + Xn,
00:12:51.151 --> 00:12:53.400
because they're independent.
00:12:53.401 --> 00:12:58.440
So if I did have
the central limit theorem type behavior,
00:12:58.441 --> 00:12:59.660
then this is correct.
00:12:59.661 --> 00:13:02.770
I would have a squared
divided by the variance.
00:13:04.430 --> 00:13:05.830
I'm not worried about constants.
00:13:05.831 --> 00:13:06.660
The 12 is higher
00:13:06.661 --> 00:13:08.930
than what the central limit
theorem would give you.
00:13:08.931 --> 00:13:13.510
That would give 2, but we're getting
it for all n, so within constants,
00:13:13.511 --> 00:13:17.195
it is like the behavior of
the central limit theorem.
00:13:17.196 --> 00:13:21.940
Okay, so in some sense,
the tails behave like this.
00:13:23.610 --> 00:13:25.210
Now, they don't
behave like this for
00:13:25.211 --> 00:13:29.150
very large a, but
that's always going to be true.
00:13:29.151 --> 00:13:33.330
So you should always
revisit this little thing.
00:13:33.331 --> 00:13:38.681
That if I have coin tosses,
if I toss a coin,
00:13:38.682 --> 00:13:44.613
I toss n coins,
the probability of heads being,
00:13:44.614 --> 00:13:49.242
p so this the [INAUDIBLE]
p each of them,
00:13:49.243 --> 00:13:53.580
then the tails of sum of xi,
1 to n,
00:13:53.581 --> 00:13:58.382
are not Gaussian
beyond a certain point
00:14:04.069 --> 00:14:06.375
So no e to the minus c x squared beyond some x.
00:14:09.506 --> 00:14:12.950
So very large deviations you get
only simple exponential bounds.
00:14:12.951 --> 00:14:14.840
That's always going to
be a problem, right?
00:14:14.841 --> 00:14:16.160
Seeing that requires a
calculation.
00:14:16.161 --> 00:14:18.650
Actually doesn't require
too much of a calculation.
00:14:18.651 --> 00:14:22.340
The probability since,
I can write this down,
00:14:22.341 --> 00:14:24.660
the probability that
the sum of the xi is greater than or
00:14:24.661 --> 00:14:28.250
equal to n means all the xi
have to come out to one, right?
00:14:28.251 --> 00:14:29.760
It is p to the n.
00:14:31.520 --> 00:14:34.430
Whereas, had it been sub
Gaussian, you should be getting
00:14:34.431 --> 00:14:36.920
an n squared in the exponent,
which you don't get, right?
00:14:36.921 --> 00:14:37.754
You only get an n.
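That calculation can be written out; a sketch with hypothetical numbers p = 0.1 and n = 20 (my choices, not the lecture's):

```python
import math

p, n = 0.1, 20
exact = p ** n                 # P(all n coins come out 1) = p^n = e^{-n ln(1/p)}
# A sub-Gaussian tail at deviation a = n(1-p), with variance n*p*(1-p),
# would read e^{-a^2 / (2 n sigma^2)} = e^{-n(1-p)/(2p)}:
sub_gaussian_shape = math.exp(-n * (1 - p) / (2 * p))
# exact is vastly larger than sub_gaussian_shape: the true tail is
# fatter, so a Gaussian-type bound cannot hold for very large deviations.
```
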
00:14:43.473 --> 00:14:45.320
Let's see.
00:14:45.321 --> 00:14:48.680
So, this is a Gaussian bound
with a correct variance,
00:14:48.681 --> 00:14:50.210
except for constants.
00:14:50.211 --> 00:14:51.568
You get that.
00:14:51.569 --> 00:14:55.986
Now, why [INAUDIBLE] factorial,
okay?
00:14:55.987 --> 00:14:59.575
I have to write on the board
because I didn't write that.
00:14:59.576 --> 00:15:06.544
So, the intuition for the r factorial.
00:15:06.545 --> 00:15:13.148
So many proofs of tail bounds
start by saying the probability
00:15:13.149 --> 00:15:20.041
that the random variable X is
greater than or equal to T...
00:15:20.042 --> 00:15:24.812
Right, equals the probability
that e to the lambda x,
00:15:26.868 --> 00:15:30.078
is greater than or
equal to e to the lambda t for
00:15:30.079 --> 00:15:34.230
positive lambda, for any
positive lambda, okay.
00:15:34.231 --> 00:15:34.920
This is so
00:15:34.921 --> 00:15:37.120
just because the exponential
function is monotone,
00:15:37.121 --> 00:15:40.710
right, which is less than or
equal to the expected value
00:15:40.711 --> 00:15:45.540
of e to the lambda x,
times e to the minus lambda t.
00:15:47.050 --> 00:15:49.919
This is Markov's inequality,
right?
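Carried to completion for a sum of independent plus-minus-1 variables (a standard illustration, my choice rather than the lecture's own slide), the method looks like this: E[e^{lambda x_i}] = cosh(lambda), and taking lambda = t/n with cosh(lambda) <= e^{lambda^2/2} yields the familiar e^{-t^2/(2n)}-type bound.

```python
import math
import random

random.seed(4)

def chernoff_bound(n, t):
    # P(x_1 + ... + x_n >= t) <= min over lambda of
    #   E[e^{lambda x}] e^{-lambda t} = cosh(lambda)^n e^{-lambda t}.
    # Taking lambda = t/n (near the minimizer) and using
    # cosh(l) <= e^{l^2/2} gives a bound of the form e^{-t^2/(2n)}.
    l = t / n
    return math.cosh(l) ** n * math.exp(-l * t)

n, t, trials = 100, 25, 5000
hits = sum(sum(random.choice((-1, 1)) for _ in range(n)) >= t
           for _ in range(trials))
empirical = hits / trials
bound = chernoff_bound(n, t)
# The bound (about e^{-25^2/200}) dominates the empirical tail.
```
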
00:15:53.036 --> 00:15:54.900
Who has seen a proof
that begins like this?
00:15:57.620 --> 00:16:00.260
Yeah, this is called
the Bernstein method [INAUDIBLE]
00:16:00.261 --> 00:16:01.510
function method.
00:16:01.511 --> 00:16:03.940
It's used for
[INAUDIBLE], right?
00:16:03.941 --> 00:16:07.540
So many proofs start like this.
00:16:07.541 --> 00:16:14.740
Now, with x,
it's sum of x i, 1 to n.
00:16:14.741 --> 00:16:16.120
So that's x.
00:16:17.800 --> 00:16:21.940
And then, this factors.
00:16:21.941 --> 00:16:24.390
That's e to the lambda x,
and x is the sum of the xi,
00:16:24.391 --> 00:16:25.560
it factors, right?
00:16:25.561 --> 00:16:27.544
It's just a product.
00:16:27.545 --> 00:16:32.798
I won't do that proof but
that can only work if
00:16:32.799 --> 00:16:39.870
the exponential moments exist,
okay, this is valid only if
00:16:42.223 --> 00:16:47.190
Expected value of e
to the lambda xi
00:16:47.191 --> 00:16:52.167
exists, otherwise
it's not valid.
00:16:52.168 --> 00:16:55.412
And now if I expand
e to the lambda x,
00:17:02.962 --> 00:17:05.652
I'm sorry, I should have put
that in the slides but anyway,
00:17:05.653 --> 00:17:07.710
so I do a Taylor series of this,
right?
00:17:07.711 --> 00:17:12.443
So then I will get,
I used t already, so
00:17:12.444 --> 00:17:17.750
I get i = 0 to infinity,
lambda to the i over
00:17:17.751 --> 00:17:23.069
i factorial,
expected value of x to the i.
00:17:23.070 --> 00:17:26.937
I just want to explain
the exponential Taylor series.
00:17:26.938 --> 00:17:31.101
So if expected value of x to the
i grew faster than i factorial,
00:17:31.102 --> 00:17:32.392
there's no hope.
00:17:32.393 --> 00:17:34.624
The series does not converge,
right?
00:17:34.625 --> 00:17:37.653
So this method is, I mean, it
may work for other reasons, but
00:17:37.654 --> 00:17:39.766
this method is not
likely to work, right?
00:17:39.767 --> 00:17:42.501
There may be cancellation, but
it's not absolutely convergent.
00:17:42.502 --> 00:17:47.046
So if this grows faster than,
so if,
00:17:48.967 --> 00:17:55.588
the expected value of x to the i
grows faster than i factorial,
00:17:55.589 --> 00:18:01.871
the series does not
absolutely converge.
00:18:01.872 --> 00:18:04.202
No absolute convergence.
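A concrete instance of this boundary, using an exponential(1) variable as my example: its moments are E[x^i] = i!, so the series terms lambda^i / i! * E[x^i] reduce to lambda^i, a geometric series that converges only for lambda < 1 (matching the MGF 1/(1 - lambda) existing only there):

```python
def mgf_partial_sum(lam, terms):
    # Partial sums of  sum_i lam^i / i! * E[x^i]  with E[x^i] = i!
    # (exponential(1) moments), i.e. the geometric series sum_i lam^i.
    return sum(lam ** i for i in range(terms))

converged = mgf_partial_sum(0.5, 200)   # settles at 1/(1 - 0.5) = 2
diverging = mgf_partial_sum(1.5, 200)   # grows without bound
```
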
00:18:05.920 --> 00:18:07.402
So then the Bernstein method,
00:18:07.403 --> 00:18:09.240
this is called
the Bernstein method.
00:18:09.241 --> 00:18:11.531
Again, I won't
follow this method.
00:18:11.532 --> 00:18:15.359
I'm gonna tell you that it's
better to abandon that and go
00:18:15.360 --> 00:18:19.490
with moments, that's
the spirit of this theorem.
00:18:19.491 --> 00:18:21.560
But if you want to
use Bernstein method,
00:18:21.561 --> 00:18:25.930
it better be true that it's
bounded by i factorial.
00:18:25.931 --> 00:18:30.597
Actually, it should be bounded
by something slightly less than i
00:18:30.598 --> 00:18:31.635
factorial.
00:18:31.636 --> 00:18:34.410
>> [INAUDIBLE]
>> Yeah, so yes,
00:18:34.411 --> 00:18:35.860
for lambda less than one.
00:18:35.861 --> 00:18:38.176
So there is some play
with the lambda, yes.
00:18:41.069 --> 00:18:44.824
But, okay, so there is
some play with the lambda.
00:18:44.825 --> 00:18:47.632
Provided, I mean,
if it does actually strictly
00:18:47.633 --> 00:18:49.320
factorial that goes-
>> [INAUDIBLE]
00:18:49.321 --> 00:18:53.732
>> So if it is i to something
00:18:53.733 --> 00:18:57.269
greater than one.
00:19:00.580 --> 00:19:02.640
Then you will be
in some trouble.
00:19:02.641 --> 00:19:07.295
So, yeah, if it is potentially
more than i factorial then
00:19:07.296 --> 00:19:09.200
you will be in trouble.
00:19:09.201 --> 00:19:13.926
Okay, so the factorial is,
agrees with that,
00:19:13.927 --> 00:19:18.063
but most importantly,
that statement,
00:19:18.064 --> 00:19:23.262
this method only works if for
every i however large,
00:19:23.263 --> 00:19:28.240
it is bounded by something
like i factorial.
00:19:28.241 --> 00:19:30.990
Whereas here, we assume
only finite moments, right?
00:19:30.991 --> 00:19:32.130
So this is very crucial.
00:19:32.131 --> 00:19:33.480
So typically,
00:19:33.481 --> 00:19:38.480
in this course we'll encounter what
are called long-tailed densities.
00:19:38.481 --> 00:19:42.160
Long-tailed just means the tail
is not dropping off exponentially.
00:19:42.161 --> 00:19:44.210
So for
instance if you had something,
00:19:44.211 --> 00:19:47.320
everybody knows what a power law
is; a power law is just a
00:19:47.321 --> 00:19:49.490
long-tailed kind
of density.
00:19:49.491 --> 00:19:51.970
If you had a power law,
this is still valid.
00:19:51.971 --> 00:19:53.560
You have finite moments, but
00:19:53.561 --> 00:19:55.270
you will not have
very large moments.
00:19:55.271 --> 00:19:57.410
So this is useful even when
only finitely many moments exist.
00:19:58.640 --> 00:20:01.390
So, I'm gonna take, it's gonna
take some time to prove this,
00:20:01.391 --> 00:20:02.670
four or five slides.
00:20:02.671 --> 00:20:04.180
We're gonna go with the proof.
00:20:04.181 --> 00:20:10.179
It's elementary,
there's nothing phenomenally
00:20:10.180 --> 00:20:16.041
sophisticated going on,
there are common threads.
00:20:22.992 --> 00:20:27.677
So the idea is to prove an upper
bound on the rth moment and
00:20:27.678 --> 00:20:29.880
then use that for the sum.
00:20:31.310 --> 00:20:32.640
Then use Markov's inequality.
00:20:33.860 --> 00:20:35.250
So the sum is this,
00:20:35.251 --> 00:20:38.720
To expand this, you do
the multinomial expansion.
00:20:38.721 --> 00:20:41.100
We start with a multinomial
expansion, so we get that.
00:20:41.101 --> 00:20:46.834
So this is just a sum over all
partitions of r into r1 to rn,
00:20:46.835 --> 00:20:48.320
right?
00:20:48.321 --> 00:20:52.530
The exponents of x, I mean it's
how many times, when you expand
00:20:52.531 --> 00:20:58.380
this, xi occurs, just in the
multinomial expansion, right?
00:20:58.381 --> 00:21:01.520
And that is that, and
00:21:01.521 --> 00:21:04.740
I now take the expected
value of both sides.
00:21:04.741 --> 00:21:08.640
I want the expected value
of this whole thing.
00:21:08.641 --> 00:21:13.120
And I wanna take the expected
value of this, okay?
00:21:13.121 --> 00:21:15.050
And something nice happens.
00:21:15.051 --> 00:21:17.700
So let's try to guess what
nice thing happens and
00:21:17.701 --> 00:21:18.980
then we'll see it.
00:21:20.020 --> 00:21:22.910
Lot of these terms
have zero expectation.
00:21:22.911 --> 00:21:24.775
So which terms have
zero expectation?
00:21:24.776 --> 00:21:30.528
>> [INAUDIBLE]
>> I'm sorry.
00:21:30.529 --> 00:21:35.322
Yeah, whenever r i is one,
right?
00:21:35.323 --> 00:21:38.348
So anytime an r i is one,
independence says that
00:21:38.349 --> 00:21:41.219
the expected value of
each thing splits up,
00:21:41.220 --> 00:21:44.530
I'm writing all of
this down in a minute.
00:21:44.531 --> 00:21:48.230
But if r i is one the expected
value of x i is zero,
00:21:48.231 --> 00:21:50.430
because we assumed that, right?
00:21:50.431 --> 00:21:54.804
So where r i range over
all nonnegative integers
00:21:54.805 --> 00:21:56.122
summing to r.
00:21:56.123 --> 00:22:00.637
By independence, so first we
have to peel off the terms
00:22:00.638 --> 00:22:04.870
independence factors
it like that, okay?
00:22:04.871 --> 00:22:07.791
And then we get,
if in any term, ri = 1,
00:22:07.792 --> 00:22:10.390
the term is zero
since that's zero.
00:22:10.391 --> 00:22:17.338
So r i is either zero or
at least two, okay?
00:22:17.339 --> 00:22:19.688
By the way, in all these proofs,
that's crucial.
00:22:19.689 --> 00:22:22.130
If r i equals one
terms were there,
00:22:22.131 --> 00:22:25.430
then you would not get
this kind of result.
00:22:25.431 --> 00:22:26.770
You would not get
Gaussian behavior.
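That vanishing can be checked by brute force on a tiny case; a sketch with three fair plus-minus-1 variables (my own toy example, where independence makes E[x1^{r1} x2^{r2} x3^{r3}] factor and any exponent of 1 kills the term):

```python
from itertools import product

def expect(f):
    # Exact expectation over all sign patterns of three fair +/-1 variables.
    pts = list(product((-1, 1), repeat=3))
    return sum(f(x) for x in pts) / len(pts)

# E[x1 * x2^2] = E[x1] * E[x2^2] = 0: an exponent of 1 zeroes the term.
mixed = expect(lambda x: x[0] * x[1] ** 2)
# E[x1^2 * x2^2] = 1: exponents that are 0 or >= 2 survive.
surviving = expect(lambda x: x[0] ** 2 * x[1] ** 2)
# E[(x1+x2+x3)^4] = 21, which the surviving multinomial terms
# (three (4,0,0) terms plus three 6 * (2,2,0) terms) also give.
fourth = expect(lambda x: sum(x) ** 4)
```
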
00:22:29.190 --> 00:22:32.488
Okay, so, now we have to,
we are going to do some common
00:22:32.489 --> 00:22:35.592
[INAUDIBLE] to figure out
how many terms there are.
00:22:35.593 --> 00:22:38.341
So let's consider all of
the terms where each r i
00:22:38.342 --> 00:22:40.086
is either zero or two or more.
00:22:40.087 --> 00:22:42.881
Cuz we are only
interested in those,
00:22:42.882 --> 00:22:46.680
we're not interested in
the terms that r i is one.
00:22:46.681 --> 00:22:51.561
Every r i has to be two or
more if it's nonzero.
00:22:51.562 --> 00:22:54.768
So I assumed this r factorial.
00:22:54.769 --> 00:22:59.724
So the expected value of x to the r,
this is from
00:22:59.725 --> 00:23:04.390
the last slide,
there was, sorry.
00:23:04.391 --> 00:23:08.200
So, if I raise to the r i power,
I get only r i factorial, this was
00:23:08.201 --> 00:23:13.210
the hypothesis, therefore, that
cancels out and you get this.
00:23:13.211 --> 00:23:15.798
So, this actually j
is in the subscript.
00:23:15.799 --> 00:23:21.209
And you get sigma to the So,
00:23:21.210 --> 00:23:24.898
this is not a minus sign,
00:23:24.899 --> 00:23:29.655
okay, so maybe I should write.
00:23:38.101 --> 00:23:41.660
It's sum over something,
00:23:41.661 --> 00:23:48.167
sigma to twice the number
of nonzero r i's, all right.
00:23:50.616 --> 00:23:53.043
So that's because every time
r i was nonzero we assumed
00:23:53.044 --> 00:23:56.530
the bound is at most this,
see, this is what I'm applying.
00:23:56.531 --> 00:23:59.570
You get only a sigma squared
every time r i is nonzero.
00:23:59.571 --> 00:24:03.340
It's important,
by the way that sigma squared
00:24:03.341 --> 00:24:05.270
doesn't go up as
sigma to the r i.
00:24:05.271 --> 00:24:10.720
Some thought is required
to make sure that all
00:24:10.721 --> 00:24:14.870
of those details are right, but
you can think about that later, right?
00:24:14.871 --> 00:24:18.670
So, collect terms which
have t non-zeroes
00:24:18.671 --> 00:24:21.190
because they have
the same power of sigma.
00:24:21.191 --> 00:24:26.112
And call them this, this is the
set of n-tuples that have
00:24:26.113 --> 00:24:29.846
t non-zeroes,
this is in our commentary.
00:24:32.093 --> 00:24:35.460
And then you'll get the expected
value of x to the r is this.
00:24:35.461 --> 00:24:36.940
This is what I'm trying to bound.
00:24:36.941 --> 00:24:41.878
So everything that has t
non-zeroes contributes sigma
00:24:41.879 --> 00:24:43.695
to the two t, right?
00:24:43.696 --> 00:24:46.738
So, and there's an r
factorial on the outside, so
00:24:46.739 --> 00:24:48.189
we have to deal with all that.
00:24:51.231 --> 00:24:54.828
Again, as I said,
this lecture will be technical.
00:24:54.829 --> 00:24:58.000
I have to,
there are maybe two more slides.
00:24:58.001 --> 00:25:02.398
So we crucially have to bound
00:25:02.399 --> 00:25:07.104
the number of n-tuples r 1 through
r n adding up to r with
00:25:07.105 --> 00:25:10.020
at most t of
them nonzero.
00:25:10.021 --> 00:25:15.150
First of all, which t?
You choose first
00:25:15.151 --> 00:25:19.580
an n choose t subset; that is
the set of t indices you've chosen to
00:25:19.581 --> 00:25:24.010
put nonzero r i's on. Once you
fix a subset as a set of t
00:25:24.011 --> 00:25:29.010
indices which are nonzero,
each of those values is nonzero.
00:25:29.011 --> 00:25:30.580
So it must be at least 2.
00:25:30.581 --> 00:25:38.776
So we allocate 2 from our budget
of r to each of these r i's.
00:25:38.777 --> 00:25:41.590
And then the rest is allocated
arbitrarily.
00:25:41.591 --> 00:25:43.946
Again, this is standard, right?
00:25:43.947 --> 00:25:44.984
So what is this?
00:25:44.985 --> 00:25:46.450
So let's go with this.
00:25:46.451 --> 00:25:49.970
I already have to give
each one at least 2.
00:25:49.971 --> 00:25:51.380
So I give 2.
00:25:51.381 --> 00:25:53.500
So that's r - 2t, and
00:25:53.501 --> 00:25:58.220
I must partition r - 2t
into t parts; I can arbitrarily
00:25:58.221 --> 00:26:02.118
allot any number to each of the
parts, and that's what I get.
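The counting argument just described, choose the t nonzero positions, give each a 2 from the budget, distribute the rest by stars and bars, can be checked against brute force. This is a minimal Python sketch, not from the lecture; the function names are hypothetical.

```python
from itertools import product
from math import comb

def count_tuples_brute(n, r, t):
    """Count n-tuples of nonnegative integers summing to r with
    exactly t nonzero entries, each nonzero entry at least 2."""
    count = 0
    for tup in product(range(r + 1), repeat=n):
        if sum(tup) != r:
            continue
        nonzero = [v for v in tup if v > 0]
        if len(nonzero) == t and all(v >= 2 for v in nonzero):
            count += 1
    return count

def count_tuples_formula(n, r, t):
    # choose the t nonzero positions, give each a 2 from the
    # budget of r, then distribute the remaining r - 2t freely
    # among the t parts (stars and bars)
    return comb(n, t) * comb(r - 2 * t + t - 1, t - 1)

for n, r, t in [(4, 6, 2), (5, 8, 3), (3, 6, 1)]:
    assert count_tuples_brute(n, r, t) == count_tuples_formula(n, r, t)
```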
00:26:02.119 --> 00:26:06.960
Okay, that's the number
of these tuples.
00:26:06.961 --> 00:26:09.560
There are no permutations here
because these are not labeled.
00:26:09.561 --> 00:26:12.710
You have to think at home,
make sure the proof is correct,
00:26:12.711 --> 00:26:14.840
I've been brief here,
but I'll be fair.
00:26:14.841 --> 00:26:15.910
Check that.
So
00:26:15.911 --> 00:26:18.560
you get that J of t
is at most this much.
00:26:18.561 --> 00:26:20.870
It's not a precise estimate, but
it's enough.
00:26:20.871 --> 00:26:28.280
Putting this together you will
get that, you get this for that.
00:26:28.281 --> 00:26:33.360
Now, we prove, so
t only goes up to r over 2,
00:26:33.361 --> 00:26:35.460
important to note it
doesn't go up to r.
00:26:35.461 --> 00:26:37.780
It goes only up to
r over 2 because
00:26:37.781 --> 00:26:40.950
every r i which is present is
present at level at least two.
00:26:40.951 --> 00:26:41.820
At level two, right.
00:26:41.821 --> 00:26:44.300
So they're all at level two; we
00:26:44.301 --> 00:26:47.690
will prove that t = r over
2 is where the max of this is.
00:26:47.691 --> 00:26:51.740
So this more or
less is going to look like what
00:26:52.750 --> 00:26:56.560
you get if I put t equal to
r over 2 there.
00:26:56.561 --> 00:27:01.101
So we get (n sigma squared)
to the r over 2,
00:27:02.580 --> 00:27:09.470
divided by r over 2 factorial,
times 2 to the r over 2.
00:27:09.471 --> 00:27:11.380
Okay, all I'm doing is
putting t equals r over 2.
00:27:11.381 --> 00:27:16.040
Now, if you use Stirling and
other things, this is
00:27:16.041 --> 00:27:21.028
r to the r over 2, so we get
(r n sigma squared) to the r over 2.
00:27:21.029 --> 00:27:22.172
Up to another factor of 2.
00:27:24.210 --> 00:27:29.130
We'll do this manipulation
the next slide.
00:27:29.131 --> 00:27:31.610
But the moral of the story
is very important.
00:27:31.611 --> 00:27:36.860
The moral is, the rth moment of
00:27:36.861 --> 00:27:42.860
the sum grows only
as the r over 2 power.
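This moral can be sanity-checked numerically: for a sum of n independent plus-or-minus-1 variables (so sigma squared is 1), the exact rth moment stays below (r n) to the r over 2, not n to the r. A small Python sketch, not part of the lecture:

```python
from math import comb

def rademacher_sum_moment(n, r):
    """Exact E[S^r] for S = sum of n independent +/-1 variables,
    computed from the binomial distribution of the number of +1's."""
    total = 0.0
    for k in range(n + 1):          # k plus-ones -> S = 2k - n
        p = comb(n, k) / 2 ** n
        total += p * (2 * k - n) ** r
    return total

# sigma^2 = 1 here; the r-th moment grows with the r/2 power,
# like (r*n)^(r/2), rather than the r-th power
for n in [10, 50]:
    for r in [2, 4, 6, 8]:
        assert rademacher_sum_moment(n, r) <= (r * n) ** (r / 2)
```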
00:27:50.211 --> 00:27:55.040
Okay, now it requires some
thought and I'll do it in
00:27:55.041 --> 00:27:58.520
the next slide, I'll do
the manipulation the next slide.
00:27:58.521 --> 00:28:01.880
But that is essential for
sub-Gaussian behavior, so
00:28:01.881 --> 00:28:05.220
this is essential for
e to the minus x squared tails.
00:28:06.470 --> 00:28:09.920
That square there comes
from the r over 2.
00:28:09.921 --> 00:28:13.550
If I didn't have r over 2,
if I just had r,
00:28:13.551 --> 00:28:15.010
which is what I
would get trivially.
00:28:15.011 --> 00:28:16.070
So remember,
00:28:16.071 --> 00:28:20.140
I'm bounding the expected
value of X to the rth power.
00:28:21.210 --> 00:28:24.360
So there are n to the r terms.
00:28:24.361 --> 00:28:29.740
n to the r terms in all. So,
00:28:29.741 --> 00:28:32.780
I might be expecting an r there
in the exponent; if I did,
00:28:32.781 --> 00:28:37.830
I would not get Gaussian behavior, okay,
again, it requires a little
00:28:37.831 --> 00:28:41.090
bit of manipulation at home to
make sure that you see that.
00:28:41.091 --> 00:28:42.970
But, I won't do
the manipulation, but
00:28:42.971 --> 00:28:45.100
I tell you that that
r over 2 is crucial.
00:28:47.200 --> 00:28:51.050
So now I want to prove that
00:28:51.051 --> 00:28:56.050
t = r over 2 is the max, okay?
00:28:56.051 --> 00:28:57.350
So that's the [INAUDIBLE].
00:28:57.351 --> 00:29:04.150
For t less than r over 2,
it's easy to see.
00:29:05.160 --> 00:29:08.530
That h t is
an increasing function.
00:29:08.531 --> 00:29:09.990
It's increasing well enough,
00:29:09.991 --> 00:29:13.410
fast enough, by a factor of
at least two each time, and
00:29:13.411 --> 00:29:17.800
therefore the sum is dominated
by what happens at r over two.
00:29:17.801 --> 00:29:21.120
Then as you go down from r over
two, it's falling drastically,
00:29:21.121 --> 00:29:23.630
so that's only a constant.
00:29:23.631 --> 00:29:25.640
At the end of the day you get this.
00:29:25.641 --> 00:29:26.920
Which is what I
wrote down there.
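The geometric-series step just described, each term at least doubles, so the whole sum is at most twice its last term, is this small observation, sketched in Python (not from the lecture):

```python
def sum_bounded_by_last(h):
    """If each term is at least twice the previous one, the sum of
    the sequence is at most twice the last term (geometric bound)."""
    assert all(h[i + 1] >= 2 * h[i] for i in range(len(h) - 1))
    return sum(h) <= 2 * h[-1]

# an example sequence growing by a factor of at least two each step
assert sum_bounded_by_last([1, 2, 5, 12, 30])
```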
00:29:31.770 --> 00:29:34.320
So what have I got up to this point?
00:29:34.321 --> 00:29:38.730
I've got the most crucial
thing I want from all of this
00:29:38.731 --> 00:29:42.110
calculation is that
the exponent is r over 2.
00:29:42.111 --> 00:29:42.770
Not r, okay?
00:29:44.120 --> 00:29:46.508
And that's what is going to
give us [INAUDIBLE] of Gaussian
00:29:46.509 --> 00:29:54.438
behavior. Then you just apply
the Markov inequality to get this.
00:29:54.439 --> 00:29:58.488
Again, you have an r
over two at the top.
00:29:58.489 --> 00:30:02.943
Now I want e to
the minus a squared, but
00:30:02.944 --> 00:30:06.320
I've got to pick the optimal r. And
it turns out that's another
standard calculation.
00:30:10.330 --> 00:30:13.620
This moment bound applies
00:30:13.621 --> 00:30:15.930
to every moment r less than or
equal to s.
00:30:15.931 --> 00:30:20.410
If you remember the proof
of Chernoff, then those
00:30:20.411 --> 00:30:24.210
bounds apply for every finite
r and you choose the best one.
00:30:24.211 --> 00:30:25.890
Okay, that's all we
are going to do.
00:30:25.891 --> 00:30:29.360
Except we have
a restriction: r cannot be very high.
00:30:29.361 --> 00:30:36.660
You cannot get arbitrary r. Okay,
and then we do some calculus and
00:30:38.107 --> 00:30:43.200
find the best r. So
this function
00:30:43.201 --> 00:30:47.330
is minimized at a certain r,
which you can find by calculus.
00:30:47.331 --> 00:30:49.350
You take the log of that and
minimize that.
00:30:50.580 --> 00:30:51.530
You find it by calculus.
00:30:51.531 --> 00:30:53.110
On the next slide
we'll see that.
00:30:53.111 --> 00:30:55.540
And then r is basically
taken to be that value.
00:30:56.590 --> 00:30:59.210
And that value turns out to
be less than x, so it works.
00:31:01.630 --> 00:31:06.090
So this is a statement
that says all of that, so
00:31:06.091 --> 00:31:10.730
we have a function that looks
like x to the x over 2 and
00:31:10.731 --> 00:31:13.660
you take the log and minimize;
00:31:15.420 --> 00:31:19.902
then you see the minimizing
point by just differentiating
00:31:19.903 --> 00:31:25.020
[INAUDIBLE]; it is minimized
at that point, so [INAUDIBLE]. And
00:31:28.070 --> 00:31:31.950
then at that point you plug
the minimum value in and
00:31:31.951 --> 00:31:36.490
you get this, which is at most e to the
minus a squared over 12 n sigma squared,
00:31:36.491 --> 00:31:37.360
that proves the theorem.
00:31:39.360 --> 00:31:41.440
Yeah, do you have a question,
no.
00:31:43.260 --> 00:31:44.780
So you do have to
check all the details,
00:31:44.781 --> 00:31:47.720
I mean there are several
calculation details that you
00:31:47.721 --> 00:31:50.480
have to check, but at the end
of the day we end up with that.
00:31:54.060 --> 00:31:57.530
Okay, so that proves
the Master Tail Bound Theorem,
00:31:57.531 --> 00:32:01.350
I'm not going to give you
all the implications,
00:32:01.351 --> 00:32:03.900
I'll give you two implications
of this which we'll use.
00:32:03.901 --> 00:32:05.190
One is Chernoff.
00:32:05.191 --> 00:32:10.880
The other one, which comes next
is the Gaussian annulus theorem
00:32:10.881 --> 00:32:15.380
which we'll use for, I think
I'll be able to prove this and
00:32:15.381 --> 00:32:17.390
the Gaussian annulus
theorem today.
00:32:17.391 --> 00:32:19.720
We'll use the Gaussian
annulus theorem for
00:32:19.721 --> 00:32:21.340
the random projection theorem.
00:32:21.341 --> 00:32:23.070
We'll see that next week.
00:32:23.071 --> 00:32:26.410
So we will use that.
00:32:26.411 --> 00:32:27.650
Chernoff bounds.
00:32:27.651 --> 00:32:30.190
How many of you know
Chernoff bounds?
00:32:30.191 --> 00:32:31.870
Who has seen Chernoff bounds?
00:32:31.871 --> 00:32:33.050
Many of you, okay.
00:32:33.051 --> 00:32:34.160
So, let me write it down.
00:32:34.161 --> 00:32:38.150
So you have independent
Bernoulli random variables.
00:32:38.151 --> 00:32:41.740
Here there is one small
restriction for applying this
00:32:41.741 --> 00:32:48.128
directly: p, the probability of
heads, is not very close to 1.
00:32:48.129 --> 00:32:53.250
Then the bound says
that the probability
00:32:53.251 --> 00:32:57.600
that it deviates by more than
n times p is this thing.
00:32:59.820 --> 00:33:03.864
Again, you cannot do very large
deviations; only c at
00:33:03.865 --> 00:33:06.317
most 1 gives this value, right?
00:33:06.318 --> 00:33:11.555
So the c behavior stops for
a certain c, so
00:33:11.556 --> 00:33:15.260
beyond that c is redundant, okay?
00:33:15.261 --> 00:33:18.815
And you get it simply by
noting the moments are good. So
00:33:18.816 --> 00:33:21.753
simply by applying
the master tail bound theorem
00:33:21.754 --> 00:33:23.080
you would get that.
00:33:23.081 --> 00:33:28.008
So you first center it so
that the expected value is 0;
00:33:28.009 --> 00:33:31.770
the variance we all
know is p times 1- p.
00:33:32.930 --> 00:33:36.010
But the moments are also bounded
by basically that the moments
00:33:36.011 --> 00:33:42.010
are also at most p times
1 - p. This actually,
00:33:42.011 --> 00:33:45.150
I wrote down a proof, it's
actually simple to see, right?
00:33:45.151 --> 00:33:49.793
Beyond the second moment,
|x i| is at most 1 anyway, so
00:33:49.794 --> 00:33:52.298
that's going to be true.
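The claim that the centered Bernoulli moments beyond the second are still at most the variance p(1 - p) can be verified exactly; a Python sketch, assuming nothing beyond the definition of the moments:

```python
def centered_bernoulli_moment(p, s):
    """Exact E[(X - p)^s] for X ~ Bernoulli(p): with probability p
    the variable X - p equals 1 - p, otherwise it equals -p."""
    return p * (1 - p) ** s + (1 - p) * (-p) ** s

# since |X - p| <= 1, every absolute moment beyond the second is
# at most the variance p(1 - p), the hypothesis the master tail
# bound theorem needs
for p in [0.1, 0.3, 0.5, 0.9]:
    var = p * (1 - p)
    for s in range(2, 12):
        assert abs(centered_bernoulli_moment(p, s)) <= var + 1e-12
```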
00:33:52.299 --> 00:33:54.770
You need bounds on moments for
our theorem, right.
00:33:54.771 --> 00:33:59.587
So this is all going to be true,
so apply this theorem
00:33:59.588 --> 00:34:03.865
with the appropriate a, depending on
how large a deviation you have.
00:34:03.866 --> 00:34:07.840
Okay, I think that's all I've
said here, by the way, there
00:34:07.841 --> 00:34:12.060
are more details given in the
book, right, so, than the slide.
00:34:16.230 --> 00:34:19.468
So the usual proof of Chernoff
bounds, again, is by going
00:34:19.469 --> 00:34:22.775
through the Bernstein method,
Which we erased, is going
00:34:22.776 --> 00:34:26.630
through the exponential moments
right, that's what we do.
00:34:26.631 --> 00:34:29.687
They do apply in this case,
exponential moments do exist
00:34:29.688 --> 00:34:33.480
because these are finite random
variables, they never exceed 1.
00:34:33.481 --> 00:34:35.450
So the exponential
moment certainly exists.
00:34:38.506 --> 00:34:40.898
Okay, now the next thing
I'm going to do is
00:34:40.899 --> 00:34:43.350
the Gaussian Annulus theorem.
00:34:43.351 --> 00:34:47.441
I'm going to state that first,
the exponential moments also
00:34:47.442 --> 00:34:51.543
exist here, but I'm going to give
you a proof just based on the moments.
00:34:51.544 --> 00:34:54.157
But the random variables
are not bounded right,
00:34:54.158 --> 00:34:56.590
they're Gaussians they
can go to infinity.
00:35:00.485 --> 00:35:05.729
So some notation you have
a d-dimensional spherical
00:35:05.730 --> 00:35:10.856
Gaussian. The
mean is now
00:35:10.857 --> 00:35:16.346
0 because it's centered at
the origin, in d dimensions.
00:35:16.347 --> 00:35:17.765
This is the identity matrix,
00:35:17.766 --> 00:35:20.980
it's the variance,
covariance matrix, right?
00:35:20.981 --> 00:35:24.739
So for
the Gaussian high dimensions,
00:35:24.740 --> 00:35:30.739
you describe it by the mean and
the variance-covariance matrix.
00:35:30.740 --> 00:35:36.696
So I is the variance-covariance
matrix.
00:35:39.143 --> 00:35:42.024
In statistics,
it's called sigma, right?
00:35:42.025 --> 00:35:44.807
Statisticians are notorious for
their notations, so
00:35:44.808 --> 00:35:45.930
they use Sigma for it.
00:35:45.931 --> 00:35:48.690
It's a terrible thing to do,
but they use sigma for
00:35:48.691 --> 00:35:50.190
the variance-covariance matrix.
00:35:50.191 --> 00:35:53.388
So it could be more generally,
and
00:35:53.389 --> 00:35:58.349
it may not be identity, but
in this case, identity,
00:35:58.350 --> 00:36:03.999
which means that level curves
of the density look like this.
00:36:04.000 --> 00:36:09.232
I mean, so I should say the
density corresponds to normal 0,
00:36:09.233 --> 00:36:11.656
I is e to the minus length of x squared over 2.
00:36:11.657 --> 00:36:16.310
There's a normalizing constant,
2 pi to the d over 2.
00:36:19.866 --> 00:36:25.910
So the density is maximum
at 0 and falls off, okay?
00:36:25.911 --> 00:36:29.550
And the assertion is, for any
beta which is less than root d,
00:36:29.551 --> 00:36:33.000
at most so
much of the probability mass,
00:36:33.001 --> 00:36:40.542
e to the minus c beta squared,
lies outside the annulus around root d.
00:36:40.543 --> 00:36:43.506
Most of the mass lies
between root d minus beta and
00:36:43.507 --> 00:36:45.330
root d plus beta.
00:36:45.331 --> 00:36:49.319
So maybe I should put up,
excuse me,
00:36:49.320 --> 00:36:55.253
I should tell you one more
line and then draw a picture.
00:36:55.254 --> 00:36:59.615
So intuition, so the expected
value of the length squared
00:36:59.616 --> 00:37:01.669
is just the sum of d things.
00:37:01.670 --> 00:37:06.200
Each has expectation 1, being
Normal(0,1) squared, so that's d.
00:37:06.201 --> 00:37:11.451
So the mean squared distance is
d. The radius, that's called the radius,
00:37:11.452 --> 00:37:15.214
the square root of this,
is root d.
00:37:15.215 --> 00:37:19.782
And it says that this should
be concentrated about its
00:37:19.783 --> 00:37:24.160
expectation, that is roughly
what this is saying.
00:37:34.392 --> 00:37:35.840
Let's see what this is saying.
00:37:35.841 --> 00:37:40.169
So it's saying that with high
probability the length is
00:37:40.170 --> 00:37:44.113
between root d + beta and
root d - beta.
00:37:44.114 --> 00:37:50.610
So |x| squared [INAUDIBLE] is
d plus or minus order root d.
00:37:54.351 --> 00:37:58.517
If I expand this, if I square
this up, the second term is more
00:37:58.518 --> 00:38:00.900
significant, it is
beta times root d.
00:38:00.901 --> 00:38:04.480
And then you get a beta squared,
which is a smaller order term.
00:38:04.481 --> 00:38:05.894
So this is more significant.
00:38:05.895 --> 00:38:08.340
So, if I draw a curve.
00:38:13.211 --> 00:38:20.126
So this is the distance, and
the annulus width is order one.
00:38:20.127 --> 00:38:24.974
If I instead have normalized it
so that this is distance one for
00:38:24.975 --> 00:38:29.180
the Gaussian, the annulus
has width one over root d
00:38:33.674 --> 00:38:38.510
So, this is different
from the sphere, right?
00:38:38.511 --> 00:38:43.191
The sphere was concentrated in
a smaller annulus, one over d.
00:38:43.192 --> 00:38:46.722
This is concentrated only in
an annulus of one over root d.
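The annulus behavior shows up in a quick simulation: lengths of d-dimensional spherical Gaussian samples deviate from root d only by a constant-order amount, even though the radius itself is root d. A Python sketch with a fixed seed, not part of the lecture:

```python
import math
import random

random.seed(0)
d = 2500
trials = 200
# sample points from a spherical Gaussian in d dimensions and record
# how far their length deviates from sqrt(d) = 50
deviations = []
for _ in range(trials):
    x = [random.gauss(0.0, 1.0) for _ in range(d)]
    length = math.sqrt(sum(v * v for v in x))
    deviations.append(abs(length - math.sqrt(d)))

# the annulus theorem predicts constant-order deviations,
# tiny compared to the radius sqrt(d)
assert max(deviations) < 5.0
assert sum(deviations) / trials < 1.0
```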
00:38:52.682 --> 00:38:57.023
Well of course it is true that
the sum of independent random
00:38:57.024 --> 00:39:01.454
variables, so you expect it
to be concentrated about it's
00:39:01.455 --> 00:39:06.160
expectation, and that is what
we will prove with our theorem.
00:39:11.251 --> 00:39:13.270
This all is going
faster than I expected.
00:39:13.271 --> 00:39:16.718
We are already at
the 12th slide out of 13.
00:39:16.719 --> 00:39:17.656
So any questions?
00:39:17.657 --> 00:39:20.777
Or anybody want something?
00:39:20.778 --> 00:39:23.248
Let's take a break and
drink water right now.
00:39:27.014 --> 00:39:28.152
Everything okay so far?
00:39:31.300 --> 00:39:35.820
Okay, so I'll go on.
00:39:35.821 --> 00:39:36.409
So what are we doing?
00:39:36.410 --> 00:39:40.100
So we are just doing this:
r is the radius.
00:39:40.101 --> 00:39:45.329
We want to show that r
minus root d is at most beta,
00:39:45.330 --> 00:39:49.990
and if it is not, then you
also have this, right? And
00:39:49.991 --> 00:39:55.030
that's greater than or
equal to beta times root d.
00:39:55.031 --> 00:39:59.518
But r squared minus d is the sum
of independent random variables,
00:39:59.519 --> 00:40:02.660
the y i squared
minus 1's.
00:40:02.661 --> 00:40:06.097
y i squared minus 1 is from there.
00:40:06.098 --> 00:40:08.179
And each of them has mean 0.
00:40:09.360 --> 00:40:11.652
And they are independent.
00:40:11.653 --> 00:40:13.750
So put xi equals to
yi squared minus 1,
00:40:13.751 --> 00:40:15.721
the probability that
xi is greater or
00:40:15.722 --> 00:40:19.040
equal to delta minus 3 is what
we want we want to bound, okay.
00:40:20.560 --> 00:40:24.264
But we need the sth
moments of xi, so we have to
00:40:24.265 --> 00:40:28.522
bound moments of the Gaussian,
I mean this is not a big deal.
00:40:28.523 --> 00:40:30.952
It's standard, but
let's do that.
00:40:30.953 --> 00:40:35.874
So okay, either |y i| is less than
1, then |x i| is at most
00:40:35.875 --> 00:40:40.913
1 to any power, or |y i| is greater
than 1, then |x i| is at most y i squared.
So in any case we have that,
I've added both cases basically.
00:40:48.009 --> 00:40:50.561
So when I get that and that,
00:40:50.562 --> 00:40:55.560
I'm just writing down
the Gaussian moment, right?
00:40:59.524 --> 00:41:03.703
And I make this substitution,
you compute the moments by
00:41:03.704 --> 00:41:07.185
converting it to,
this is the gamma function,
00:41:07.186 --> 00:41:11.290
because we get e to the minus
that, times that to some power.
00:41:11.291 --> 00:41:12.252
That's a gamma function.
00:41:15.315 --> 00:41:19.511
So we get 2 to the s
times s factorial, okay.
00:41:19.512 --> 00:41:24.285
So the variance is at most
eight, a fixed number.
00:41:24.286 --> 00:41:28.450
But we do not quite have the hypothesis.
00:41:28.451 --> 00:41:33.266
So for the theorem we wanted
the sth moment to grow only
00:41:33.267 --> 00:41:36.519
as s factorial
times the variance.
00:41:36.520 --> 00:41:41.372
That's not true
because it grows as,
00:41:41.373 --> 00:41:47.749
where is it, it goes as 2
to the s times s factorial.
00:41:47.750 --> 00:41:51.497
So the expected value
of xi to the s,
00:41:54.115 --> 00:41:56.297
2 to the s times s factorial.
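The "2 to the s times s factorial" growth matches the exact even Gaussian moments: E of y to the 2s is the double factorial (2s - 1)!! = (2s)! / (2^s s!), which is at most 2^s times s factorial. A quick Python check, not from the lecture:

```python
import math

def gaussian_even_moment(s):
    """E[y^(2s)] for y ~ N(0,1): the double factorial
    (2s - 1)!! = (2s)! / (2^s * s!)."""
    return math.factorial(2 * s) / (2 ** s * math.factorial(s))

# the exact even moments sit below the 2^s * s! bound from the slide
for s in range(1, 10):
    assert gaussian_even_moment(s) <= 2 ** s * math.factorial(s)
```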
00:41:56.298 --> 00:42:00.507
But we'd like no 2 to the s.
00:42:02.686 --> 00:42:04.523
Right, we want just
this factorial.
00:42:04.524 --> 00:42:07.570
We don't want the 2
to the s growth.
00:42:07.571 --> 00:42:09.841
So how will we avoid it?
00:42:09.842 --> 00:42:10.661
I mean, we cannot avoid it.
00:42:10.662 --> 00:42:14.898
I mean, there is something to
the s here because that's how
00:42:14.899 --> 00:42:17.200
the gamma function is, right?
00:42:18.500 --> 00:42:21.300
So up to this point, up to
the gamma function, it's exact.
00:42:21.301 --> 00:42:23.770
All except for the one,
this was exact.
00:42:23.771 --> 00:42:25.650
So I cannot get
better than that.
00:42:26.720 --> 00:42:30.210
But 2 to the s is lower order
than s factorial, right?
00:42:30.211 --> 00:42:33.100
s factorial is
roughly s to the s.
00:42:33.101 --> 00:42:36.600
2 to the s is less than that,
so it should be manageable.
00:42:36.601 --> 00:42:38.950
So what might be a trick?
00:42:38.951 --> 00:42:41.040
I mean, so what shall I do?
00:42:41.041 --> 00:42:44.060
So maybe I can ask you the
question and then try to do it.
00:42:44.061 --> 00:42:48.920
So I would like random
variables sth moment to grow
00:42:48.921 --> 00:42:53.420
as s factorial, basically.
00:42:53.421 --> 00:42:57.440
But this is growing as 2
to the s times s factorial.
00:42:57.441 --> 00:43:00.440
So what's a good way of
trying to do something?
00:43:00.441 --> 00:43:04.203
So I just divide by 2,
then the 2 to the s goes away.
00:43:04.204 --> 00:43:06.779
But the factor of 2 is not
a problem because we are very
00:43:06.780 --> 00:43:09.535
generous with constant
factors we don't care, right?
00:43:09.536 --> 00:43:11.949
So that's what will happen.
00:43:11.950 --> 00:43:15.570
We put wi = xi / 2 and
everything goes through.
00:43:15.571 --> 00:43:17.860
Okay, so now since
there are no questions,
00:43:17.861 --> 00:43:21.330
there are 15 minutes, we'll
have to go over a new stuff,
00:43:21.331 --> 00:43:23.050
which I can do on the board.
00:43:23.051 --> 00:43:25.450
But if you want,
you can ask questions instead.
00:43:26.530 --> 00:43:30.667
Okay, so now I have to write
on the board because, again,
00:43:30.668 --> 00:43:33.210
I'm going faster
than I expected.
00:43:33.211 --> 00:43:37.300
So what the question
is asking is this.
00:43:37.301 --> 00:43:43.572
So now if x1 through xn are
Normal(0,1) independent, i.i.d.,
00:43:43.573 --> 00:43:48.613
then I would like, let's say,
the tail bounds for,
00:43:56.050 --> 00:43:59.110
Yeah, so
these are independent also.
00:43:59.111 --> 00:44:00.722
And the moments exist and
00:44:00.723 --> 00:44:03.415
we should be able to
get a bound for these.
00:44:03.416 --> 00:44:07.362
So that's a good point you
bring up in a way because for
00:44:07.363 --> 00:44:11.772
sum of squares, it's a standard
thing to have a bound on.
00:44:11.773 --> 00:44:13.746
This is probably less standard.
00:44:13.747 --> 00:44:19.195
But the moments still exist,
so this also follows-
00:44:19.196 --> 00:44:24.174
>> [INAUDIBLE]
00:44:24.175 --> 00:44:24.979
>> Yeah, so
00:44:24.980 --> 00:44:28.465
I have to work that out
to get what we get.
00:44:28.466 --> 00:44:32.990
>> [INAUDIBLE]
>> Okay, we'll work it out.
00:44:32.991 --> 00:44:35.627
So next time, maybe I should
work this out and then bring it,
00:44:35.628 --> 00:44:38.187
right, since we're trying to do,
but that's fair enough.
00:44:38.188 --> 00:44:41.639
It's a good question cuz
they should apply and
00:44:41.640 --> 00:44:45.355
get you a bound for
these normals cuz these are just
00:44:45.356 --> 00:44:48.204
independent and
sum of independent.
00:44:48.205 --> 00:44:51.186
So it'll be more exact
than 2 to the s.
00:44:51.187 --> 00:44:53.902
2 to the s times s
factorial is just a bound.
00:44:53.903 --> 00:44:59.005
But qualitatively,
the 2 sth moment of
00:44:59.006 --> 00:45:03.964
a normal Gaussian is
like s factorial.
00:45:03.965 --> 00:45:08.795
Well, you multiply terms
only up to 2s.
00:45:08.796 --> 00:45:12.361
>> [INAUDIBLE]
>> Yeah, but
00:45:12.362 --> 00:45:17.221
the leading term is s to the s,
the c to the s times s to the s,
00:45:17.222 --> 00:45:21.304
because you have s things
you're multiplying.
00:45:23.915 --> 00:45:28.710
Right, you're multiplying 1, 3,
5, and so on up to s, or 2, 4,
00:45:28.711 --> 00:45:31.725
but there are s things
you're multiplying.
00:45:31.726 --> 00:45:34.994
So it's at least, by Stirling's
formula, or something,
00:45:34.995 --> 00:45:37.506
it's s to the s times
the constant to the s.
00:45:37.507 --> 00:45:39.760
Qualitatively, that's
correct.
00:45:39.761 --> 00:45:43.890
So let me introduce
Johnson-Lindenstrauss,
00:45:43.891 --> 00:45:45.780
which we'll prove next time.
00:45:45.781 --> 00:45:50.770
So the idea is, You have
00:45:50.771 --> 00:45:55.950
points, Okay,
how do we denote them?
00:45:55.951 --> 00:46:02.849
x1, x2, up to xn in Rd, n large.
00:46:08.338 --> 00:46:11.820
There are many problems
in which this is used.
00:46:11.821 --> 00:46:19.061
But I want to start with
one concrete problem,
00:46:19.062 --> 00:46:25.938
which is estimate all
pairwise distances.
00:46:27.154 --> 00:46:29.797
There are n squared of them.
00:46:29.798 --> 00:46:31.602
And if I take a pair,
00:46:31.603 --> 00:46:35.757
it takes d time to find
the distances, right?
00:46:35.758 --> 00:46:38.290
There are d coordinates,
I have to subtract them and add.
00:46:38.291 --> 00:46:40.290
So it would seem to
take this much time.
00:46:46.214 --> 00:46:50.259
This theorem, what the
Johnson-Lindenstrauss theorem
00:46:50.260 --> 00:46:53.805
says is that if I have two
points, just two points,
00:46:53.806 --> 00:46:58.978
in d space, I
00:46:58.979 --> 00:47:02.180
can project them down to a much
smaller dimensional space.
00:47:10.116 --> 00:47:14.057
So that the distance between
them in the projection gives you
00:47:14.058 --> 00:47:17.390
a good estimate of the distance
between these two.
00:47:18.770 --> 00:47:23.720
There's a known scale factor,
but up to that, it's almost exact.
00:47:23.721 --> 00:47:27.120
So let's see what we expect
here before we go on.
00:47:27.121 --> 00:47:29.920
So I have this point.
00:47:29.921 --> 00:47:32.120
Let's call this point x,
this point y.
00:47:32.121 --> 00:47:36.533
I want to estimate
(xi- yi) squared.
00:47:41.488 --> 00:47:43.512
So first, intuitively,
00:47:43.513 --> 00:47:47.100
and then we'll do
the actual proof next time.
00:47:47.101 --> 00:47:50.730
Suppose we choose a random
coordinate system.
00:47:56.840 --> 00:47:59.877
Random coordinate system just
means I take the regular
00:47:59.878 --> 00:48:02.660
coordinate system and
rotate it by a random amount.
00:48:04.220 --> 00:48:08.640
Another way of saying that is I
pick the x1 axis uniformly at
00:48:08.641 --> 00:48:13.260
random, and then x2 axis
uniformly at random,
00:48:13.261 --> 00:48:14.760
perpendicular to x1, and so on.
00:48:19.281 --> 00:48:23.423
So if I took a random
coordinate system intuitively,
00:48:23.424 --> 00:48:26.560
the xi- yi should all
be roughly equal.
00:48:28.410 --> 00:48:31.918
So roughly, xi- yi,
00:48:31.919 --> 00:48:38.050
all equal, Because they were the
random coordinate system, right?
00:48:38.051 --> 00:48:40.060
This is not a proof.
00:48:40.061 --> 00:48:41.174
I mean, again,
we'll prove it carefully.
00:48:41.175 --> 00:48:45.836
So roughly,
they should all be equal,
00:48:45.837 --> 00:48:49.520
and therefore, each xi- yi.
00:48:53.748 --> 00:48:57.613
Now if they are all equal,
00:48:57.614 --> 00:49:03.780
this was the length
of |x- y| squared.
00:49:03.781 --> 00:49:05.770
How would they be
related to x- y?
00:49:07.890 --> 00:49:10.499
There are d coordinates,
they are all equal,
00:49:10.500 --> 00:49:13.047
would I then have 1 /
d times this, right?
00:49:17.686 --> 00:49:20.067
No, not quite because
it's Pythagoras, so
00:49:20.068 --> 00:49:21.356
only the squares add up.
00:49:21.357 --> 00:49:23.244
So this will be 1 / root d.
00:49:26.158 --> 00:49:29.655
Okay, so this is correct.
00:49:29.656 --> 00:49:32.745
It turns out to be
provable as a theorem.
00:49:32.746 --> 00:49:35.256
This is what was proven by
Johnson-Lindenstrauss in
00:49:35.257 --> 00:49:38.565
a 50-page paper, perhaps,
in the beginning.
00:49:38.566 --> 00:49:41.595
And the difficulty was
the random coordinate system.
00:49:43.590 --> 00:49:47.860
So this was difficult to do and
prove, but
00:49:47.861 --> 00:49:52.740
later proofs use the following
very simple thing.
00:49:53.883 --> 00:50:01.153
Pick a Gaussian vector, v.
00:50:01.154 --> 00:50:01.875
What that means
00:50:01.876 --> 00:50:05.240
is v is picked from
the distribution Normal(0,I).
00:50:05.241 --> 00:50:09.927
This is bold face, it's a
vector, right, picked from this.
00:50:09.928 --> 00:50:13.866
Then take (x - y) dot
00:50:13.867 --> 00:50:18.860
v, with v normalized to length 1.
00:50:20.835 --> 00:50:23.410
Okay, so what should it be?
00:50:23.411 --> 00:50:28.289
What should the dot product
of this be in relation
00:50:28.290 --> 00:50:29.957
to x- y length?
00:50:29.958 --> 00:50:35.349
Based on that kind of intuition,
00:50:35.350 --> 00:50:39.910
what would you expect?
00:50:39.911 --> 00:50:42.452
This and that.
00:50:45.242 --> 00:50:46.950
Same 1 / root d, right,
00:50:46.951 --> 00:50:49.910
because this is again
just a random x1 axis.
00:50:52.490 --> 00:50:57.470
Now, sometimes you need a more
solid estimate because we have
00:50:57.471 --> 00:51:01.408
many points and
we want all pairwise distances.
00:51:01.409 --> 00:51:05.026
So what Johnson-Lindenstrauss
originally used is a random
00:51:05.027 --> 00:51:06.845
coordinate system, right?
00:51:06.846 --> 00:51:10.649
So the biggest difficulty
here was that x2 had to be
00:51:10.650 --> 00:51:13.230
perpendicular to x1, and so on.
00:51:13.231 --> 00:51:18.990
But instead,
it's enough to pick k of these,
00:51:18.991 --> 00:51:22.965
which are all
independent Normal(0,I).
00:51:25.499 --> 00:51:27.221
They won't be orthogonal, but
00:51:27.222 --> 00:51:30.616
we saw that they'll be close
to orthogonal, that's enough.
00:51:30.617 --> 00:51:35.320
It turns out then that (x - y)
00:51:35.321 --> 00:51:40.192
dot v i squared, summed from i = 1 to k,
00:51:43.805 --> 00:51:48.568
would be roughly k / d
times |x - y| squared.
00:51:48.569 --> 00:51:51.692
So this is saying I take k
coordinate directions and
00:51:51.693 --> 00:51:53.370
take the sum of squares.
00:51:53.371 --> 00:51:56.147
They are not quite,
they are not perpendicular, but
00:51:56.148 --> 00:51:57.697
it doesn't matter, right?
00:51:57.698 --> 00:51:59.354
This can be carried
out very easily, and
00:51:59.355 --> 00:52:01.093
we'll see that the proof
is very simple.
00:52:01.094 --> 00:52:04.442
Proof is a few lines
from the Gaussian.
00:52:04.443 --> 00:52:09.411
So a lot of effort goes here into
ensuring the orthogonality
00:52:09.412 --> 00:52:12.873
of these, but
that's unnecessary.
00:52:12.874 --> 00:52:15.441
We can just take
Gaussian vectors,
00:52:15.442 --> 00:52:18.016
they need not be
quite orthogonal.
00:52:18.017 --> 00:52:20.079
This is a quantity
we can easily find.
00:52:20.080 --> 00:52:24.381
Therefore, we might as
well deal with this.
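The idea can be sketched with unnormalized Gaussian directions. For v drawn from Normal(0, I), the expected value of ((x - y) dot v) squared is exactly |x - y| squared, so averaging over k directions estimates the squared distance; the k / d scale factor of the lecture appears when the directions are normalized to unit length instead. A Python sketch with a fixed seed, not from the lecture:

```python
import random

random.seed(1)
d, k = 500, 200
x = [random.uniform(-1, 1) for _ in range(d)]
y = [random.uniform(-1, 1) for _ in range(d)]
diff = [a - b for a, b in zip(x, y)]
true_sq = sum(v * v for v in diff)

# project onto k independent Gaussian vectors;
# no orthogonalization is needed
est = 0.0
for _ in range(k):
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    dot = sum(a * b for a, b in zip(diff, v))
    est += dot * dot
est /= k  # E[((x-y).v)^2] = |x-y|^2 for unnormalized N(0, I) directions

# relative error shrinks like 1/sqrt(k)
assert abs(est - true_sq) / true_sq < 0.4
```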
00:52:24.382 --> 00:52:25.547
So we'll prove that next time.
00:52:25.548 --> 00:52:29.750
It'll be very simple again,
as I told you that this is true.
00:52:29.751 --> 00:52:32.103
And I'll probably give you a
couple of applications of that.