WEBVTT
00:00:00.580 --> 00:00:04.350
Hello, well I'm very excited
to start the tutorial.
00:00:04.350 --> 00:00:09.119
Our first speaker is
Vitaly Kuznetsov.
00:00:09.119 --> 00:00:10.840
He's a research scientist at Google.
00:00:10.840 --> 00:00:14.650
He did his PhD at the Courant
Institute of Mathematical Sciences and
00:00:14.650 --> 00:00:18.010
his research interests
are mostly in non-stationary
00:00:18.010 --> 00:00:20.790
time series analysis in both
theory and applications.
00:00:20.790 --> 00:00:21.300
Without further ado.
00:00:25.510 --> 00:00:29.390
>> Thank you so
much for the introduction.
00:00:29.390 --> 00:00:32.828
And I'm also very excited
to talk to you today about
00:00:32.828 --> 00:00:36.460
non-stationary time series.
00:00:36.460 --> 00:00:42.150
Actually let me start
by saying a few words.
00:00:42.150 --> 00:00:45.010
Why do we actually care about
00:00:45.010 --> 00:00:48.070
time series analysis in
the first place, right?
00:00:48.070 --> 00:00:49.770
Why are we all here today?
00:00:50.860 --> 00:00:55.420
In fact,
time series data appears in many,
00:00:55.420 --> 00:00:58.240
many real world
applications these days.
00:00:59.730 --> 00:01:03.269
Stock values,
various other economic variables,
00:01:05.270 --> 00:01:10.870
local temperatures or
global climate trends.
00:01:10.870 --> 00:01:15.750
All of these often come in
the form of time series data.
00:01:15.750 --> 00:01:19.850
In fact,
I personally like to argue that
00:01:19.850 --> 00:01:24.100
all of the data that we're
collecting is time series data.
00:01:24.100 --> 00:01:26.810
Precisely because we are collecting
this data over time, right?
00:01:28.660 --> 00:01:32.970
If you look at this list,
you will see that
00:01:32.970 --> 00:01:37.527
actually in a lot of
these applications it is
00:01:37.527 --> 00:01:42.577
critical to be able to do
time series analysis and
00:01:42.577 --> 00:01:46.662
time series forecasting accurately.
00:01:46.662 --> 00:01:51.735
If we're talking about stock values
then there could be millions or
00:01:51.735 --> 00:01:53.250
billions at stake.
00:01:54.430 --> 00:01:59.330
If we're talking about climate,
even future of our planet.
00:02:01.560 --> 00:02:06.610
But I also want to argue that
time series analysis is not only a
00:02:06.610 --> 00:02:11.470
very important problem, but it is
also often a very challenging task.
00:02:13.750 --> 00:02:17.920
So, let's look at maybe some
examples of time series here.
00:02:17.920 --> 00:02:22.720
So you probably have heard
about Google trends, right?
00:02:23.780 --> 00:02:27.068
So here's a plot of two time
series from Google trends,
00:02:27.068 --> 00:02:32.980
this is the interest over time
in two different search terms,
00:02:32.980 --> 00:02:35.530
machine learning and deep learning.
00:02:36.620 --> 00:02:41.230
And as you look at these time series,
maybe you see some patterns there
00:02:41.230 --> 00:02:49.000
right, maybe some common trend
component, or any other patterns.
00:02:49.000 --> 00:02:53.590
But the initial intuition that
we may have from these simple
00:02:53.590 --> 00:02:57.260
plots can often actually be
quite misleading.
00:02:57.260 --> 00:03:01.910
For instance,
if I were to zoom in on earlier
00:03:01.910 --> 00:03:06.649
segments of these time series,
say for 2012,
00:03:06.649 --> 00:03:11.090
2013, then what I would
see is a plot like that,
00:03:11.090 --> 00:03:14.950
which looks very different
from my original plot.
00:03:14.950 --> 00:03:16.840
And we can see that
these time series, for
00:03:16.840 --> 00:03:21.180
instance, also look quite
different from each other, right?
00:03:21.180 --> 00:03:26.920
The machine learning time series
is sort of more volatile,
00:03:26.920 --> 00:03:29.209
maybe has some seasonal
component.
00:03:30.420 --> 00:03:34.690
The deep learning one, on the
other hand, is more stable and
00:03:34.690 --> 00:03:36.160
doesn't vary that much over time.
00:03:38.110 --> 00:03:42.230
On the other hand if I were to
progress further in time and
00:03:42.230 --> 00:03:48.880
look at the more recent data, the 2015,
2016 period.
00:03:48.880 --> 00:03:52.182
Then the picture changes once again.
00:03:52.182 --> 00:03:57.820
All right, now we see some sort
of trends, both time series
00:03:57.820 --> 00:04:04.120
are moving in a much more volatile
fashion and, maybe surprisingly,
00:04:04.120 --> 00:04:07.410
we can even see that, in some cases,
they are moving in opposite
00:04:07.410 --> 00:04:11.280
directions, which is also something
that is counterintuitive for
00:04:11.280 --> 00:04:12.690
these two search terms.
00:04:13.710 --> 00:04:18.950
So these kinds of phenomena
of non-stationarity
00:04:18.950 --> 00:04:23.580
make time series analysis
very challenging and
00:04:23.580 --> 00:04:27.570
it makes time series analysis
also very different from
00:04:27.570 --> 00:04:32.180
the usual machine learning that
we are used to, all right?
00:04:32.180 --> 00:04:34.750
Typically, what we expect
00:04:34.750 --> 00:04:39.230
in machine learning is
some sort of i.i.d. assumption.
00:04:39.230 --> 00:04:43.820
We expect that our
training distribution is
00:04:43.820 --> 00:04:48.320
going to be the same as
our test distribution.
00:04:48.320 --> 00:04:52.240
We expect that these
distributions are fixed and
00:04:52.240 --> 00:04:56.860
do not change over time,
which means they're stationary.
00:04:56.860 --> 00:05:01.770
But this is not the case for
time series analysis.
00:05:01.770 --> 00:05:07.350
In general, these assumptions are
violated in these kinds of scenarios.
00:05:07.350 --> 00:05:11.620
So how can we analyze
this kind of data?
00:05:11.620 --> 00:05:18.940
How can we learn from these kinds
of non-stationary data?
00:05:18.940 --> 00:05:21.330
These are the kinds of
questions that we will
00:05:21.330 --> 00:05:25.820
try to investigate and try
to answer in this tutorial.
00:05:27.030 --> 00:05:30.830
And the plan for
today is going to be as follows.
00:05:30.830 --> 00:05:36.930
The first thing I'm going
to do is give a very quick
00:05:36.930 --> 00:05:42.289
30 minute introduction to classical,
traditional time series analysis.
00:05:43.470 --> 00:05:47.690
Then we're gonna have a very
short two minute break just to
00:05:47.690 --> 00:05:49.240
catch some breath.
00:05:49.240 --> 00:05:55.020
And then we're gonna discuss some
more recent learning theory that
00:05:55.020 --> 00:06:00.786
has been developed for this kind
of non-stationary time series.
00:06:00.786 --> 00:06:02.003
After that,
00:06:02.003 --> 00:06:06.980
after theory we're gonna take a
longer break, a whole four minutes.
00:06:06.980 --> 00:06:10.860
And then after that we're
going to switch gears and
00:06:10.860 --> 00:06:16.140
talk a little bit about algorithmic
solutions that are based
00:06:16.140 --> 00:06:18.700
on this theory that we'll present.
00:06:18.700 --> 00:06:25.070
And finally, we'll discuss some
connections between time series
00:06:25.070 --> 00:06:30.110
prediction and this very hot topic
these days, that of online learning.
00:06:32.558 --> 00:06:36.030
All right, so
let us then jump right in and
00:06:36.030 --> 00:06:39.940
let us talk about a classical
time series analysis.
00:06:41.110 --> 00:06:46.001
The way classical time series
analysis works, the way
00:06:46.001 --> 00:06:50.790
this classical framework
operates is as follows.
00:06:50.790 --> 00:06:55.445
The first thing that you
typically do is postulate some
00:06:55.445 --> 00:06:58.812
parametric model that
you assume is
00:06:58.812 --> 00:07:02.689
generating the data
that you're observing.
00:07:04.860 --> 00:07:08.700
Then once you have done
this critical step,
00:07:08.700 --> 00:07:13.800
you take the data that you
observe and you use this data to
00:07:13.800 --> 00:07:19.302
estimate the unknown parameters
of this parametric model.
00:07:19.302 --> 00:07:23.160
Then finally once you have this
estimated model you'll use it for
00:07:23.160 --> 00:07:27.870
whatever purpose you want
to use it for, for
00:07:27.870 --> 00:07:32.760
instance prediction, all right.
00:07:32.760 --> 00:07:35.110
So what kind of
models are out there?
00:07:35.110 --> 00:07:39.060
Well if you start thinking about
what you're trying to do right,
00:07:39.060 --> 00:07:42.770
somehow what you're
trying to do is to
00:07:42.770 --> 00:07:46.930
use your past observations to make
the forecasts about the future.
00:07:47.930 --> 00:07:52.820
So a very natural thing to do
would be to take your last p
00:07:52.820 --> 00:07:58.570
observations, combine
them in a certain way and
00:07:58.570 --> 00:08:00.980
that would be your
forecast of the future.
00:08:00.980 --> 00:08:07.110
And that's precisely what,
so-called autoregressive models are.
00:08:07.110 --> 00:08:10.380
They are linear combinations of your
00:08:10.380 --> 00:08:15.110
past p observations with weights a1,
a2, up to ap.
00:08:16.800 --> 00:08:19.140
Which we call autoregressive
coefficients.
00:08:20.180 --> 00:08:23.030
And once you form this
linear combination of
00:08:23.030 --> 00:08:27.970
the previous observations Yt,
00:08:27.970 --> 00:08:32.880
Y t-1, Y t-2 up to Y t-p,
you add some noise term epsilon t.
00:08:33.940 --> 00:08:37.540
And you assume that
this is going to be
00:08:37.540 --> 00:08:39.408
the next iteration in this sequence.
00:08:39.408 --> 00:08:43.640
So these noise terms
that you typically add,
00:08:43.640 --> 00:08:47.550
they're uncorrelated zero
mean random variables.
00:08:47.550 --> 00:08:51.940
And they are supposed to
capture the uncertainty
00:08:51.940 --> 00:08:56.879
that is not captured by the linear
combination of the past observations.
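The autoregressive recursion just described can be sketched in a few lines of numpy; the coefficients and noise scale below are illustrative choices, not values from the talk.

```python
import numpy as np

def simulate_ar(coeffs, n, sigma=1.0, seed=0):
    """Simulate an AR(p) process: Y_t = a_1*Y_{t-1} + ... + a_p*Y_{t-p} + eps_t."""
    rng = np.random.default_rng(seed)
    p = len(coeffs)
    y = np.zeros(n + p)                        # zero pre-sample values for warm-up
    eps = rng.normal(0.0, sigma, size=n + p)   # uncorrelated zero-mean noise terms
    for t in range(p, n + p):
        # Linear combination of the last p observations, most recent first.
        y[t] = np.dot(coeffs, y[t - p:t][::-1]) + eps[t]
    return y[p:]

y = simulate_ar([0.5, -0.25], n=500)
```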
00:08:58.610 --> 00:09:00.820
At this point, you may say, well,
00:09:00.820 --> 00:09:05.830
why restrict yourself only to
the uncertainty at time t, right?
00:09:05.830 --> 00:09:10.180
Maybe uncertainty of previous times
also may play a role in this.
00:09:10.180 --> 00:09:13.650
And if you kind of start
thinking along these lines,
00:09:13.650 --> 00:09:18.020
then the model that
you're going to arrive at
00:09:18.020 --> 00:09:22.200
is going to be this so-called
moving average model,
00:09:22.200 --> 00:09:27.370
which assumes that the observation
00:09:27.370 --> 00:09:33.190
Yt is a combination of the
uncertainty at time t, epsilon t.
00:09:33.190 --> 00:09:38.564
And the uncertainties at
previous times, Epsilon t-1,
00:09:38.564 --> 00:09:45.040
epsilon t-2,
up to epsilon t-q.
00:09:45.040 --> 00:09:49.856
And they are combined with certain
weights, which we call moving
00:09:49.856 --> 00:09:52.795
average coefficients, right.
00:09:52.795 --> 00:09:55.585
And then combining these two models.
00:09:55.585 --> 00:10:00.148
The autoregressive model of
the past observations and the moving
00:10:00.148 --> 00:10:05.235
average model, we arrive at a very
classical and celebrated model,
00:10:05.235 --> 00:10:10.340
which is called ARMA model,
AutoRegressive Moving Average model.
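A minimal numpy sketch of the ARMA(p, q) recursion just assembled; the coefficient values are illustrative, not from the talk.

```python
import numpy as np

def simulate_arma(ar, ma, n, seed=0):
    """Simulate ARMA(p, q):
    Y_t = sum_i a_i*Y_{t-i} + eps_t + sum_j b_j*eps_{t-j}."""
    rng = np.random.default_rng(seed)
    p, q = len(ar), len(ma)
    pad = max(p, q)
    y = np.zeros(n + pad)
    eps = rng.standard_normal(n + pad)
    for t in range(pad, n + pad):
        y[t] = (np.dot(ar, y[t - p:t][::-1])       # autoregressive part
                + eps[t]                           # current noise term
                + np.dot(ma, eps[t - q:t][::-1]))  # moving average part
    return y[pad:]

series = simulate_arma(ar=[0.6], ma=[0.3], n=300)
```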
00:10:11.860 --> 00:10:17.080
It was proposed, I think,
back in the 50s by Whittle,
00:10:17.080 --> 00:10:18.620
maybe even earlier, and
00:10:18.620 --> 00:10:24.050
it was actually popularized
by Box & Jenkins in the 70s.
00:10:26.080 --> 00:10:30.997
So, let us actually look
at an example of a process
00:10:30.997 --> 00:10:36.629
distributed according to an ARMA model,
and here's one.
00:10:36.629 --> 00:10:42.310
Looks like a bunch of random noise,
ups and downs, some peaks,
00:10:42.310 --> 00:10:48.117
but one interesting thing that
you may observe is the following.
00:10:48.117 --> 00:10:52.389
If you take two chunks
of these time series,
00:10:52.389 --> 00:10:56.890
like say an earlier chunk and
a later chunk.
00:10:56.890 --> 00:11:01.890
If you look at these two chunks,
then in certain statistical sense
00:11:01.890 --> 00:11:06.000
you won't be able to distinguish
these two chunks, and
00:11:06.000 --> 00:11:11.350
this phenomenon is what's
known as stationarity.
00:11:11.350 --> 00:11:14.740
More formally, it is defined
exactly in this way.
00:11:14.740 --> 00:11:19.320
If I take any chunk
of my time series and
00:11:19.320 --> 00:11:24.620
I shift it by k units, and
if I look at the distribution of
00:11:24.620 --> 00:11:28.370
the original chunk and if I look at
the distribution of the new chunk,
00:11:28.370 --> 00:11:32.060
and if these two distributions
happen to be exactly the same,
00:11:32.060 --> 00:11:35.570
that's what we call
stationarity.
00:11:36.770 --> 00:11:41.866
In fact, this is often referred
to as strong stationarity; there is also
00:11:41.866 --> 00:11:46.374
weak stationarity, or
second moment stationarity,
00:11:46.374 --> 00:11:50.784
a property that only requires
basically the first and
00:11:50.784 --> 00:11:54.803
second moments of these
distributions to match.
00:11:54.803 --> 00:11:59.260
Now, more precisely,
you want the mean of your
00:11:59.260 --> 00:12:03.516
stochastic process to be
independent of time, and
00:12:03.516 --> 00:12:09.720
you want the autocovariance function of the
stochastic process to be independent of time.
00:12:09.720 --> 00:12:16.219
Which means that the covariance or
correlation of any ZT,
00:12:16.219 --> 00:12:20.329
ZT-J only depends on the lag J, but
00:12:20.329 --> 00:12:26.030
does not depend on
the exact time point, T.
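The two weak-stationarity conditions just stated can be written compactly as (using $Z_t$ for the process, as on the slide):

```latex
\mathbb{E}[Z_t] = \mu \quad \text{for all } t,
\qquad
\operatorname{Cov}(Z_t,\, Z_{t-j}) = \gamma(j) \quad \text{for all } t,
```

so the mean is constant and the autocovariance depends only on the lag $j$, not on the time point $t$.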
00:12:26.030 --> 00:12:30.890
So, is our original intuition
about the ARMA process correct?
00:12:30.890 --> 00:12:32.380
Is it actually stationary, right?
00:12:32.380 --> 00:12:36.510
Cuz it seemed from the plot I have
shown to you that two chunks would
00:12:36.510 --> 00:12:41.700
be kind of exactly the same,
so we need some tools.
00:12:41.700 --> 00:12:45.720
So basically for this analysis,
sort of the standard tool
00:12:45.720 --> 00:12:49.420
that you should be familiar with if
you want to do anything with these
00:12:49.420 --> 00:12:54.340
kinds of autoregressive models
is the so-called lag operator.
00:12:54.340 --> 00:12:57.070
So it's a very simple thing.
00:12:57.070 --> 00:13:03.090
Basically, what this lag operator does
is that if you're given an observation
00:13:03.090 --> 00:13:09.380
Yt, and you apply this Lag
Operator L to this observation Yt,
00:13:09.380 --> 00:13:13.544
then what you're gonna get back is
the previous observation Yt- 1.
00:13:14.710 --> 00:13:15.240
Very simple.
00:13:16.500 --> 00:13:21.560
But then what you can do, you can
actually rewrite the ARMA model
00:13:21.560 --> 00:13:27.860
that we had a few slides before
in terms of these lag operators.
00:13:27.860 --> 00:13:31.640
In particular,
the way it's gonna look,
00:13:31.640 --> 00:13:36.750
you're gonna form this polynomial in
terms of the lag operator, right?
00:13:36.750 --> 00:13:44.220
So on the left-hand side, I have
1 minus the sum of ai times the
00:13:45.500 --> 00:13:50.680
lag operator, to the power of i,
summed from i equals 1 to p.
00:13:50.680 --> 00:13:51.970
And I apply it to Yt.
00:13:53.020 --> 00:13:57.540
And on the right-hand
side, I form a similar
00:13:57.540 --> 00:14:00.520
polynomial in terms
of the lag operator.
00:14:00.520 --> 00:14:03.035
And apply it to the noise term.
00:14:03.035 --> 00:14:07.970
And actually if you just
work out the algebra and
00:14:07.970 --> 00:14:12.900
apply the definition of the lag
operator you'll arrive at exactly
00:14:12.900 --> 00:14:17.610
the same formulation that we had for
the ARMA model a few slides before.
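To make that rewriting concrete, here is a small numpy sketch, with illustrative ARMA(1, 1) coefficients, checking numerically that the polynomial in the lag operator applied to Yt reproduces the noise-side polynomial applied to epsilon:

```python
import numpy as np

def lag(x, k):
    """Lag operator applied k times: (L^k x)_t = x_{t-k}, zero before the start."""
    if k == 0:
        return x.copy()
    out = np.zeros_like(x)
    out[k:] = x[:-k]
    return out

rng = np.random.default_rng(1)
a1, b1 = 0.6, 0.3                 # illustrative ARMA(1, 1) coefficients
n = 200
eps = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = a1 * y[t - 1] + eps[t] + b1 * eps[t - 1]

lhs = y - a1 * lag(y, 1)          # (1 - a1 L) applied to Y
rhs = eps + b1 * lag(eps, 1)      # (1 + b1 L) applied to eps
assert np.allclose(lhs[1:], rhs[1:])  # identical away from the initial boundary
```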
00:14:19.690 --> 00:14:21.500
But the reason it is
00:14:21.500 --> 00:14:26.080
nice to write out the ARMA
model in this way,
00:14:26.080 --> 00:14:32.190
is because it exposes interesting
properties of this ARMA model,
00:14:32.190 --> 00:14:36.690
in terms of the Lag Operator, and
00:14:36.690 --> 00:14:40.320
the special polynomial that we
have on the left-hand side.
00:14:40.320 --> 00:14:41.639
In fact, we're gonna call this
00:14:42.990 --> 00:14:47.480
polynomial on the left-hand
side the characteristic polynomial, right?
00:14:47.480 --> 00:14:51.950
This polynomial that I denote by P
00:14:51.950 --> 00:14:59.670
here, P of z is 1 minus the sum
of ai z to the power of i,
00:14:59.670 --> 00:15:03.530
summed from i equals 1 to p, right?
00:15:03.530 --> 00:15:08.080
And it precisely characterizes
the stationarity properties of ARMA
00:15:08.080 --> 00:15:09.680
processes.
00:15:09.680 --> 00:15:15.940
In particular, a very standard
result says the following.
00:15:15.940 --> 00:15:20.920
If all the roots of
the characteristic polynomial P
00:15:20.920 --> 00:15:26.640
that we had on the previous slide
lie outside of the unit circle,
00:15:26.640 --> 00:15:32.630
then the ARMA(p, q) process is
going to be weakly stationary.
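That stationarity criterion is easy to check numerically; here is a sketch with illustrative AR(2) coefficients (note that numpy.roots expects coefficients from the highest degree down to the constant term):

```python
import numpy as np

# Characteristic polynomial P(z) = 1 - a1*z - a2*z^2 for an illustrative AR(2).
a1, a2 = 0.5, -0.25
roots = np.roots([-a2, -a1, 1.0])   # highest-degree coefficient first

# All roots strictly outside the unit circle => weakly stationary.
is_stationary = bool(np.all(np.abs(roots) > 1.0))
print(is_stationary)  # True for these coefficients (roots are 1 +/- i*sqrt(3))
```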
00:15:33.990 --> 00:15:34.530
Right.
00:15:34.530 --> 00:15:37.160
This is actually
a very simple result.
00:15:37.160 --> 00:15:40.940
In fact, the proof of this
result is also rather simple.
00:15:40.940 --> 00:15:43.750
It's just basically applying
00:15:43.750 --> 00:15:47.670
what you know about
the roots of the polynomial.
00:15:47.670 --> 00:15:52.910
But in the interest of time,
I'm gonna skip the proof and
00:15:54.420 --> 00:15:56.620
in fact,
we're gonna post the slides online.
00:15:56.620 --> 00:15:59.900
So, you're gonna be able
to check out the proof for
00:15:59.900 --> 00:16:03.660
yourself if you want a reference.
00:16:03.660 --> 00:16:10.420
But let us actually go back to what
we started talking about originally.
00:16:10.420 --> 00:16:15.170
We were interested in modeling time
series that are non-stationary.
00:16:15.170 --> 00:16:17.130
Because they're much
more realistic for
00:16:17.130 --> 00:16:20.080
any application that
we want to tackle.
00:16:21.420 --> 00:16:24.360
But we have just said
that these classical ARMA
00:16:24.360 --> 00:16:26.220
models are stationary.
00:16:27.720 --> 00:16:28.370
Are we doomed?
00:16:28.370 --> 00:16:30.480
Can we actually change things?
00:16:30.480 --> 00:16:34.010
Can we actually use this
generative framework for
00:16:35.520 --> 00:16:37.890
modelling non-stationary
time series?
00:16:37.890 --> 00:16:40.870
Well, it turns out that
this lag operator and
00:16:40.870 --> 00:16:43.240
this characteristic polynomial
00:16:43.240 --> 00:16:48.660
allow you to build models of
non-stationary time series as well.
00:16:48.660 --> 00:16:53.590
In particular, one way to do
that is to use ARMA processes
00:16:53.590 --> 00:16:59.030
whose characteristic polynomials
do have a few unit roots, all right?
00:16:59.030 --> 00:17:01.600
They have roots that
lie on the unit circle.
00:17:02.900 --> 00:17:06.700
So if that's the case then
you can factorize such a
00:17:06.700 --> 00:17:11.690
polynomial into a polynomial R(z),
all right,
00:17:11.690 --> 00:17:14.584
which has all its roots
outside of the unit circle.
00:17:14.584 --> 00:17:20.010
And then a polynomial (1 - z)^D,
00:17:20.010 --> 00:17:25.610
which has a unit root
of multiplicity D.
00:17:25.610 --> 00:17:30.600
And if you have an ARMA
process with such a
00:17:30.600 --> 00:17:34.760
characteristic polynomial,
then this process
00:17:34.760 --> 00:17:40.630
is going to be non-stationary, and
that's how
00:17:40.630 --> 00:17:46.310
these celebrated ARIMA models
of order D are defined.
00:17:46.310 --> 00:17:50.980
So in a way, you can think
of these ARIMA models, and
00:17:50.980 --> 00:17:54.930
the I here, by the way, stands for
integrated, as
00:17:54.930 --> 00:17:59.650
ARMA models of
a process obtained by applying the
00:18:01.600 --> 00:18:09.190
polynomial, one minus the lag operator to
the power of D, to the process Yt.
00:18:10.370 --> 00:18:14.530
Take for instance the simple
case of D equals one.
00:18:14.530 --> 00:18:16.020
Right.
00:18:16.020 --> 00:18:22.590
So what the ARIMA model of order d
equals 1 says is that basically,
00:18:22.590 --> 00:18:27.910
Yt minus Yt-1 is
a stationary ARMA process.
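The d = 1 case is just first differencing, as a quick numpy sketch shows: a random walk (the simplest integrated process, my example rather than the speaker's) is non-stationary, yet applying (1 - L) recovers its stationary innovations exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.standard_normal(500)   # stationary white-noise innovations
walk = np.cumsum(eps)            # integrated once: a non-stationary random walk

diff = np.diff(walk)             # apply (1 - L): Y_t - Y_{t-1}
assert np.allclose(diff, eps[1:])  # differencing recovers the innovations
```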
00:18:31.660 --> 00:18:35.010
So what do ARIMA
processes look like now?
00:18:35.010 --> 00:18:38.310
Well, you know, if you plot one,
you will get a picture like that,
00:18:38.310 --> 00:18:43.270
that's sort of much better in
terms of non-stationarity.
00:18:43.270 --> 00:18:48.440
Right, we at least get some
stochastic trend here, perhaps.
00:18:48.440 --> 00:18:50.820
And you can keep playing this game,
00:18:50.820 --> 00:18:54.458
if you want to sort of
model all other kinds of
00:18:54.458 --> 00:18:59.590
non-stationarity
phenomena explicitly.
00:18:59.590 --> 00:19:05.080
You can come up with all sort
of extensions of models,
00:19:05.080 --> 00:19:07.350
there is a whole zoo of those.
00:19:07.350 --> 00:19:10.710
You can incorporate
a seasonal component,
00:19:10.710 --> 00:19:14.948
you can incorporate long
memory, you can incorporate
00:19:14.948 --> 00:19:19.890
so-called fractional effects,
time varying coefficients.
00:19:21.920 --> 00:19:24.760
There is really no limit to that.
00:19:24.760 --> 00:19:26.540
Nonlinear models, etc., etc.
00:19:28.830 --> 00:19:33.390
One extension that probably
any talk like this should
00:19:33.390 --> 00:19:37.492
mention is the one that
tries to incorporate
00:19:37.492 --> 00:19:43.030
non-stationarity in the second
moment of the time series.
00:19:43.030 --> 00:19:47.130
In other words, it tries to model
a non-stationary variance.
00:19:48.220 --> 00:19:52.129
And that's what's known
as a GARCH model.
00:19:52.129 --> 00:19:57.755
And this model simply says, well,
I want a non-stationary variance.
00:19:57.755 --> 00:20:01.494
So I'm gonna assume that my variance
also is a stochastic process.
00:20:01.494 --> 00:20:08.550
Which is just an ARMA process and
that's how I define a GARCH model.
00:20:08.550 --> 00:20:13.138
So it was introduced by
Engle in the 80s and
00:20:13.138 --> 00:20:17.354
that's what he got
the Nobel Prize for.
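A minimal sketch of a GARCH(1,1) recursion, with illustrative parameter values; the conditional variance itself follows an ARMA-type recursion, as just described.

```python
import numpy as np

def simulate_garch(omega, alpha, beta, n, seed=0):
    """GARCH(1,1): sigma2_t = omega + alpha*y_{t-1}^2 + beta*sigma2_{t-1},
    with y_t = sqrt(sigma2_t) * z_t and z_t standard normal."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    y = np.zeros(n)
    # Start at the unconditional variance (requires alpha + beta < 1).
    sigma2 = np.full(n, omega / (1.0 - alpha - beta))
    y[0] = np.sqrt(sigma2[0]) * z[0]
    for t in range(1, n):
        sigma2[t] = omega + alpha * y[t - 1] ** 2 + beta * sigma2[t - 1]
        y[t] = np.sqrt(sigma2[t]) * z[t]
    return y, sigma2

returns, variances = simulate_garch(omega=0.1, alpha=0.1, beta=0.85, n=1000)
```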
00:20:21.762 --> 00:20:26.312
Now, here is an example
of a GARCH process.
00:20:26.312 --> 00:20:30.684
You can see that the volatility
kind of changes over time for
00:20:30.684 --> 00:20:32.100
this time series.
00:20:32.100 --> 00:20:34.840
And then you can continue coming up,
00:20:34.840 --> 00:20:38.350
as I said, with all sort
of other more fancy models.
00:20:38.350 --> 00:20:41.414
So another example is the
00:20:44.786 --> 00:20:49.974
class of state-space models,
which you can basically think of
00:20:49.974 --> 00:20:54.970
as just a continuous space
analog of Hidden Markov Models.
00:20:55.980 --> 00:21:02.560
The way they work, you assume that
there is some unobserved state Xt.
00:21:02.560 --> 00:21:06.588
that follows an autoregressive
model, right?
00:21:06.588 --> 00:21:12.033
And each next state,
Xt + 1 is simply B where B is some
00:21:12.033 --> 00:21:17.115
parameter of the model
times the previous state,
00:21:17.115 --> 00:21:20.392
Xt, plus a noise term, right?
00:21:20.392 --> 00:21:25.099
And then, the observations that
you actually get are assumed to be
00:21:25.099 --> 00:21:28.218
a linear combination of the state,
right?
00:21:28.218 --> 00:21:35.940
So Yt = A times Xt + a noise term.
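A minimal sketch of the scalar linear state-space model just described; the parameter values A and B are illustrative choices.

```python
import numpy as np

def simulate_state_space(A, B, n, seed=0):
    """X_{t+1} = B * X_t + state noise;  Y_t = A * X_t + observation noise."""
    rng = np.random.default_rng(seed)
    x = 0.0
    ys = np.empty(n)
    for t in range(n):
        ys[t] = A * x + rng.standard_normal()   # observation equation
        x = B * x + rng.standard_normal()       # hidden-state transition
    return ys

obs = simulate_state_space(A=1.0, B=0.9, n=200)
```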
00:21:38.080 --> 00:21:42.741
So again, here is an example of a particular
00:21:42.741 --> 00:21:47.249
state-space process, right?
00:21:47.249 --> 00:21:52.274
And, really, you can continue as
I said, playing these games and
00:21:52.274 --> 00:21:56.776
keep coming up with different
generative models for this.
00:21:56.776 --> 00:21:59.313
So, we've talked a lot about
00:21:59.313 --> 00:22:04.085
actually this first step in
the classical framework for
00:22:04.085 --> 00:22:09.180
time series analysis,
postulating a model, right?
00:22:09.180 --> 00:22:12.570
So then, how can we perform
this second step, right?
00:22:12.570 --> 00:22:17.084
Estimation: how do you find
the parameters of these kinds
00:22:17.084 --> 00:22:20.910
of autoregressive or
state-space models?
00:22:22.020 --> 00:22:25.970
Well, I guess the first
thing that you probably
00:22:25.970 --> 00:22:30.425
would think about is maximum
likelihood estimation, right?
00:22:30.425 --> 00:22:35.630
Let's make a further assumption
about the noise terms
00:22:35.630 --> 00:22:39.120
that we have introduced, for
instance, that they are Gaussian.
00:22:39.120 --> 00:22:44.308
Then, we can, maybe,
write down the log likelihood
00:22:44.308 --> 00:22:50.882
of the data that we're observing and
try to optimize the parameters
00:22:50.882 --> 00:22:56.209
of the model to minimize
the negative log likelihood.
00:22:56.209 --> 00:23:01.287
For certain other models, there
are also less general techniques
00:23:01.287 --> 00:23:05.730
such as method of moments or
conditional or unconditional
00:23:05.730 --> 00:23:10.536
least square estimation that you
can apply, for instance, to some
00:23:10.536 --> 00:23:14.270
of the autoregressive
models that I mentioned.
00:23:15.690 --> 00:23:21.000
But the question is, we're not in
an ideal scenario anymore, right?
00:23:21.000 --> 00:23:23.920
Do these methods actually work,
right?
00:23:23.920 --> 00:23:27.689
Can we give any guarantees for
these kind of methods?
00:23:27.689 --> 00:23:31.817
And yeah, people actually
have worked on this and
00:23:31.817 --> 00:23:35.947
they have worked out some
learning guarantees for
00:23:35.947 --> 00:23:41.880
these kind of methods under some,
say, additional assumptions.
00:23:41.880 --> 00:23:45.550
Like for instance,
if I wanted to give a guarantee for
00:23:45.550 --> 00:23:50.180
ARMA model that I have described
to you earlier then I would
00:23:50.180 --> 00:23:54.460
impose an assumption
of invertibility.
00:23:54.460 --> 00:23:59.300
Which basically means that this
other characteristic polynomial
00:23:59.300 --> 00:24:04.390
that we had on the noise term in
the formulation of the ARMA model
00:24:04.390 --> 00:24:08.610
has to also have its roots
outside of the unit circle.
00:24:08.610 --> 00:24:12.100
And in that case the model,
the ARMA model is invertible.
00:24:13.330 --> 00:24:17.790
And if I have both say,
invertibility and
00:24:17.790 --> 00:24:22.090
stationarity for ARMA model,
then I can, for instance, give
00:24:22.090 --> 00:24:27.080
a guarantee that the least square
estimator of the autoregressive
00:24:27.080 --> 00:24:31.869
parameters of that model
will converge in probability
00:24:33.470 --> 00:24:36.920
to the true parameters of the model,
right?
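A quick numpy sketch of that consistency claim for the simplest case, least-squares estimation of an AR(1) coefficient; the true value 0.6 and the sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, n = 0.6, 5000
eps = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = a_true * y[t - 1] + eps[t]

# Least-squares estimate: regress y_t on y_{t-1}.
a_hat = np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])
print(abs(a_hat - a_true) < 0.1)  # close to the true coefficient at this sample size
```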
00:24:36.920 --> 00:24:43.570
So, this is an asymptotic learning
guarantee for these kind of models.
00:24:43.570 --> 00:24:47.850
It also requires the model
to be correctly specified.
00:24:47.850 --> 00:24:53.540
But at least, it gives some hope
that these models can work.
00:24:54.960 --> 00:24:57.847
And I should also mention that,
of course,
00:24:57.847 --> 00:25:02.529
I'm talking here about just least
square estimator for ARMA models but
00:25:02.529 --> 00:25:06.120
you can come up with similar
asymptotic guarantees for
00:25:06.120 --> 00:25:10.202
other models and other estimators
that I have mentioned.
00:25:10.202 --> 00:25:15.805
So, actually,
this more or less wraps up the sort
00:25:15.805 --> 00:25:22.280
of lightning fast introduction to
the classical time series analysis.
00:25:22.280 --> 00:25:27.640
I just wanna sort of emphasize
a few things that we talked about.
00:25:27.640 --> 00:25:31.793
First of all, there are many,
many generative models and
00:25:31.793 --> 00:25:36.370
you can come up with a lot of those
generative models yourself and
00:25:36.370 --> 00:25:40.124
model all these phenomena
in whatever way you like.
00:25:40.124 --> 00:25:44.628
The learning guarantees for
these kind of models that we have
00:25:44.628 --> 00:25:48.470
discussed are typically
only asymptotic.
00:25:48.470 --> 00:25:52.820
And, they also require the model
to be correctly specified.
00:25:52.820 --> 00:25:58.730
So, they're not robust to
misspecifications of the model.
00:25:58.730 --> 00:26:04.052
And finally, one of the major
difficulties in actually
00:26:04.052 --> 00:26:09.254
applying these kinds of
models is that you actually
00:26:09.254 --> 00:26:15.890
need to come up with the model of
the non-stationarity yourself.
00:26:15.890 --> 00:26:21.310
So you need to go ahead and look
at the data and then say: this is,
00:26:21.310 --> 00:26:27.320
I think, the kind of non-stationary
phenomenon happening in the data.
00:26:27.320 --> 00:26:32.256
And, this is the kind of model
that I'm going to impose.
00:26:32.256 --> 00:26:35.830
And then in the remainder
of this tutorial,
00:26:35.830 --> 00:26:40.951
we are gonna try to discuss some
new theory and some new learning
00:26:40.951 --> 00:26:45.994
algorithms that would try to
lift these sort of assumptions.
00:26:47.020 --> 00:26:50.129
So, we'll break now for
two minutes, just for
00:26:50.129 --> 00:26:51.853
you to catch your breath.
00:26:51.853 --> 00:26:58.218
And then, we'll continue with
the theory part of the tutorial.
00:27:02.304 --> 00:27:05.914
>> Okay, so we're gonna
continue with the tutorial.
00:27:12.564 --> 00:27:15.055
>> Our next speaker is Mehryar Mohri.
00:27:15.055 --> 00:27:17.523
He's a professor of
computer science and
00:27:17.523 --> 00:27:19.850
mathematics at
the Courant Institute and
00:27:19.850 --> 00:27:23.105
he's also a research
consultant at Google Research.
00:27:26.321 --> 00:27:28.510
>> All right, thank you very much.
00:27:28.510 --> 00:27:33.257
First of all, I wanted to say that
I'm really, really impressed to
00:27:33.257 --> 00:27:37.108
have that many people here for
a time series tutorial.
00:27:37.108 --> 00:27:41.880
We knew that it was popular,
now that's just a proof.
00:27:41.880 --> 00:27:45.635
I promised to say this: this section
is about theory, learning theory.
00:27:45.635 --> 00:27:52.022
And some people asked me,
why do we need learning theory?
00:27:52.022 --> 00:27:56.704
And well, the answer is otherwise
you are left to do parameter
00:27:56.704 --> 00:28:00.756
search in a blind fashion and
there is no principled way of
00:28:00.756 --> 00:28:02.749
designing an algorithm.
00:28:04.680 --> 00:28:07.250
Vitaly gave a description
00:28:07.250 --> 00:28:10.320
of some classical techniques
in time series prediction.
00:28:11.610 --> 00:28:15.900
He also mentioned that there
are some insufficiencies for
00:28:15.900 --> 00:28:17.090
these techniques.
00:28:17.090 --> 00:28:19.160
In particular, they do not have
00:28:20.420 --> 00:28:24.539
the standard guarantees that we
expect for finite training samples.
00:28:25.780 --> 00:28:28.010
Their guarantees are asymptotic.
00:28:29.350 --> 00:28:34.280
They also come with
a whole bunch of assumptions.
00:28:34.280 --> 00:28:37.575
So the concepts and
tools that I'm going to
00:28:37.575 --> 00:28:42.516
discuss in the next few slides
are hopefully gonna be helpful for
00:28:42.516 --> 00:28:47.202
designing algorithms for
time series in a more general way.
00:28:47.202 --> 00:28:52.448
So let me
00:28:52.448 --> 00:28:58.680
start with a formal description
of time series forecasting.
00:28:58.680 --> 00:29:03.870
In that scenario, the learner
receives a finite training sample
00:29:03.870 --> 00:29:09.630
which means a finite realization
of a stochastic process X1,
00:29:09.630 --> 00:29:13.350
Y1, X2, Y2, XT, YT.
00:29:13.350 --> 00:29:20.210
where the X sub t's are in
some input space, capital X.
00:29:20.210 --> 00:29:25.640
And the Y sub Ts in the output
space, the label space, capital Y.
00:29:25.640 --> 00:29:30.410
And I will be denoting by capital Z,
the cross product of X and Y.
00:29:30.410 --> 00:29:35.283
And then, we assume that we have
a loss function that is bounded
00:29:35.283 --> 00:29:39.728
by 1 and that takes as arguments
a hypothesis, little h.
00:29:39.728 --> 00:29:43.380
in some hypothesis set, capital H,
of functions mapping from X to Y.
00:29:44.870 --> 00:29:49.874
And well, an element of Z
that means the pair X and Y.
00:29:49.874 --> 00:29:54.778
So the learner's problem
is to use the hypothesis set,
00:29:54.778 --> 00:29:59.679
capital H, to come up with
a function, little h,
00:29:59.679 --> 00:30:04.381
that has a small Generalization,
if you wish.
00:30:04.381 --> 00:30:07.790
Or path dependence expected loss.
00:30:08.800 --> 00:30:11.740
And that is defined in
the bottom of the slide and
00:30:11.740 --> 00:30:16.300
it will also give me the opportunity
to describe the notation.
00:30:16.300 --> 00:30:23.370
So the calligraphic L(h,Z1T).
00:30:23.370 --> 00:30:27.140
So
first, bold Z1T denotes the sequence
00:30:27.140 --> 00:30:30.740
of the Z's, the training sample,
if you wish, from 1 to T.
00:30:31.800 --> 00:30:36.693
And it is defined as the expectation
at time capital T + 1,
00:30:36.693 --> 00:30:41.670
that is, the time at which you wish
to make a forecast, of the loss of
00:30:41.670 --> 00:30:46.590
the hypothesis for
that ZT + 1 drawn from
00:30:46.590 --> 00:30:51.690
the distribution conditioned on
what you have observed so far.
00:30:51.690 --> 00:30:53.868
That means Z1T.
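To make the setup concrete, here is a minimal Python sketch of a training sample drawn from a stochastic process with a bounded loss; the AR(1) process, the linear hypothesis, and the clipped squared loss are toy choices assumed for illustration, not anything specified in the talk.

```python
import random

def ar1_sample(T, phi=0.9, seed=0):
    """One realization (X1, Y1), ..., (XT, YT) of a toy stochastic
    process: the feature X_t is the previous value of an AR(1) series
    and the label Y_t is its next value."""
    rng = random.Random(seed)
    y, pairs = 0.0, []
    for _ in range(T):
        x = y                               # X_t: last observed value
        y = phi * y + rng.gauss(0.0, 0.1)   # Y_t: next value of the process
        pairs.append((x, y))
    return pairs

def bounded_loss(h, z):
    """Squared loss clipped to [0, 1], matching the talk's assumption
    that the loss function is bounded by 1; z = (x, y) is a point of Z."""
    x, y = z
    return min((h(x) - y) ** 2, 1.0)

sample = ar1_sample(T=50)
h = lambda x: 0.9 * x                       # one hypothesis from H
avg_loss = sum(bounded_loss(h, z) for z in sample) / len(sample)
```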
00:30:55.664 --> 00:31:00.240
Now the standard assumptions
that are made in the analysis of
00:31:00.240 --> 00:31:05.000
time series forecasting
are stationarity and mixing.
00:31:05.000 --> 00:31:09.740
Stationarity was already
described and mentioned by Vitaly.
00:31:09.740 --> 00:31:10.514
In short,
00:31:10.514 --> 00:31:14.731
it means that the distribution
is invariant to shifting.
00:31:16.677 --> 00:31:22.519
Mixing is a different type of
assumption, which essentially
00:31:22.519 --> 00:31:27.684
says that events capital
A that occur after time n + k
00:31:27.684 --> 00:31:32.964
have a dependency with
respect to events capital B that
00:31:32.964 --> 00:31:38.610
occur before time little n,
that decays with the gap little k.
00:31:39.759 --> 00:31:46.140
And this decaying dependency can be
measured in many different ways.
00:31:46.140 --> 00:31:50.880
One way to do so, is to use
the so-called beta mixing parameter.
00:31:50.880 --> 00:31:53.010
There are many other
ways of measuring that.
00:31:53.010 --> 00:31:55.310
Alpha mixing, phi mixing.
00:31:55.310 --> 00:32:02.474
There's almost any Greek letter
that you could append mixing to.
00:32:02.474 --> 00:32:05.257
It so happens that beta
mixing is actually,
00:32:05.257 --> 00:32:07.408
one could argue, the right view.
00:32:07.408 --> 00:32:11.329
And I'm not going to detail exactly
what technically beta mixing
00:32:11.329 --> 00:32:12.184
consists of,
00:32:12.184 --> 00:32:16.870
because you'll see in a minute
I'm gonna criticize all of these.
00:32:16.870 --> 00:32:18.350
And make a lot of enemies, I'm sure.
00:32:20.080 --> 00:32:24.980
So, but before doing so, let me just
say that most people who have been
00:32:24.980 --> 00:32:28.740
studying from the learning theory
point of view time series, and
00:32:28.740 --> 00:32:31.580
that includes me and
many of my colleagues.
00:32:31.580 --> 00:32:36.850
They have been adopting these
assumptions stationarity and mixing.
00:32:36.850 --> 00:32:41.848
And that includes the work by
Vidyasagar on PAC-learning,
00:32:41.848 --> 00:32:47.246
where he's actually arguing that
beta mixing is the right view.
00:32:47.246 --> 00:32:52.933
The generalization bounds given by
Yu in terms of the VC-dimension for
00:32:52.933 --> 00:32:57.295
binary classification,
covering number bounds for
00:32:57.295 --> 00:32:59.776
regression given by Ron Meir.
00:32:59.776 --> 00:33:05.677
We have been giving Rademacher
complexity data dependent bounds for
00:33:05.677 --> 00:33:07.559
general loss functions,
00:33:07.559 --> 00:33:11.520
which arguably gives a more
general view of all of this.
00:33:11.520 --> 00:33:15.460
And then more recently there have
been PAC-Bayesian bounds given by
00:33:15.460 --> 00:33:19.690
Alquier, all of these are making
these assumptions of stationarity and
00:33:19.690 --> 00:33:21.218
beta mixing.
00:33:21.218 --> 00:33:25.730
That includes even
algorithm dependent studies
00:33:25.730 --> 00:33:31.340
such as this study of
AdaBoost by Lozano et al.
00:33:31.340 --> 00:33:37.820
Or again, our study of stable
algorithms in a very general way,
00:33:37.820 --> 00:33:39.570
and many other studies of this kind.
00:33:41.240 --> 00:33:44.580
All these raise some
fundamental questions.
00:33:45.990 --> 00:33:52.210
Because stationarity and mixing,
these assumptions often do not hold.
00:33:52.210 --> 00:33:54.840
They in fact typically do not hold.
00:33:54.840 --> 00:33:57.680
Just think trend or
periodic signals.
00:33:59.140 --> 00:34:02.280
Furthermore, these
assumptions are not testable.
00:34:02.280 --> 00:34:03.950
So how do we know if they even hold?
00:34:05.690 --> 00:34:06.870
It's even worse than this,
00:34:06.870 --> 00:34:09.450
because even if you assume
that mixing does hold.
00:34:10.950 --> 00:34:12.490
Make it even simpler.
00:34:12.490 --> 00:34:16.810
Assume that you even know what
form that beta mixing for
00:34:16.810 --> 00:34:18.310
example would have.
00:34:18.310 --> 00:34:21.530
Assume for example,
that it has an exponential form.
00:34:21.530 --> 00:34:23.440
Even if you do assume that,
00:34:23.440 --> 00:34:27.220
estimating these parameters turns
out to be a very difficult problem.
00:34:29.360 --> 00:34:34.950
But perhaps even worse than
this is that these assumptions,
00:34:34.950 --> 00:34:36.390
these notions,
00:34:36.390 --> 00:34:42.380
do not capture some fundamental
parameters of the problem.
00:34:42.380 --> 00:34:46.356
Those that have structure, that
means the hypothesis set which we
00:34:46.356 --> 00:34:49.409
know in machine learning
cannot be the family of all
00:34:49.409 --> 00:34:53.893
measurable functions, otherwise we
cannot learn, and the loss function.
00:34:55.782 --> 00:34:59.360
So this raises some questions.
00:34:59.360 --> 00:35:04.156
Is learning with
general, non-stationary, non-mixing,
00:35:04.156 --> 00:35:07.144
stochastic processes even possible?
00:35:09.712 --> 00:35:14.579
And can we design algorithms that
come with theoretical guarantees,
00:35:14.579 --> 00:35:17.540
for such broad and
general scenarios?
00:35:17.540 --> 00:35:20.200
Again, just think that those
assumptions really typically
00:35:20.200 --> 00:35:20.740
do not hold.
00:35:22.310 --> 00:35:24.909
We need a new tool for
this analysis.
00:35:26.220 --> 00:35:30.770
So, to do so let's just
start from first principles.
00:35:31.780 --> 00:35:34.320
Suppose you have a single
hypothesis h, so
00:35:34.320 --> 00:35:39.100
fixed h,
what are we really interested in?
00:35:39.100 --> 00:35:43.060
What we care about is
the expected loss or,
00:35:43.060 --> 00:35:48.100
remember again that conditional
expectation of the loss given
00:35:48.100 --> 00:35:52.910
the sample, at time t
+ 1, the time of forecasting.
00:35:52.910 --> 00:35:57.450
And what we receive is
just training points.
00:35:57.450 --> 00:36:01.780
So the difference between the
expectation that expected loss at
00:36:01.780 --> 00:36:06.780
time t + 1, and
the expected loss of the same
00:36:06.780 --> 00:36:11.120
hypothesis h at time t,
is really the key quantity here.
00:36:11.120 --> 00:36:13.350
That is what we really care about.
00:36:13.350 --> 00:36:18.496
So of course, we're getting a full
sample from time 1 to capital T,
00:36:18.496 --> 00:36:23.201
so it's the average of this
quantity, that becomes the really
00:36:23.201 --> 00:36:28.201
important difference that is
related to time series prediction.
00:36:30.303 --> 00:36:33.080
This is for a single hypothesis.
00:36:33.080 --> 00:36:38.600
If you take the supremum
over all hypothesis
00:36:38.600 --> 00:36:43.630
this gives a definition, a quantity,
that we call discrepancy.
00:36:44.990 --> 00:36:50.090
And so you can see at the top of
the slide, it's the supremum over
00:36:50.090 --> 00:36:55.030
all hypotheses of that average
difference that I mentioned before.
00:36:56.120 --> 00:37:00.240
Notice that this quantity, unlike
notions that are purely related to
00:37:00.240 --> 00:37:04.870
the stochastic process,
captures the loss function.
00:37:04.870 --> 00:37:07.565
And it's directly depending
on the hypotheses.
00:37:10.560 --> 00:37:16.930
Perhaps even more importantly, this
quantity can be actually estimated
00:37:16.930 --> 00:37:22.599
from data, as we will see under
some relatively mild assumptions.
00:37:24.580 --> 00:37:29.360
Notice that the discrepancy
delta is 0 in the IID case.
00:37:29.360 --> 00:37:31.940
That's really straightforward
to see.
00:37:31.940 --> 00:37:37.397
And one can show also that in
some broader cases for example,
00:37:37.397 --> 00:37:44.730
for weakly stationary processes with
linear hypotheses and squared loss,
00:37:44.730 --> 00:37:48.450
the discrepancy is also just 0.
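To illustrate the definition, here is a toy computation of the discrepancy (my own construction, not from the talk): for a process Y_t = mu_t + Gaussian noise with constant hypotheses h_c(x) = c and squared loss, the expected loss at time t is (c - mu_t)^2 + sigma^2 in closed form, and a finite grid of c values stands in for the supremum over H.

```python
def discrepancy(mus, sigma=0.1, grid=None):
    """Discrepancy for a toy process Y_t = mu_t + noise with constant
    hypotheses h_c(x) = c and squared loss: sup over c of the gap
    between the expected loss at the forecast time T+1 (last entry of
    mus) and the uniform average of expected losses over t = 1..T."""
    if grid is None:
        grid = [i / 100.0 for i in range(-200, 201)]
    *past, mu_next = mus
    exp_loss = lambda c, mu: (c - mu) ** 2 + sigma ** 2
    return max(
        abs(exp_loss(c, mu_next)
            - sum(exp_loss(c, m) for m in past) / len(past))
        for c in grid
    )

# Stationary means: every mu_t equal, so the discrepancy vanishes.
d_stationary = discrepancy([0.5] * 11)
# Drifting means: the distribution shifts, so the discrepancy is large.
d_drifting = discrepancy([t / 10.0 for t in range(11)])
```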
00:37:48.450 --> 00:37:52.370
This definition of discrepancy
is based on an average of those
00:37:52.370 --> 00:37:53.710
differences.
00:37:53.710 --> 00:37:56.218
The average here is uniform.
00:37:56.218 --> 00:38:02.110
But, we might actually wish to
use an even richer definition,
00:38:02.110 --> 00:38:06.760
still really geared towards
the analysis of this problem,
00:38:06.760 --> 00:38:09.850
which we call
the weighted discrepancy.
00:38:09.850 --> 00:38:10.580
And now,
00:38:10.580 --> 00:38:16.270
we are not just using a uniform
average of these differences.
00:38:16.270 --> 00:38:21.220
But instead we use some weights,
q1, q2, up to q sub capital T,
00:38:22.700 --> 00:38:26.040
for defining that averaging.
00:38:26.040 --> 00:38:30.770
These weights, the q sub t's,
are going to be crucial.
00:38:31.800 --> 00:38:36.240
Particularly when it's gonna
come to an algorithmic design.
00:38:36.240 --> 00:38:37.130
Why?
00:38:37.130 --> 00:38:43.070
Because you would wish
to emphasize points T,
00:38:43.070 --> 00:38:49.390
XTs or ZTs in your sample whose
difference, whose expected
00:38:49.390 --> 00:38:53.950
losses are close to the time at
which you're gonna be forecasting.
00:38:53.950 --> 00:38:59.160
And on the contrary, deemphasize
those sample points for which
00:38:59.160 --> 00:39:04.000
the expected loss is much further
away from the time of prediction.
00:39:04.000 --> 00:39:08.110
These weights q sub t's are going
to help us, do this emphasizing and
00:39:08.110 --> 00:39:08.829
deemphasizing.
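Continuing the same toy model (an assumed illustration, not code from the talk), the weighted discrepancy simply replaces the uniform average with the weights q, and one can check numerically that emphasizing recent points shrinks it for a drifting process.

```python
def weighted_discrepancy(mus, q, sigma=0.1, grid=None):
    """Weighted discrepancy in a toy model (Y_t = mu_t + noise,
    constant hypotheses h_c, squared loss): the uniform average over
    past times is replaced by weights q_1..q_T summing to 1."""
    if grid is None:
        grid = [i / 100.0 for i in range(-200, 201)]
    *past, mu_next = mus
    exp_loss = lambda c, mu: (c - mu) ** 2 + sigma ** 2
    return max(
        abs(exp_loss(c, mu_next)
            - sum(w * exp_loss(c, m) for w, m in zip(q, past)))
        for c in grid
    )

T = 10
mus = [t / 10.0 for t in range(T + 1)]      # drifting means, forecast at T+1
uniform = [1.0 / T] * T
raw = [2.0 ** t for t in range(T)]          # heavier weight on recent points
recency = [w / sum(raw) for w in raw]
d_uniform = weighted_discrepancy(mus, uniform)
d_recency = weighted_discrepancy(mus, recency)
# Emphasizing recent points shrinks the discrepancy for a drifting process.
```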
00:39:11.110 --> 00:39:12.760
Two words about this.
00:39:12.760 --> 00:39:19.469
This definition of discrepancy
or weighted discrepancy,
00:39:19.469 --> 00:39:24.937
strictly generalizes
previous definitions that
00:39:24.937 --> 00:39:30.527
we have been giving still
in the context of drifting,
00:39:30.527 --> 00:39:35.995
with Andres Munoz Medina or
for Domain adaptation,
00:39:35.995 --> 00:39:39.995
and with Corinna Cortes.
00:39:39.995 --> 00:39:45.490
And that also is related to
the notion of discrepancy or
00:39:45.490 --> 00:39:50.148
a distance that Ben-David and
others, and
00:39:50.148 --> 00:39:55.296
also others,
have given in the past.
00:39:55.296 --> 00:39:58.742
We can also give upper bounds for
this discrepancy,
00:39:58.742 --> 00:40:03.220
in terms of some familiar quantities
such as the relative entropy.
00:40:03.220 --> 00:40:05.249
But again if we were to do so,
00:40:05.249 --> 00:40:10.156
we would be losing this interesting
aspect of the discrepancy that is
00:40:10.156 --> 00:40:14.318
taking into account the loss
function on the hypothesis.
00:40:14.318 --> 00:40:18.590
We can also give upper bounds in
terms of other quantities such as
00:40:18.590 --> 00:40:23.181
the phi-mixing coefficients of
the process,
00:40:23.181 --> 00:40:26.930
for, of course, asymptotically
stationary processes.
00:40:26.930 --> 00:40:30.510
I mentioned that the discrepancy
can be estimated, so
00:40:30.510 --> 00:40:32.710
let me discuss this a
little bit on this slide.
00:40:33.730 --> 00:40:38.370
The discrepancy can be
decomposed into two parts.
00:40:38.370 --> 00:40:42.750
Suppose that you wish to make
a prediction at time T plus one.
00:40:44.662 --> 00:40:48.460
There's gonna be some interval,
say a small interval of length
00:40:48.460 --> 00:40:53.080
s just before the time of prediction
at which you don't expect
00:40:53.080 --> 00:40:57.760
the distribution to
have changed radically.
00:40:57.760 --> 00:41:01.550
So you sort of expect that the blue
part, the blue arrow here,
00:41:01.550 --> 00:41:06.710
that indicates the difference
between the discrepancy if you wish,
00:41:06.710 --> 00:41:09.410
between the loss at time T+1,
00:41:09.410 --> 00:41:14.140
and the average loss within
that interval of length, s.
00:41:14.140 --> 00:41:16.520
You expect that not to be too large.
00:41:16.520 --> 00:41:19.200
You expect that actually
to be relatively small.
00:41:19.200 --> 00:41:21.930
In fact,
I'm going to go a bit further,
00:41:21.930 --> 00:41:27.550
you can prove that if this is not
the case, learning is not possible.
00:41:28.740 --> 00:41:33.260
Okay, so we have to make that
assumption that there is going to be
00:41:33.260 --> 00:41:37.840
some s, for which this
difference is not too large.
00:41:37.840 --> 00:41:41.850
Assuming that now, which seems
to be a natural assumption,
00:41:41.850 --> 00:41:46.110
then we can bound the discrepancy
by the sum of two terms.
00:41:46.110 --> 00:41:49.920
One is the term that
I just described,
00:41:49.920 --> 00:41:54.820
which at the top of this
slide is denoted by delta s,
00:41:54.820 --> 00:41:57.140
s referring to the size
of the interval.
00:41:57.140 --> 00:41:59.150
And the other one is the first term.
00:41:59.150 --> 00:42:03.650
Which is the discrepancy between
that interval of length s,
00:42:03.650 --> 00:42:06.850
and whatever else you have
had in your training sample.
00:42:07.860 --> 00:42:13.754
And that is the first term that you
see, that delta of q up there, okay.
00:42:13.754 --> 00:42:18.608
And it turns out that that's
a delta zero of q.
00:42:18.608 --> 00:42:22.500
That delta zero of q is the
first term upper bounding delta q.
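The decomposition can be checked numerically in the same closed-form toy model (an assumed illustration): by the triangle inequality, the weighted discrepancy is at most delta_s, the gap between the forecast time and the recent window, plus delta_0(q), the gap between that window and the q-weighted sample.

```python
def sup_gap(f, g, grid):
    """sup over the hypothesis grid of |f(c) - g(c)|."""
    return max(abs(f(c) - g(c)) for c in grid)

def decomposition(mus, q, s, sigma=0.1):
    """Toy check of delta(q) <= delta_s + delta_0(q) for a process
    Y_t = mu_t + noise, constant hypotheses h_c, squared loss.  mus
    holds mu_1..mu_{T+1}; s is the size of the recent window."""
    grid = [i / 100.0 for i in range(-200, 201)]
    *past, mu_next = mus
    el = lambda c, mu: (c - mu) ** 2 + sigma ** 2
    future = lambda c: el(c, mu_next)
    window = lambda c: sum(el(c, m) for m in past[-s:]) / s
    weighted = lambda c: sum(w * el(c, m) for w, m in zip(q, past))
    delta_q = sup_gap(future, weighted, grid)   # weighted discrepancy
    delta_s = sup_gap(future, window, grid)     # forecast vs recent window
    delta_0 = sup_gap(window, weighted, grid)   # window vs weighted sample
    return delta_q, delta_s, delta_0

dq, ds, d0 = decomposition([t / 10.0 for t in range(11)],
                           [0.1] * 10, s=3)
```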
00:42:23.550 --> 00:42:27.340
Notice that it's all in terms
of the expected loss again, or
00:42:27.340 --> 00:42:30.320
conditional expectation
that calligraphic L.
00:42:31.670 --> 00:42:33.530
And it turns out that that term,
00:42:33.530 --> 00:42:37.590
delta zero of Q can be in fact
accurately estimated from data.
00:42:37.590 --> 00:42:38.970
Not so surprisingly,
00:42:38.970 --> 00:42:41.930
right, because what would you
do if you were to estimate it?
00:42:43.610 --> 00:42:45.690
Well you would consider
the same term,
00:42:45.690 --> 00:42:48.520
with the calligraphic L's
replaced by the observed loss.
00:42:48.520 --> 00:42:50.990
The losses you've
observed on the sample.
00:42:50.990 --> 00:42:54.340
It turns out that this is actually
a very good estimate of this
00:42:54.340 --> 00:42:55.600
quantity, okay.
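A minimal sketch of that plug-in estimate (toy data and a finite hypothesis set are assumed for illustration; nothing here is code from the talk):

```python
def delta_hat(zs, q, s, hyps, loss):
    """Plug-in estimate of delta_0(q): the conditional expected losses
    are replaced by the losses actually observed on the sample.
    zs = z_1..z_T, q = weights over the sample, s = recent window size,
    hyps = a finite set standing in for the sup over H."""
    T = len(zs)
    window = zs[T - s:]
    return max(
        abs(sum(loss(h, z) for z in window) / s
            - sum(w * loss(h, z) for w, z in zip(q, zs)))
        for h in hyps
    )

# Toy data and hypotheses: predict a constant, squared loss on labels.
zs = [(0.0, t / 10.0) for t in range(10)]
loss = lambda h, z: (h - z[1]) ** 2
hyps = [0.0, 0.5, 1.0]
# Weights that put all mass on the window make the estimate vanish.
q_window = [0.0] * 7 + [1.0 / 3] * 3
d_window = delta_hat(zs, q_window, 3, hyps, loss)
d_uniform = delta_hat(zs, [0.1] * 10, 3, hyps, loss)
```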
00:42:56.610 --> 00:42:59.640
So now if you use this
notion of weighted
00:42:59.640 --> 00:43:01.630
discrepancy that I introduced.
00:43:01.630 --> 00:43:06.260
Which again, essentially comes
from the first principles.
00:43:06.260 --> 00:43:12.960
And you apply some general tools
that I will briefly describe,
00:43:12.960 --> 00:43:17.680
you obtain the following learning
guarantee which is very general.
00:43:19.220 --> 00:43:23.660
This theorem says that,
with high probability for
00:43:23.660 --> 00:43:28.040
all hypotheses h in capital H.
00:43:28.040 --> 00:43:32.273
And for all alpha; alpha, think of
it as a small number, let's say 1/T
00:43:32.273 --> 00:43:36.580
or 1/ square root of T,
where T is the size of your sample.
00:43:36.580 --> 00:43:41.850
So, for all hypotheses,
the left-hand side, that is
00:43:41.850 --> 00:43:46.060
the path dependent expected
loss of the hypothesis h.
00:43:46.060 --> 00:43:49.680
The quantity that you wish
to minimize is upper bounded
00:43:49.680 --> 00:43:52.900
by the first term,
which is the weighted
00:43:54.090 --> 00:43:59.020
training loss where the weights
are these Q sub Ts that
00:43:59.020 --> 00:44:02.730
you might wish to play
with algorithmically.
00:44:02.730 --> 00:44:06.270
Plus a term that is exactly
the weighted discrepancy.
00:44:07.610 --> 00:44:09.720
Alpha or two alpha, I already said,
00:44:09.720 --> 00:44:15.000
you can essentially disregard
it as being a small number, plus
00:44:15.000 --> 00:44:21.360
the norm of the weight vector,
q1 qt, that you consider.
00:44:21.360 --> 00:44:23.680
Think that if you
use uniform weights,
00:44:23.680 --> 00:44:26.780
that norm two of q
is just one over the square root of t.
00:44:26.780 --> 00:44:33.450
Okay, and
then the square root of 2 times
00:44:33.450 --> 00:44:38.550
the log of a quantity that is the
expectation of a covering number.
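Here is the shape of the bound assembled term by term (a sketch only; the `cover` argument stands in for the expected sequential covering number, which is not actually computable here, and all numbers are assumed toy values):

```python
import math

def bound_rhs(losses, q, disc, cover, alpha):
    """Right-hand side of the learning guarantee, assembled term by
    term: q-weighted training loss + weighted discrepancy + 2*alpha
    + ||q||_2 * sqrt(2 * log(cover)).  'cover' is a placeholder for
    the expected sequential covering number."""
    weighted_loss = sum(w * l for w, l in zip(q, losses))
    q_norm = math.sqrt(sum(w * w for w in q))
    return weighted_loss + disc + 2 * alpha + q_norm * math.sqrt(2 * math.log(cover))

T = 100
uniform = [1.0 / T] * T
q_norm = math.sqrt(sum(w * w for w in uniform))   # equals 1/sqrt(T)
rhs = bound_rhs([0.1] * T, uniform, disc=0.05, cover=1000, alpha=1.0 / T)
```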
00:44:38.550 --> 00:44:41.210
Okay I think I've
scared you enough now
00:44:41.210 --> 00:44:44.510
actually raise your hand
if you're not scared yet.
00:44:46.630 --> 00:44:50.680
Okay, I managed to scare a few,
then.
00:44:50.680 --> 00:44:53.000
So this is the expected
covering number.
00:44:53.000 --> 00:44:56.080
For now, just think of it as
a measure of the complexity
00:44:56.080 --> 00:44:59.540
of the family of
functions capital G.
00:44:59.540 --> 00:45:04.470
Which is the family of the losses
of the hypothesis little h.
00:45:04.470 --> 00:45:06.720
That's what's indicated
in the bottom.
00:45:06.720 --> 00:45:09.190
This is the family of
functions that are mapping Z
00:45:09.190 --> 00:45:11.330
to the loss of h comma Z.
00:45:12.580 --> 00:45:17.870
The exact definition of this
sequential covering number, or
00:45:17.870 --> 00:45:22.730
expected sequential covering number,
I will try to just say a few words
00:45:22.730 --> 00:45:27.880
about a bit later, but for now think
that this type of learning guarantee
00:45:27.880 --> 00:45:32.020
is actually not very different
from what we have seen up to now.
00:45:32.020 --> 00:45:33.270
What does it say,
00:45:33.270 --> 00:45:37.700
in short it's simply saying that the
generalization error is bounded by
00:45:37.700 --> 00:45:42.710
the empirical error, plus a last
term which is just a complexity term.
00:45:42.710 --> 00:45:47.850
All of this is familiar; the only new term
really is that weighted discrepancy.
00:45:48.950 --> 00:45:49.940
Okay, the delta of q.
00:45:49.940 --> 00:45:52.880
It turns out that that delta
of q is really crucial.
00:45:52.880 --> 00:45:56.470
In fact for some scenarios,
say drifting for example,
00:45:56.470 --> 00:46:00.370
you can show that you cannot
do away with that discrepancy.
00:46:00.370 --> 00:46:03.080
It's essentially the best
measure here that you can
00:46:03.080 --> 00:46:04.830
come up with, okay?
00:46:04.830 --> 00:46:06.919
So the discrepancy is really there.
00:46:07.960 --> 00:46:11.725
Furthermore, and Vitaly will
get into the details of this,
00:46:11.725 --> 00:46:15.530
these sorts of learning guarantees
can really directly guide
00:46:15.530 --> 00:46:17.340
the design of our algorithms.
00:46:17.340 --> 00:46:17.990
How?
00:46:17.990 --> 00:46:22.750
Well the choice of the hypothesis
little h and the choice of these
00:46:22.750 --> 00:46:26.839
weights q one qt that I mentioned
are really, really crucial here.
00:46:27.900 --> 00:46:31.368
They can be chosen precisely by
following this bound,
00:46:31.368 --> 00:46:34.433
if you wish to minimize this bound.
00:46:34.433 --> 00:46:37.170
Again, Vitaly will say
a lot more about this.
00:46:37.170 --> 00:46:40.280
Now there is still a problem
here because delta
00:46:40.280 --> 00:46:43.250
of q is that true discrepancy.
00:46:43.250 --> 00:46:45.990
We don't have that true discrepancy,
but
00:46:45.990 --> 00:46:48.720
the good news is we can estimate it.
00:46:48.720 --> 00:46:52.290
So we just have to work
with the estimate of that.
00:46:52.290 --> 00:46:57.760
As a result, as I mentioned before,
the discrepancy can be estimated, and
00:46:57.760 --> 00:47:00.740
there is another learning bound
00:47:00.740 --> 00:47:04.010
that is in terms of the estimate
of that discrepancy.
00:47:05.190 --> 00:47:07.410
So that's the delta hat of q.
00:47:08.520 --> 00:47:10.970
Which previously was denoted
00:47:10.970 --> 00:47:14.220
as the estimate
of that delta zero of q.
00:47:14.220 --> 00:47:16.220
If you look at that delta hat,
00:47:16.220 --> 00:47:19.960
it's just precisely the same
expression as delta zero of q where
00:47:19.960 --> 00:47:23.079
I have replaced the calligraphic
l by capital L.
00:47:24.480 --> 00:47:25.020
Okay?
00:47:25.020 --> 00:47:29.200
So it's just precisely
the empirical version of that.
00:47:29.200 --> 00:47:30.640
So this learning bound,
00:47:30.640 --> 00:47:33.920
which only differs from the previous
one by the presence of
00:47:33.920 --> 00:47:38.360
the empirical discrepancies
versus the true discrepancies
00:47:38.360 --> 00:47:42.780
is going to be a quantity that
now you can directly work with
00:47:42.780 --> 00:47:46.010
when it comes to algorithms and
when it comes to guarantees.
00:47:47.030 --> 00:47:52.030
This learning bound is
probably the first
00:47:52.030 --> 00:47:57.540
learning guarantee that is given for
general stochastic processes.
00:47:57.540 --> 00:48:01.020
There is no assumption
about stationarity or mixing.
00:48:01.020 --> 00:48:04.250
The quantity, really, that is
critical here is the discrepancy or
00:48:04.250 --> 00:48:06.240
the weighted discrepancy.
00:48:06.240 --> 00:48:06.980
And, again,
00:48:06.980 --> 00:48:10.500
I have already said what I thought
about the importance of those weights.
00:48:10.500 --> 00:48:11.730
Okay.
00:48:11.730 --> 00:48:15.550
So, if you have survived so
far, you've already
00:48:15.550 --> 00:48:21.360
had the main take aways
from this theory section.
00:48:21.360 --> 00:48:24.220
But now,
I'm gonna say a few more words, and
00:48:24.220 --> 00:48:27.430
that's only for
those who can still survive, about
00:48:27.430 --> 00:48:33.240
the definition of the weighted
sequential covering number.
00:48:34.570 --> 00:48:39.220
To tell you that in the case
of stochastic processes,
00:48:39.220 --> 00:48:42.640
we cannot just use
the standard covering number.
00:48:42.640 --> 00:48:45.220
Instead, we have to have
a sequential version of that.
00:48:46.360 --> 00:48:48.280
So a sequential,
00:48:48.280 --> 00:48:52.950
well, a sequential cover of
a tree is defined as follows.
00:48:52.950 --> 00:48:54.630
Suppose you have a full tree,
00:48:54.630 --> 00:48:56.300
such as the one that is
displayed on this slide.
00:48:57.915 --> 00:48:59.765
A full tree of depth capital T.
00:49:01.485 --> 00:49:04.565
And let's call that tree, bold z.
00:49:04.565 --> 00:49:07.695
That means that every
node of that full tree
00:49:07.695 --> 00:49:11.275
is labeled with some
element of capital z.
00:49:11.275 --> 00:49:14.375
Z1 for the root.
00:49:14.375 --> 00:49:16.205
Z2, Z3, etc., etc.
00:49:17.330 --> 00:49:23.480
And what a cover of that tree is,
is a set of trees, capital V.
00:49:24.660 --> 00:49:28.990
Such that no matter what path you
choose in this tree, for example,
00:49:28.990 --> 00:49:36.430
if you were choosing the red path
there would be a tree in capital V.
00:49:36.430 --> 00:49:42.060
A tree v in capital V such that
along this path, the vector v and
00:49:42.060 --> 00:49:47.510
the vector of g of z's would
be very close to each other.
00:49:47.510 --> 00:49:52.131
In fact, the norm of
the difference between these
00:49:52.131 --> 00:49:56.372
vectors will be bounded by
alpha divided by the norm of q.
00:49:56.372 --> 00:50:01.647
Okay, that's for example,
what's indicated in the corner
00:50:01.647 --> 00:50:06.425
bottom corner of the slide for
the particular red path.
00:50:06.425 --> 00:50:11.881
So when that holds in general,
that means for any tree Z, where you
00:50:11.881 --> 00:50:17.044
label the tree with the Zi's and
there exists a v such that you can
00:50:17.044 --> 00:50:23.550
be close to that g of the Z's along that
path, when this is true every time.
00:50:23.550 --> 00:50:28.066
Then the family of trees,
capital V, provides
00:50:28.066 --> 00:50:33.890
a sequential
alpha cover of the tree Z, okay?
00:50:33.890 --> 00:50:40.310
So, what we would want is not just
any family capital V of trees.
00:50:40.310 --> 00:50:43.240
We want the family capital V
of trees that has the smallest
00:50:43.240 --> 00:50:45.910
cardinality, okay?
00:50:45.910 --> 00:50:48.670
The cardinality of the smallest
family capital V
00:50:48.670 --> 00:50:54.390
that can cover Z is the sequential
covering number of the tree Z.
00:50:55.750 --> 00:50:58.140
That's for a particular tree Z.
00:50:58.140 --> 00:51:02.769
Really what we care about is
not just any particular tree Z,
00:51:02.769 --> 00:51:05.465
it's the expected covering number.
00:51:05.465 --> 00:51:08.610
We take the expectation over Z,
over all the trees.
00:51:10.090 --> 00:51:15.530
And so to describe the expectation,
00:51:15.530 --> 00:51:18.410
I'm just going to describe
it briefly on this slide.
00:51:18.410 --> 00:51:21.750
Look at the tree now
that is shown here.
00:51:21.750 --> 00:51:26.720
You can play the following game
at time one which is at the root,
00:51:26.720 --> 00:51:30.380
you can draw two samples
if you want Z1 and
00:51:30.380 --> 00:51:34.310
Z prime 1 according
to the distribution.
00:51:34.310 --> 00:51:39.010
And then if you go to the right,
you can draw
00:51:39.010 --> 00:51:43.620
a Z3 and
Z prime 3 again independently, but
00:51:43.620 --> 00:51:49.310
conditioned on the prime variable,
Z prime 1.
00:51:49.310 --> 00:51:54.300
If you go to the left, you can draw
Z2 and Z prime 2 independently,
00:51:54.300 --> 00:51:57.820
but conditioned on the Z1 and
so forth.
00:51:58.990 --> 00:52:03.360
So the expectation of the covering
numbers that I talked about,
00:52:03.360 --> 00:52:08.410
that's an expectation over Z
where the Zs are drawn according
00:52:08.410 --> 00:52:13.390
to the stochastic process that I
just described, okay, all right.
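The node-by-node sampling process just described can be sketched as follows; the conditional sampler `ar_draw` is a toy assumption standing in for the actual stochastic process, not anything from the talk:

```python
import random

def sample_tree(depth, draw, history=(), rng=None):
    """Sample one labeled full binary tree following the process in the
    talk: at each node draw Z and Z' independently from the conditional
    distribution given the path so far; the left subtree then
    conditions on Z and the right subtree on Z'."""
    if rng is None:
        rng = random.Random(0)
    if depth == 0:
        return None
    z = draw(history, rng)        # Z: label of this node
    z_prime = draw(history, rng)  # Z': same conditional law, independent
    return {
        "label": z,
        "left": sample_tree(depth - 1, draw, history + (z,), rng),
        "right": sample_tree(depth - 1, draw, history + (z_prime,), rng),
    }

def ar_draw(history, rng):
    """Toy conditional sampler: an AR(1)-style draw given the path."""
    prev = history[-1] if history else 0.0
    return 0.9 * prev + rng.gauss(0.0, 1.0)

tree = sample_tree(depth=3, draw=ar_draw)
```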
00:52:13.390 --> 00:52:16.220
All of that looks
a little bit complicated.
00:52:16.220 --> 00:52:17.950
It's not a surprise, right?
00:52:17.950 --> 00:52:19.630
We're dealing with a much,
00:52:19.630 --> 00:52:23.620
much more complex situation in
the time series prediction.
00:52:23.620 --> 00:52:28.750
We are far away from the easy
IID case where we can just apply
00:52:28.750 --> 00:52:31.900
blindly a whole bunch of
learning theory tools.
00:52:31.900 --> 00:52:34.098
Now we have to do a little
bit more work, okay?
00:52:36.080 --> 00:52:40.620
So if I have still some time,
and I'm not sure how much I do,
00:52:40.620 --> 00:52:44.640
I don't see anybody jumping on me,
so I'm just gonna say a few words.
00:52:45.650 --> 00:52:47.110
I have two minutes.
00:52:47.110 --> 00:52:48.524
Then in those two minutes.
00:52:48.524 --> 00:52:49.949
>> You have 15 minutes.
00:52:49.949 --> 00:52:53.196
>> I have 15 minutes for
the theory section?
00:52:53.196 --> 00:52:57.933
>> [INAUDIBLE]
>> I have what?
00:52:57.933 --> 00:53:00.388
>> [INAUDIBLE]
>> Okay, so it went from 15 to 2.
00:53:00.388 --> 00:53:01.788
>> [LAUGH]
>> See,
00:53:01.788 --> 00:53:04.754
this is the way negotiations go.
00:53:04.754 --> 00:53:06.840
So, you can see,
I should not negotiate anything.
00:53:09.080 --> 00:53:11.180
So, I went back to two minutes and
00:53:11.180 --> 00:53:13.910
that's perhaps also what
you would prefer, right?
00:53:13.910 --> 00:53:15.680
We should just-
>> [LAUGH]
00:53:15.680 --> 00:53:16.480
>> You've
00:53:16.480 --> 00:53:18.010
been suffering quite a bit.
00:53:18.010 --> 00:53:22.580
Let me just say, we've decided that
you would have a longer break here.
00:53:22.580 --> 00:53:24.710
So, you really deserve
it after this, but
00:53:24.710 --> 00:53:27.260
hopefully you got the big picture.
00:53:27.260 --> 00:53:31.730
So, I'm just gonna give you some
really just the high level ideas for
00:53:31.730 --> 00:53:34.740
the proof of that learning
bound that I just threw at you.
00:53:35.995 --> 00:53:41.755
So the way the proof starts is
in the standard way of applying
00:53:41.755 --> 00:53:46.455
a Chernoff bounding technique which
consists of introducing a variable,
00:53:46.455 --> 00:53:51.645
a positive scalar t that
you would wish to optimize,
00:53:51.645 --> 00:53:55.065
near the end of your proof
to get the best bound.
00:53:55.065 --> 00:53:58.640
And simply applying
Markov's inequality.
00:53:58.640 --> 00:54:03.020
So this part is really not so
difficult.
00:54:03.020 --> 00:54:07.920
Next, you have to come up
with a symmetrization.
00:54:07.920 --> 00:54:10.750
And here again, I'm just gonna
give you the high level ideas.
00:54:10.750 --> 00:54:13.650
The high level ideas and
the deep techniques behind this
00:54:14.680 --> 00:54:19.550
relate to the notion of
decoupled tangent sequences.
00:54:19.550 --> 00:54:26.350
For any sequence Z1 to ZT, there exists
an associated tangent sequence
00:54:26.350 --> 00:54:31.330
Z prime 1 to Z prime T, that is, a
tangent sequence associated to that.
00:54:31.330 --> 00:54:32.410
What does that mean?
00:54:32.410 --> 00:54:36.487
It simply means the following,
00:54:36.487 --> 00:54:43.131
the sequence Z prime T is such
that Z prime little t and
00:54:43.131 --> 00:54:50.079
Zt are independent,
given the past Z1 to Zt minus 1.
00:54:50.079 --> 00:54:54.800
So imagine that you in fact
draw Z prime in this fashion.
00:54:54.800 --> 00:54:56.410
Whenever we have reached time t,
00:54:56.410 --> 00:55:00.000
you draw Zt of course conditioned
on what you have seen so far,
00:55:00.000 --> 00:55:02.289
but Z prime t also in
an independent fashion.
00:55:03.590 --> 00:55:07.058
Okay, so it's clear that you can
always construct that decoupled
00:55:07.058 --> 00:55:08.082
tangent sequence.
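A sketch of constructing such a decoupled tangent sequence for a toy AR(1) process (an assumed illustration):

```python
import random

def tangent_pair(T, phi=0.9, seed=0):
    """Draw Z_1..Z_T of an AR(1) process together with a decoupled
    tangent sequence Z'_1..Z'_T: at each step t, Z_t and Z'_t are drawn
    independently from the same conditional distribution given the past
    Z_1..Z_{t-1}.  Only Z feeds the process forward; Z' never does."""
    rng = random.Random(seed)
    zs, zs_prime = [], []
    prev = 0.0
    for _ in range(T):
        mean = phi * prev
        z = mean + rng.gauss(0.0, 1.0)        # Z_t | Z_1..Z_{t-1}
        z_prime = mean + rng.gauss(0.0, 1.0)  # Z'_t, same conditional law
        zs.append(z)
        zs_prime.append(z_prime)
        prev = z                              # condition on Z, not Z'
    return zs, zs_prime

zs, zs_prime = tangent_pair(T=20)
```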
00:55:08.082 --> 00:55:14.186
This way of decoupling, coming up
with a Z prime that decouples and
00:55:14.186 --> 00:55:19.636
has this independent aspect to
it is crucial to the proof and
00:55:19.636 --> 00:55:22.552
the analysis, okay?
00:55:22.552 --> 00:55:26.110
So that's what is going
on on this slide.
00:55:26.110 --> 00:55:31.121
In the next slide, it's another
property of tangent sequences
00:55:31.121 --> 00:55:36.057
that we're using for
the introduction of variables.
00:55:36.057 --> 00:55:40.890
And then in the last slide,
we are making use of that
00:55:40.890 --> 00:55:46.029
somewhat complicated expected
sequential covering number
00:55:46.029 --> 00:55:51.180
that I mentioned by applying
a standard union bound, okay?
00:55:51.180 --> 00:55:54.790
So I just gave you a very, very
high level idea of the proof, but
00:55:54.790 --> 00:55:57.634
it's essentially what
you would really need.
00:55:57.634 --> 00:55:59.220
And as Vitaly mentioned,
00:55:59.220 --> 00:56:02.520
of course you will be able
to look at these slides and
00:56:02.520 --> 00:56:06.886
look at the details to be convinced
that indeed the proof is correct.
00:56:06.886 --> 00:56:10.690
Okay, so I'm gonna stop here and
I wish you a long break and
00:56:10.690 --> 00:56:15.460
when you come back, though,
you will hear about how this theory
00:56:15.460 --> 00:56:19.470
can be used effectively to
come up with algorithms.
00:56:19.470 --> 00:56:21.988
Okay, that have strong
guarantees now, and
00:56:21.988 --> 00:56:23.649
Vitaly will describe that.
00:56:23.649 --> 00:56:25.206
Okay, thank you.
00:56:25.206 --> 00:56:34.288
>> [APPLAUSE]
>> Okay,
00:56:34.288 --> 00:56:39.208
so, let's get settled in and
let's continue and
00:56:39.208 --> 00:56:44.598
talk a little bit about
algorithms that one can develop
00:56:44.598 --> 00:56:50.710
based on this theory that we
have discussed before the break.
00:56:50.710 --> 00:56:54.360
And I hope you're all
not too scared by it.
00:56:54.360 --> 00:56:57.920
But the bad news is that
actually I'll have to
00:56:57.920 --> 00:57:02.330
show a bit of theory for
you again but don't be discouraged.
00:57:02.330 --> 00:57:03.790
This is just a review actually so
00:57:03.790 --> 00:57:06.510
you should all know
what it is already.
00:57:06.510 --> 00:57:12.110
It's actually exactly the same bound
that you have seen before the break.
00:57:12.110 --> 00:57:15.470
And, just to remind you
what it says, right?
00:57:15.470 --> 00:57:21.130
It says that with high probability,
uniformly over hypothesis
00:57:21.130 --> 00:57:27.390
set capital H, generalization
error of any hypothesis.
00:57:27.390 --> 00:57:34.170
little h, in that set of hypotheses
capital H is going to be bounded by
00:57:35.340 --> 00:57:41.060
q weighted empirical error of
that hypothesis on the sample.
00:57:42.546 --> 00:57:47.973
Plus the empirical discrepancy,
this discrepancy delta s,
00:57:47.973 --> 00:57:52.915
that we hope to be small,
and a term that captures
00:57:52.915 --> 00:57:59.322
the complexity of this set of
the hypothesis that we are using.
00:57:59.322 --> 00:58:05.360
I would like to make two
important points about this bound.
00:58:05.360 --> 00:58:10.730
First of all,
we can extend this bound to hold
00:58:10.730 --> 00:58:16.259
uniformly over the weights q,
in some bounded set.
00:58:17.570 --> 00:58:22.390
And the second thing that
we kinda tried to
00:58:22.390 --> 00:58:26.330
emphasize also before the break
in the theory section is that
00:58:26.330 --> 00:58:31.030
these guarantees are data dependent,
right?
00:58:31.030 --> 00:58:36.780
And the reason these two points are
important is because they directly
00:58:36.780 --> 00:58:42.560
motivate and help you derive some
algorithmic solution to the problem
00:58:42.560 --> 00:58:47.970
of forecasting
non-stationary time series.
00:58:47.970 --> 00:58:54.070
And the key idea here is to take
this bound that you can actually for
00:58:54.070 --> 00:58:58.750
each hypothesis compute on the data,
right?
00:58:58.750 --> 00:59:06.580
And try to seek a hypothesis h in
the set of hypotheses capital H,
00:59:07.710 --> 00:59:12.220
and the weights q over
the sample that would
00:59:12.220 --> 00:59:16.800
directly minimize the learning
bound on the previous slide.
00:59:18.850 --> 00:59:22.880
So this is a very general
idea of the algorithm, but
00:59:22.880 --> 00:59:28.412
in the rest of this presentation,
I'll try to argue that actually for
00:59:28.412 --> 00:59:33.853
certain very broad families of
hypothesis sets and loss functions,
00:59:33.853 --> 00:59:38.954
you can solve these kinds of
optimization problems efficiently.
00:59:38.954 --> 00:59:43.976
And the first set of hypothesis
that you may think about and
00:59:43.976 --> 00:59:46.744
the first set of hypothesis that
00:59:46.744 --> 00:59:52.150
you may be interested in is
the set of linear hypotheses.
00:59:52.150 --> 00:59:57.051
So we'll consider a set of linear
hypotheses with squared loss.
00:59:57.051 --> 01:00:02.252
Just to remind you,
the squared loss is L of y.
01:00:02.252 --> 01:00:06.000
y prime is defined to be
y- y prime squared, right?
01:00:06.000 --> 01:00:11.511
Something that is very natural for
the problems of forecasting.
01:00:11.511 --> 01:00:18.339
And linear hypothesis or
more general kernel-based hypothesis
01:00:18.339 --> 01:00:24.803
associated with PDS kernel K
are defined as follows, right?
01:00:24.803 --> 01:00:31.913
We take a feature map that is
induced by this PDS kernel, right?
01:00:31.913 --> 01:00:35.955
Or simply you can think maybe
of linear kernel, right,
01:00:35.955 --> 01:00:39.610
in that case,
this is just an identity map.
01:00:39.610 --> 01:00:43.899
We map our observations
X by this map,
01:00:43.899 --> 01:00:48.530
phi K to some Hilbert space, right?
01:00:48.530 --> 01:00:53.553
And our hypothesis is going to
be simply a linear function,
01:00:53.553 --> 01:00:56.676
associated to some weight vector w.
01:00:56.676 --> 01:01:01.848
And it just takes the dot
product of that weight vector w
01:01:01.848 --> 01:01:08.154
with the representation of our
observation in the Hilbert space.
01:01:09.260 --> 01:01:14.610
And we'll control
the complexity of this
01:01:16.410 --> 01:01:21.500
linear hypothesis by restricting
the norm of the weight vector
01:01:21.500 --> 01:01:24.165
using some parameter,
capital lambda.
01:01:25.250 --> 01:01:28.834
So this is a very standard set of
hypothesis that is used everywhere
01:01:28.834 --> 01:01:30.690
in machine learning.
01:01:30.690 --> 01:01:37.310
So, for this setting of time series
01:01:37.310 --> 01:01:42.930
sequential models, you can actually
upper-bound that very scary
01:01:42.930 --> 01:01:49.460
sequential covering number term that
appears in the learning guarantee,
01:01:49.460 --> 01:01:53.840
in terms of the parameters of
this hypothesis set and the data.
01:01:53.840 --> 01:01:57.290
In particular,
I give a bound here on this slide,
01:01:57.290 --> 01:02:02.480
that this complexity term
is bounded in terms of this
01:02:02.480 --> 01:02:07.900
parameter lambda, that controls
the norm of the linear hypothesis.
01:02:07.900 --> 01:02:14.790
And the radius of the PDS
kernel on the data,
01:02:14.790 --> 01:02:19.136
the supremum of the kernel (x,
01:02:19.136 --> 01:02:23.090
x) over all the x in the sample.
01:02:23.090 --> 01:02:26.610
We are also gonna make
another simplification.
01:02:26.610 --> 01:02:32.400
And instead of considering
the empirical discrepancy directly,
01:02:32.400 --> 01:02:36.850
we will look at what's known
as instantaneous discrepancy.
01:02:38.020 --> 01:02:44.575
And you can, in fact, upper-bound
the empirical discrepancies
01:02:44.575 --> 01:02:50.545
that appeared in our learning
guarantees by the sum of q_t d_t,
01:02:50.545 --> 01:02:54.758
where the sum goes from
t = 1 to capital T,
01:02:54.758 --> 01:02:58.738
plus a term that
captures the deviation of
01:02:58.738 --> 01:03:03.440
the weights q from uniform
weights in L1 norm.
01:03:04.820 --> 01:03:06.590
And these dts,
01:03:06.590 --> 01:03:11.230
in this bound, are what we call
instantaneous discrepancies.
01:03:12.420 --> 01:03:17.010
And they're exactly
the discrepancy between the most
01:03:17.010 --> 01:03:21.410
recent s observations that
we have in our sample,
01:03:21.410 --> 01:03:26.220
and an observation at time t,
01:03:26.220 --> 01:03:30.830
that's the definition of dt for
each time t.
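To make this concrete, here is a tiny numeric sketch (not from the slides; the discrepancy values d and the scaling constant lam are made up) of the quantity just described: the weighted sum of instantaneous discrepancies plus an L1 deviation-from-uniform term.

```python
def discrepancy_bound(q, d, lam):
    """Upper bound on the empirical discrepancy, as described:
    sum_t q_t * d_t, plus a term penalizing the deviation of the
    weights q from uniform weights in L1 norm.  `lam` is a
    hypothetical scaling constant, not the exact one from the slides."""
    T = len(q)
    weighted = sum(qt * dt for qt, dt in zip(q, d))
    deviation = sum(abs(qt - 1.0 / T) for qt in q)
    return weighted + lam * deviation

# With uniform weights the deviation term vanishes.
print(discrepancy_bound([0.25] * 4, [0.1, 0.2, 0.3, 0.4], lam=1.0))  # approximately 0.25
```

Putting all the mass on one observation, by contrast, pays the full L1 deviation penalty.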
01:03:30.830 --> 01:03:36.146
Again, I'm not gonna go through
the proof of this result here,
01:03:36.146 --> 01:03:40.390
but you're welcome to look
at it at your free time.
01:03:40.390 --> 01:03:43.895
And this is actually a very
simple result that just uses
01:03:43.895 --> 01:03:45.960
sub-additivity of supremum.
01:03:46.980 --> 01:03:52.070
But what is important to actually
discuss here is the computational
01:03:52.070 --> 01:03:55.460
aspects that are involved
in estimating discrepancy.
01:03:56.610 --> 01:03:59.830
Can we actually compute
these quantities?
01:03:59.830 --> 01:04:01.100
We said that statistically,
01:04:01.100 --> 01:04:04.979
we can estimate them but it doesn't
mean that they're computable.
01:04:06.150 --> 01:04:11.800
Well, if we look, actually,
at the definition of this, for
01:04:11.800 --> 01:04:16.790
instance, instantaneous discrepancy
dt, at least in the case of
01:04:16.790 --> 01:04:21.980
this kernel-based hypothesis
with squared loss then we'll
01:04:21.980 --> 01:04:27.050
arrive at the following, perhaps
also scary-looking expression.
01:04:27.050 --> 01:04:29.320
But let's take it apart, right?
01:04:29.320 --> 01:04:33.810
What this expression tells us
is that, to find instantaneous
01:04:33.810 --> 01:04:38.900
discrepancy we need to solve a
certain optimization problem, right?
01:04:38.900 --> 01:04:42.956
And the objective of this
optimization problem that we need to
01:04:42.956 --> 01:04:45.142
solve consists of the two terms.
01:04:45.142 --> 01:04:49.599
In fact, it's a difference
of two convex functions,
01:04:49.599 --> 01:04:54.451
which means the optimization
problem that we're going to
01:04:54.451 --> 01:04:57.940
be solving is a DC
programming problem.
01:04:57.940 --> 01:05:01.450
And we can use
existing machinery for
01:05:01.450 --> 01:05:06.125
solving these kind of DC programming
01:05:06.125 --> 01:05:12.160
problems to find what
discrepancy dt is.
01:05:13.860 --> 01:05:18.945
But what is even more important,
that in this special case of squared
01:05:18.945 --> 01:05:24.850
loss, one can actually use
special DC algorithms,
01:05:24.850 --> 01:05:28.380
difference of convex
function algorithms,
01:05:28.380 --> 01:05:32.710
that would allow you to
recover a globally optimal
01:05:32.710 --> 01:05:36.860
solution to this non-convex
optimization problem.
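As a toy illustration of the DC machinery just mentioned (a generic DCA iteration on a made-up one-dimensional objective, not the actual discrepancy problem): write f = g - h with g, h convex, linearize h at the current iterate, and minimize the resulting convex surrogate.

```python
def dca_minimize(x0, steps=20):
    """DC algorithm sketch for f(x) = g(x) - h(x) with g(x) = x**2
    and h(x) = 2*|x|, both convex.  At each step we take a
    subgradient s of h at the current point and minimize the convex
    surrogate g(x) - s*x in closed form.  This is only a toy
    one-dimensional example of the technique, chosen so that the
    surrogate minimizer is explicit."""
    x = x0
    for _ in range(steps):
        s = 2.0 if x >= 0 else -2.0      # subgradient of h(x) = 2|x|
        x = s / 2.0                      # argmin of x**2 - s*x
    return x

print(dca_minimize(0.3))   # 1.0, a global minimizer of x**2 - 2|x|
```

Here every iterate lands on one of the two global minimizers x = +/-1; in general DCA only guarantees convergence to a critical point, which is why the squared-loss special case with global guarantees mentioned next is notable.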
01:05:36.860 --> 01:05:40.750
And this is important because
it allows us to efficiently and
01:05:40.750 --> 01:05:45.810
accurately compute discrepancies
that are gonna be critical for
01:05:45.810 --> 01:05:47.419
our algorithms, as you will see.
01:05:49.060 --> 01:05:52.650
So, okay,
let us put all of this together.
01:05:52.650 --> 01:05:56.230
So first of all,
if we put all of this together for
01:05:56.230 --> 01:06:01.070
the special case of the kernel
hypothesis with squared loss
01:06:01.070 --> 01:06:03.770
function, we have this
learning guarantee.
01:06:03.770 --> 01:06:10.270
Which is almost the same as what
we've seen before, except we
01:06:10.270 --> 01:06:16.100
replace the complexity
term with this
01:06:16.100 --> 01:06:20.480
other upper bound that I
have discussed earlier.
01:06:20.480 --> 01:06:26.161
And we also have an additional
term in the bound
01:06:26.161 --> 01:06:30.190
that comes from the fact that now
01:06:30.190 --> 01:06:34.460
this bound holds uniformly
over all the weights q.
01:06:36.560 --> 01:06:43.760
But this bound, now, can be computed
efficiently on the data, because it's a
01:06:43.760 --> 01:06:47.140
data-dependent learning guarantee,
and all the terms are computable.
01:06:47.140 --> 01:06:50.820
So there is a corresponding
optimization problem that
01:06:50.820 --> 01:06:53.450
needs to be solved in order to
01:06:53.450 --> 01:06:56.690
find the hypothesis that is going
to be used for forecasting.
01:06:57.810 --> 01:07:01.900
And the objective in
this optimization problem
01:07:01.900 --> 01:07:04.350
is going to consist of
01:07:05.870 --> 01:07:08.660
the following four terms that
directly come from the upper bound.
01:07:08.660 --> 01:07:13.390
So the first term in
the objective is your empirical
01:07:13.390 --> 01:07:16.490
q weighted L2 loss, right?
01:07:16.490 --> 01:07:17.180
Very natural.
01:07:18.430 --> 01:07:23.570
The second term is
the q weighted sum
01:07:23.570 --> 01:07:28.570
of the instantaneous discrepancies,
that we know how to compute.
01:07:31.000 --> 01:07:35.750
The third term in the objective
is the regularization term for
01:07:35.750 --> 01:07:38.900
the linear hypothesis
that we're going to use.
01:07:38.900 --> 01:07:41.210
And it directly comes, actually,
01:07:41.210 --> 01:07:44.960
from the complexity term
in the bound, right?
01:07:44.960 --> 01:07:51.480
And the last term controls
the deviations of the vector q
01:07:52.900 --> 01:07:56.190
from a uniform distribution.
01:07:56.190 --> 01:07:59.930
And it serves as
a regularization for
01:07:59.930 --> 01:08:05.003
this parameter that
we're optimizing, right?
01:08:05.003 --> 01:08:10.526
Here also, lambda 1, lambda 2,
and lambda 3 are just
01:08:10.526 --> 01:08:16.410
hyper-parameters that you
set via cross-validation.
01:08:16.410 --> 01:08:23.190
Note that this optimization problem
is joint over both q and w.
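A minimal sketch of the four-term objective just listed, for linear hypotheses with squared loss (the exact placement of the hyperparameters lam1, lam2, lam3 on each term is my assumption):

```python
def objective(w, q, X, y, d, lam1, lam2, lam3):
    """Joint objective in (w, q) with the four terms described:
      1) q-weighted empirical squared error,
      2) q-weighted sum of instantaneous discrepancies d_t,
      3) L2 regularization of the weight vector w,
      4) L1 deviation of q from the uniform distribution.
    lam1, lam2, lam3 are hyperparameters (set via cross-validation)."""
    T = len(y)
    pred = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
    loss = sum(qt * (p - yt) ** 2 for qt, p, yt in zip(q, pred, y))
    disc = lam1 * sum(qt * dt for qt, dt in zip(q, d))
    reg_w = lam2 * sum(wi ** 2 for wi in w)
    reg_q = lam3 * sum(abs(qt - 1.0 / T) for qt in q)
    return loss + disc + reg_w + reg_q

# Perfect fit, zero discrepancies: only the w-regularizer remains.
print(objective([1.0], [0.5, 0.5], [[1.0], [2.0]], [1.0, 2.0],
                [0.0, 0.0], 1.0, 0.1, 1.0))  # 0.1
```

Note that this form is not convex jointly in (w, q), which is exactly the concern raised next.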
01:08:24.440 --> 01:08:29.022
And, at this moment you may be
thinking, wait a second, is it okay,
01:08:29.022 --> 01:08:33.630
is the optimization problem
easy to solve, is it convex?
01:08:33.630 --> 01:08:38.510
And in fact, that's a very
valid concern; in fact,
01:08:38.510 --> 01:08:43.230
this formulation of the optimization
problem is not convex.
01:08:43.230 --> 01:08:45.970
But you can alter this
01:08:45.970 --> 01:08:49.750
optimization a little bit to
arrive at a convex formulation.
01:08:50.810 --> 01:08:55.000
In particular, what you can do is
you can change the variables and
01:08:55.000 --> 01:08:59.030
set rt to be 1/qt.
01:08:59.030 --> 01:09:04.820
And then if you further upper
bound the regularization term for
01:09:04.820 --> 01:09:08.790
the q's that appeared in
the previous objective,
01:09:08.790 --> 01:09:12.250
then you arrive at
this new formulation.
01:09:12.250 --> 01:09:14.960
And it's almost exactly the same as
the previous formulation,
01:09:14.960 --> 01:09:20.390
we still have the first term,
which is a weighted
01:09:20.390 --> 01:09:24.780
empirical error, but
now the weights are 1/rt.
01:09:24.780 --> 01:09:29.696
And this is, I remind you,
a convex function in w and r,
01:09:29.696 --> 01:09:34.000
because this is a square
over a linear function.
01:09:35.760 --> 01:09:40.614
Then, the second term is
this weighted instantaneous
01:09:40.614 --> 01:09:45.640
discrepancy term, but
now the weights are 1/rt.
01:09:45.640 --> 01:09:50.260
Again, this is going to be a convex
function for the same reason.
01:09:50.260 --> 01:09:54.647
And the last two terms,
these are regularization terms,
01:09:54.647 --> 01:09:56.356
L2 norm and L1 norm.
01:09:56.356 --> 01:09:59.365
And they are still
going to be convex.
01:09:59.365 --> 01:10:04.170
So, overall we get a convex
formulation for our problem.
01:10:04.170 --> 01:10:05.651
So that's nice.
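A quick numerical sanity check of the convexity claim (a one-dimensional sketch, not the full objective): each quadratic-over-linear term (w*x - y)^2 / r is jointly convex in (w, r) for r > 0, which a midpoint test illustrates.

```python
def term(w, r, x, yt):
    """Quadratic-over-linear term ((w*x - y)^2) / r, jointly convex
    in (w, r) for r > 0 -- the key fact behind the change of
    variables r_t = 1/q_t (one-dimensional sketch)."""
    return (w * x - yt) ** 2 / r

# Midpoint convexity check along a segment between two (w, r) points.
a, b = (0.0, 1.0), (2.0, 3.0)
mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
lhs = term(mid[0], mid[1], x=1.0, yt=1.0)
rhs = 0.5 * term(a[0], a[1], 1.0, 1.0) + 0.5 * term(b[0], b[1], 1.0, 1.0)
print(lhs <= rhs + 1e-12)  # True
```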
01:10:05.651 --> 01:10:11.680
Let me also mention that there
is another way to obtain sort
01:10:11.680 --> 01:10:19.380
of simpler algorithm that does not
learn q's and w's together, right.
01:10:19.380 --> 01:10:22.710
But rather, does it in
a two stage procedure, and
01:10:22.710 --> 01:10:25.130
it's also going to be convex.
01:10:25.130 --> 01:10:31.050
And the way this algorithm operates
is that you first just minimize
01:10:31.050 --> 01:10:35.350
your empirical
discrepancy on the data; this way
01:10:36.760 --> 01:10:41.220
you would obtain a vector q, right, and
01:10:41.220 --> 01:10:44.260
this also can be shown
to be a convex problem.
01:10:44.260 --> 01:10:48.950
And once you have done that,
you take this vector q and you solve
01:10:48.950 --> 01:10:54.230
a weighted version of
a kernel-ridge regression problem.
01:10:54.230 --> 01:10:57.250
Which is again,
a standard convex problem.
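A sketch of that second stage under simplifying assumptions (linear kernel, one-dimensional inputs, made-up weights q; the first stage that produces q is not shown): the weighted ridge regression then has a closed-form scalar solution.

```python
def weighted_ridge_1d(x, y, q, lam):
    """Weighted (kernel) ridge regression sketch with a linear kernel
    and one-dimensional inputs: the solution is the scalar w
    minimizing  sum_t q_t * (w*x_t - y_t)**2 + lam * w**2,
    namely  w = sum_t q_t x_t y_t / (sum_t q_t x_t**2 + lam)."""
    num = sum(qt * xt * yt for qt, xt, yt in zip(q, x, y))
    den = sum(qt * xt * xt for qt, xt in zip(q, x)) + lam
    return num / den

# q puts all its mass on the more recent points.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
print(weighted_ridge_1d(x, y, q=[0.0, 0.5, 0.5], lam=0.0))  # 2.0
```

With a general PDS kernel the same stage is the usual weighted kernel ridge regression linear system, still a standard convex problem.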
01:10:59.440 --> 01:11:03.180
These are the algorithmic
ideas that you can use for
01:11:03.180 --> 01:11:06.200
forecasting non-stationary
time series.
01:11:06.200 --> 01:11:11.150
And we actually implemented some of
these and played around with them
01:11:11.150 --> 01:11:16.600
both on some synthetic data and
some real data sets,
01:11:16.600 --> 01:11:20.700
and I will now describe some of
the experiments that we've done.
01:11:20.700 --> 01:11:25.000
And let's start with just some
very simple synthetic examples.
01:11:25.000 --> 01:11:30.540
Just to get some intuition about
how these things work in practice.
01:11:30.540 --> 01:11:35.170
So on this slide, I plot for you
a very simple stochastic process,
01:11:35.170 --> 01:11:40.790
which is basically autoregression
of order one, with a slight twist.
01:11:40.790 --> 01:11:46.320
So for the first thousand
observations, I set the parameter,
01:11:46.320 --> 01:11:49.852
the autoregressive parameter,
in this model to be 0.9.
01:11:49.852 --> 01:11:52.540
All right,
the autoregressive coefficient.
01:11:52.540 --> 01:11:57.185
And then, for
the next thousand of observations,
01:11:57.185 --> 01:12:01.403
I switch the parameter
to be negative 0.9.
01:12:01.403 --> 01:12:04.677
And then I switch this
parameter back for
01:12:04.677 --> 01:12:08.934
the last thousand
observations to be 0.9 again.
01:12:08.934 --> 01:12:13.485
And so this way I
introduce this nontrivial,
01:12:13.485 --> 01:12:17.940
nonstationarity in
the stochastic process.
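A sketch of generating such a process (the noise scale and seed are my choices; the talk only specifies the coefficients and block length):

```python
import random

def switching_ar1(coeffs=(0.9, -0.9, 0.9), block=1000, sigma=1.0, seed=0):
    """Generate the synthetic process described above: an AR(1) series
    y_t = a * y_{t-1} + noise, whose autoregressive coefficient a
    switches between blocks of observations (0.9, then -0.9, then 0.9
    again).  sigma and seed are hypothetical choices."""
    rng = random.Random(seed)
    y, prev = [], 0.0
    for a in coeffs:
        for _ in range(block):
            prev = a * prev + rng.gauss(0.0, sigma)
            y.append(prev)
    return y

series = switching_ar1()
print(len(series))  # 3000
```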
01:12:17.940 --> 01:12:23.370
And it's interesting to see how
the estimation of discrepancy
01:12:23.370 --> 01:12:26.659
behaves in this kind of scenarios
and how our algorithms work.
01:12:28.150 --> 01:12:31.380
And what we can do, we can look,
actually, at the plots of
01:12:31.380 --> 01:12:36.250
the discrepancies that would be
estimated for this kind of process.
01:12:36.250 --> 01:12:41.530
And you can see they're plotted
on the left of this slide.
01:12:41.530 --> 01:12:45.530
In red, you see the estimated
discrepancies, and
01:12:45.530 --> 01:12:47.640
in green,
this is the true discrepancy.
01:12:47.640 --> 01:12:51.780
And you can see the estimation
is actually rather noisy, right.
01:12:51.780 --> 01:12:55.250
But at least it gets
the shape right.
01:12:55.250 --> 01:12:58.970
You can see that in the region
01:12:58.970 --> 01:13:03.620
between T equals 1,000 and
T equals 2,000, right.
01:13:03.620 --> 01:13:06.850
Where I had this switch.
01:13:06.850 --> 01:13:11.080
At least the average discrepancy
is much higher than in
01:13:11.080 --> 01:13:11.849
the other regions.
01:13:13.880 --> 01:13:16.660
But what's even more
interesting is that these
01:13:16.660 --> 01:13:19.770
algorithms can actually
be quite robust
01:13:21.190 --> 01:13:26.580
to these errors in the estimation
of the discrepancies themselves.
01:13:26.580 --> 01:13:32.450
So if we actually look at
the weights q's that are learned
01:13:32.450 --> 01:13:38.140
by these algorithms, and they are
plotted on the right of this slide.
01:13:38.140 --> 01:13:43.330
And you'll see that in red we have
the weights that were learned for
01:13:43.330 --> 01:13:46.270
estimated discrepancies and
in green we have
01:13:46.270 --> 01:13:49.380
the weights that have been learned
for the true discrepancies.
01:13:49.380 --> 01:13:52.855
You can see that in both cases,
01:13:52.855 --> 01:13:57.802
the algorithm completely
avoids the region
01:13:57.802 --> 01:14:03.160
between t equals 1000 and
t equals 2000.
01:14:03.160 --> 01:14:08.452
Which is exactly what we would
desire, because in that region,
01:14:08.452 --> 01:14:14.136
the distribution of the process is
completely irrelevant to the most
01:14:14.136 --> 01:14:20.021
recent observations which would be
useful for forecasting the next point.
01:14:21.924 --> 01:14:25.960
And just to wrap this section up.
01:14:25.960 --> 01:14:31.550
Here are some further
examples of the error plots
01:14:31.550 --> 01:14:34.780
for this artificial synthetic data.
01:14:36.080 --> 01:14:39.680
What we see here is running
mean squared errors, so
01:14:39.680 --> 01:14:44.640
what we do is we train our
three algorithms, which
01:14:44.640 --> 01:14:49.600
are ARIMA, the discrepancy-based
forecasting with true discrepancies.
01:14:49.600 --> 01:14:54.030
And discrepancy-based forecasting
with estimated discrepancies.
01:14:55.300 --> 01:14:59.440
And once we trained them
on the first 500 points,
01:15:00.490 --> 01:15:04.489
we forecast the next
25 points I believe.
01:15:06.010 --> 01:15:08.070
Then we record their error,
01:15:08.070 --> 01:15:11.990
we put these new 25 points
in the training set.
01:15:11.990 --> 01:15:15.850
We're gonna train with
this new training set.
01:15:15.850 --> 01:15:20.350
We forecast the next 25 points and
we proceed in this online fashion
01:15:20.350 --> 01:15:23.850
and then we have this running
plot of mean squared errors.
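A sketch of that rolling evaluation protocol (with a plain least-squares AR(1) fit standing in for the actual algorithms, which are not reproduced here):

```python
def fit_ar1(y):
    """Least-squares AR(1) coefficient for series y (a stand-in model,
    not the discrepancy-based algorithm from the talk)."""
    num = sum(y[t - 1] * y[t] for t in range(1, len(y)))
    den = sum(v * v for v in y[:-1]) or 1.0
    return num / den

def rolling_mse(y, train=500, horizon=25):
    """Rolling-origin evaluation as described: train on the prefix,
    forecast the next `horizon` points by iterating the fitted AR(1),
    record squared errors, absorb the new points into the training
    set, and repeat; return the overall mean squared error."""
    errs, t = [], train
    while t + horizon <= len(y):
        a = fit_ar1(y[:t])
        pred, last = [], y[t - 1]
        for _ in range(horizon):
            last = a * last
            pred.append(last)
        errs.extend((p - v) ** 2 for p, v in zip(pred, y[t:t + horizon]))
        t += horizon
    return sum(errs) / len(errs)

# A constant series is forecast perfectly (coefficient 1, zero error).
print(rolling_mse([1.0] * 600, train=500, horizon=25))  # 0.0
```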
01:15:23.850 --> 01:15:28.380
And as you can see, before we
encounter any non-stationarity
01:15:28.380 --> 01:15:32.020
in the data, these algorithms
are sort of behaving about the same.
01:15:32.020 --> 01:15:35.600
But then once you kind of
hit this non-stationarity.
01:15:37.630 --> 01:15:38.750
As you would expect,
01:15:38.750 --> 01:15:42.650
the error in an ARIMA model
that is not designed to handle
01:15:42.650 --> 01:15:47.000
this kind of non-stationarity can blow
up, but these algorithms kind of
01:15:47.000 --> 01:15:51.350
remain robust to this non-stationary
change in the distribution.
01:15:52.650 --> 01:15:58.610
So, I should mention that
we also did some experiments with
01:15:58.610 --> 01:16:05.130
some real world data sets and there
are some promising results there but
01:16:05.130 --> 01:16:09.760
I see that I am running
out of time actually and
01:16:09.760 --> 01:16:14.550
instead what I'm gonna do rather
than talking about these I'm gonna
01:16:14.550 --> 01:16:18.815
invite and he's gonna talk.
01:16:18.815 --> 01:16:24.729
>> [INAUDIBLE]
>> Yes, yes, sure.
01:16:26.050 --> 01:16:26.550
Thank you.
01:16:27.740 --> 01:16:32.650
Well, since I have some more time,
we did some experiments.
01:16:35.690 --> 01:16:40.440
Some data sets that we
found on the internet.
01:16:40.440 --> 01:16:45.870
These are commodity prices for
different commodities,
01:16:45.870 --> 01:16:50.730
exchange rates, climate and
temperature forecasting.
01:16:50.730 --> 01:16:55.880
And as you can see from these
results, we compared ourselves
01:16:55.880 --> 01:16:59.790
against a very standard baseline
that people would first compare
01:16:59.790 --> 01:17:04.839
themselves to, ARIMA models, and
at least against those models.
01:17:06.310 --> 01:17:09.640
We never perform worse
than ARIMA models.
01:17:09.640 --> 01:17:16.090
And in I think five out of eight
datasets here we do better.
01:17:17.580 --> 01:17:21.250
So this is the algorithmic section.
01:17:21.250 --> 01:17:26.562
And next what we're going to talk
about is the connection between
01:17:26.562 --> 01:17:31.681
sort of time series prediction and
this new hot area that people
01:17:31.681 --> 01:17:36.237
are very interested in these days,
that of online learning.
01:17:41.530 --> 01:17:43.423
>> Okay.
01:17:43.423 --> 01:17:49.005
So we switch gears and
try now to, after this
01:17:49.005 --> 01:17:54.815
theoretical section and then
the algorithms and the experiments,
01:17:54.815 --> 01:17:58.925
to talk about a connection with
online learning which, as I said,
01:17:58.925 --> 01:18:02.522
is probably one of
the most important and
01:18:02.522 --> 01:18:06.460
fast growing areas of research and
machine learning.
01:18:07.500 --> 01:18:12.500
Let me start by saying that there are
really two learning scenarios for
01:18:12.500 --> 01:18:15.080
tackling time series.
01:18:15.080 --> 01:18:19.680
The first one is probably the one
that we favored in our presentation
01:18:19.680 --> 01:18:20.780
up to now.
01:18:20.780 --> 01:18:23.700
And that is that of
the stochastic scenario,
01:18:23.700 --> 01:18:28.380
which assumes that there is
a distribution underlying the data.
01:18:28.380 --> 01:18:34.380
And there the performance measure
is in terms of the expected loss,
01:18:34.380 --> 01:18:38.391
or for us this path
dependent expected loss.
01:18:38.391 --> 01:18:43.100
And the learning guarantees are
generalization bounds such as those
01:18:43.100 --> 01:18:45.220
that I presented previously.
01:18:46.250 --> 01:18:52.400
The on-line learning scenario, in
a way, is a much broader scenario.
01:18:52.400 --> 01:18:55.730
It is one where no distributional
assumption is made.
01:18:56.970 --> 01:18:59.590
The setup is adversarial.
01:18:59.590 --> 01:19:03.800
The performance measure is
based on a notion of regret.
01:19:03.800 --> 01:19:07.670
And the guarantees are,
therefore, bounds on that regret.
01:19:07.670 --> 01:19:10.840
This is a very,
very active research area.
01:19:10.840 --> 01:19:16.780
We've just listed here
some papers and names.
01:19:16.780 --> 01:19:22.510
Clearly, this is not covering
even a small part of online
01:19:22.510 --> 01:19:27.470
learning; you would definitely wish
to consult the survey book
01:19:27.470 --> 01:19:32.060
if you're interested
in this area.
01:19:32.060 --> 01:19:38.100
There is a very rich number
of papers related to this.
01:19:38.100 --> 01:19:44.100
We are favoring here
some papers by or
01:19:44.100 --> 01:19:49.510
One reason for that is that there is
a rich family of online learning
01:19:49.510 --> 01:19:52.020
algorithms, such as algorithms for tracking.
01:19:52.020 --> 01:19:57.177
For example, that could be relevant
for time series, etc, etc.
01:19:57.177 --> 01:20:02.145
So, since I'm not sure how much
the audience here is familiar with
01:20:02.145 --> 01:20:06.847
online learning, I'm going to
give you probably the shortest
01:20:06.847 --> 01:20:10.500
tutorial that is possible for
online learning.
01:20:10.500 --> 01:20:15.221
I have a four slide tutorial and
then we'll see how we could maybe
01:20:15.221 --> 01:20:18.610
use that to connect
it with time series.
01:20:18.610 --> 01:20:21.970
And solve problems in time series
that we don't really know how to
01:20:21.970 --> 01:20:24.470
tackle otherwise, okay?
01:20:24.470 --> 01:20:27.500
So let me start with the setup
of on-line learning.
01:20:28.730 --> 01:20:32.778
In the on-line learning setup,
it's an adversarial
01:20:32.778 --> 01:20:36.830
setting with a hypothesis set or
action set capital H.
01:20:38.260 --> 01:20:45.010
This goes over capital T rounds,
at each round little t the player or
01:20:45.010 --> 01:20:51.990
the learner receives a point
x sub t drawn from capital X.
01:20:51.990 --> 01:20:59.199
He selects in response, the
hypothesis h sub t out of capital H.
01:20:59.199 --> 01:21:03.990
The adversary selects
some label Y sub t.
01:21:03.990 --> 01:21:10.295
And the player suffers or incurs
the loss of ht of Xt, the prediction
01:21:10.295 --> 01:21:16.900
that is, its divergence with respect
to the true label Y sub t.
01:21:16.900 --> 01:21:19.440
So the goal of the learner or
01:21:19.440 --> 01:21:24.240
the player is to minimize
his cumulative loss.
01:21:24.240 --> 01:21:28.720
This is the first term in
the definition of the regret
01:21:28.720 --> 01:21:31.305
indicated here at
the bottom of the slide.
01:21:31.305 --> 01:21:34.800
Now the cumulative loss is sum of
the losses suffered over capital T
01:21:34.800 --> 01:21:41.845
rounds, minus the best
cumulative loss in hindsight.
01:21:41.845 --> 01:21:46.099
That is, looking back at the data, the loss
suffered by the best
01:21:46.099 --> 01:21:50.469
hypothesis little h taken
out of the family h star.
01:21:50.469 --> 01:21:53.393
And the family h star might
not always be capital H,
01:21:53.393 --> 01:21:56.357
might not coincide with
that necessarily, okay?
01:21:56.357 --> 01:22:00.110
So that's the notion of regret,
which is how much
01:22:00.110 --> 01:22:02.850
you will differ with
respect to that reference.
01:22:04.270 --> 01:22:07.990
Here's an example of
an on-line learning algorithm
01:22:07.990 --> 01:22:12.400
that's a very standard one, based on
exponential weights or exponentiated
01:22:12.400 --> 01:22:16.490
weighted average algorithms,
as people would also refer to it.
01:22:16.490 --> 01:22:22.470
Which is a case where you would
seek to learn with expert advice.
01:22:22.470 --> 01:22:28.880
So let capital H star be
a family of capital N experts.
01:22:28.880 --> 01:22:31.120
E1, E2, EN.
01:22:31.120 --> 01:22:35.940
And denote by capital H,
the convex hull of H star.
01:22:35.940 --> 01:22:41.240
That means the convex combinations
of such expert functions.
01:22:42.320 --> 01:22:44.660
The algorithm is quite simple.
01:22:44.660 --> 01:22:48.150
It's maintaining some weights.
01:22:48.150 --> 01:22:53.240
W1, W sub capital N for the experts.
01:22:53.240 --> 01:22:57.010
And so these are indexed by time,
at the beginning,
01:22:57.010 --> 01:23:02.550
this is w1 sub i for
expert i at time one.
01:23:02.550 --> 01:23:04.800
At the beginning,
they're all equal to one.
01:23:04.800 --> 01:23:09.520
And then at each round,
the learner receives xt,
01:23:09.520 --> 01:23:14.510
he predicts according to this
average of the functions
01:23:14.510 --> 01:23:19.720
of the experts and the averaging
is done using these weights.
01:23:20.960 --> 01:23:25.938
He receives true label yt,
incurs the loss that is based on this
01:23:25.938 --> 01:23:29.520
function, this averaging
function h of t.
01:23:29.520 --> 01:23:34.711
And he updates the weights in
this exponential fashion which
01:23:34.711 --> 01:23:40.002
consist of multiplying the weight
at time T by the exponential
01:23:40.002 --> 01:23:44.902
of minus eta, the loss suffered
by that expert i, okay?
01:23:44.902 --> 01:23:48.270
So the weight of the expert
i is updated in that fashion.
01:23:48.270 --> 01:23:50.131
So if the loss is large,
01:23:50.131 --> 01:23:54.888
the weight is significantly
decreased for example, okay?
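The update just described can be sketched as follows (the squared loss and the learning rate value are hypothetical choices; the talk leaves the loss generic):

```python
import math

def ew_average(expert_preds, labels, eta=0.5,
               loss=lambda p, y: (p - y) ** 2):
    """Exponentially weighted average forecaster as described:
    maintain one weight per expert (all initialized to 1), predict
    with the weighted average of the experts' predictions, then
    multiply each expert's weight by exp(-eta * that expert's loss).
    Returns the cumulative loss and the final weights."""
    n = len(expert_preds[0])
    w = [1.0] * n
    total = 0.0
    for preds, y in zip(expert_preds, labels):
        z = sum(w)
        p = sum(wi * pi for wi, pi in zip(w, preds)) / z  # weighted average
        total += loss(p, y)
        w = [wi * math.exp(-eta * loss(pi, y)) for wi, pi in zip(w, preds)]
    return total, w

# Expert 0 always predicts the truth; its weight comes to dominate.
preds = [(y, 0.0) for y in [1.0, 1.0, 1.0, 1.0]]
total, w = ew_average(preds, [1.0, 1.0, 1.0, 1.0])
print(w[0] > w[1])  # True
```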
01:23:54.888 --> 01:23:59.410
It's a very simple algorithm, but
even though it's a very simple
01:23:59.410 --> 01:24:05.240
algorithm, it leads to a strong
guarantee which is the following.
01:24:05.240 --> 01:24:08.600
If the loss function L is
convex in its first argument,
01:24:08.600 --> 01:24:11.280
which would be the case for
a variety of cases in practice.
01:24:11.280 --> 01:24:13.860
Let's say for
example squared error or
01:24:13.860 --> 01:24:16.630
other loss functions of that kind.
01:24:16.630 --> 01:24:20.760
And if it's a bounded loss,
taking values in [0, 1].
01:24:20.760 --> 01:24:24.430
Then for any sequence y1, ..., yT.
01:24:24.430 --> 01:24:26.860
This is very strong because it's
01:24:26.860 --> 01:24:30.620
not making any stochastic
assumption about that sequence.
01:24:30.620 --> 01:24:35.420
The regret of this
algorithm at time capital T
01:24:35.420 --> 01:24:39.250
is upper bounded in the way
that is indicated here.
01:24:39.250 --> 01:24:41.120
Log N over eta plus eta T.
01:24:41.120 --> 01:24:44.205
And if you choose
eta in the best way,
01:24:44.205 --> 01:24:48.500
eta is this learning rate that
appeared in the exponential.
01:24:48.500 --> 01:24:52.434
It says that the regret is
upper bounded by roughly,
01:24:52.434 --> 01:24:54.980
the square root of T times log N.
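Written out, with the constant 1/8 that the standard Hoeffding-based analysis gives (the talk elides constants):

```latex
R_T \;\le\; \frac{\log N}{\eta} + \frac{\eta T}{8},
\qquad
\eta^\* = \sqrt{\frac{8\log N}{T}}
\;\Longrightarrow\;
R_T \;\le\; \sqrt{\frac{T\log N}{2}},
\qquad
\frac{R_T}{T} \;\le\; \sqrt{\frac{\log N}{2T}} .
```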
01:24:56.110 --> 01:24:59.680
So to understand what this means,
it simply means that
01:25:00.770 --> 01:25:05.280
if you choose a relatively
large set of experts.
01:25:05.280 --> 01:25:08.440
So if your capital N
is relatively large,
01:25:08.440 --> 01:25:13.070
then this would give you
an opportunity for probably having
01:25:13.070 --> 01:25:18.450
a best expert in hindsight
that has a small loss.
01:25:18.450 --> 01:25:23.060
On the other hand, you pay for this
in terms of this upper bound log N.
01:25:23.060 --> 01:25:28.430
But the price to pay is only
logarithmic, so it's very favorable.
01:25:28.430 --> 01:25:31.600
The regret is perhaps not even
the term that you would really
01:25:31.600 --> 01:25:32.650
be interested in.
01:25:32.650 --> 01:25:36.620
It's even more precisely the regret
per round, so that means when you're
01:25:36.620 --> 01:25:40.350
dividing the regret by capital T,
the average regret.
01:25:40.350 --> 01:25:44.260
And here you see then at the bottom
of the slide that the average regret
01:25:44.260 --> 01:25:48.562
is decreasing as essentially
1 over square root of T.
01:25:48.562 --> 01:25:52.900
Okay, so these are the types of
guarantees that you would see for
01:25:52.900 --> 01:25:55.780
a variety of algorithms
in online learning.
01:25:55.780 --> 01:25:59.660
Again, very general set up,
very general guarantees.
01:25:59.660 --> 01:26:05.105
The proofs for these sort of
guarantees go as follows.
01:26:05.105 --> 01:26:07.292
You define a potential, here,
01:26:07.292 --> 01:26:12.740
the natural potential function is
the log of the sum of the weights.
01:26:12.740 --> 01:26:15.190
Then essentially,
you prove an upper bound for
01:26:15.190 --> 01:26:18.310
this potential function and
the lower bound.
01:26:18.310 --> 01:26:21.540
You compare those two and
that gives you the regret bound,
01:26:21.540 --> 01:26:26.070
here to derive that upper bound,
you start by analyzing
01:26:26.070 --> 01:26:29.940
the difference of the potential
at time T and time T minus 1.
01:26:29.940 --> 01:26:35.890
It gives you the log of a ratio of two sums of
weights; the denominator is the previous
01:26:35.890 --> 01:26:40.810
weights, the numerator is the new weights,
which are exponentially updated.
01:26:40.810 --> 01:26:45.000
And that you can relate, in fact,
to an expectation in terms of the
01:26:45.000 --> 01:26:48.060
weights; you apply
Hoeffding's inequality and
01:26:48.060 --> 01:26:51.800
that gives you the type of
upper bound at the bottom.
01:26:51.800 --> 01:26:55.820
You sum up these differences and,
using telescoping sums,
01:26:55.820 --> 01:27:00.297
that gives you immediately
an upper bound on Phi T minus Phi 0.
01:27:00.297 --> 01:27:05.111
You can trivially come up with
a lower bound by saying that the sum
01:27:05.111 --> 01:27:08.170
can be lower bounded by a max.
01:27:08.170 --> 01:27:11.110
And that is the two lines that
are mentioned in the middle of
01:27:11.110 --> 01:27:12.150
this slide.
01:27:12.150 --> 01:27:16.350
Comparing these quantities
immediately gives you the bound.
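As a concrete illustration of the exponential update and the regret guarantee described above, here is a minimal sketch in Python of the exponentially weighted average forecaster. The loss values, horizon, and learning rate are illustrative choices, not from the talk; the check at the end is the standard sqrt((T/2) log N) regret bound for losses in [0, 1].

```python
import math

def exp_weighted_average(losses, eta):
    """Exponentially weighted average forecaster.

    losses: list of rounds, each a list of N expert losses in [0, 1].
    Returns the algorithm's cumulative (mixture) loss."""
    n = len(losses[0])
    log_w = [0.0] * n          # log-weights, for numerical stability
    total = 0.0
    for round_losses in losses:
        # normalize the weights into a probability vector p
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        s = sum(w)
        p = [wi / s for wi in w]
        # suffer the expected (mixture) loss this round
        total += sum(pi * li for pi, li in zip(p, round_losses))
        # exponential update of the weights
        log_w = [lw - eta * li for lw, li in zip(log_w, round_losses)]
    return total

T, N = 500, 8
# deterministic illustrative losses in [0, 1]
losses = [[((t * 31 + i * 17) % 100) / 99.0 for i in range(N)]
          for t in range(T)]
eta = math.sqrt(8 * math.log(N) / T)        # tuned learning rate
alg_loss = exp_weighted_average(losses, eta)
best_loss = min(sum(losses[t][i] for t in range(T)) for i in range(N))
regret = alg_loss - best_loss
bound = math.sqrt(T * math.log(N) / 2)      # sqrt((T/2) log N)
```

Dividing `regret` and `bound` by T gives the average regret, which decreases as 1 over square root of T, as stated on the slide.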
01:27:16.350 --> 01:27:22.360
Okay, so this is just a very quick
tour through online learning,
01:27:22.360 --> 01:27:24.910
probably the shortest
tutorial on online learning you have
01:27:24.910 --> 01:27:28.370
ever heard, and probably
ever will.
01:27:29.950 --> 01:27:33.820
This raises though
a whole set of questions.
01:27:33.820 --> 01:27:36.880
Can we exploit both
the batch scenario,
01:27:36.880 --> 01:27:41.690
the stochastic scenario, and the
online learning scenario to come up
01:27:41.690 --> 01:27:47.800
with flexible algorithms for
tackling time series prediction?
01:27:47.800 --> 01:27:52.610
One reason for that is that the online
learning setup, as I indicated,
01:27:52.610 --> 01:27:54.570
is in fact very rich.
01:27:54.570 --> 01:27:58.530
I mentioned this way of measuring
the regret with respect to the best
01:27:58.530 --> 01:28:03.200
expert in hindsight, just to
tell you a few words about this.
01:28:03.200 --> 01:28:06.160
You can extend this to the case
where in fact you would not
01:28:06.160 --> 01:28:09.140
just be comparing yourself to
the best expert in hindsight.
01:28:09.140 --> 01:28:13.100
But to the best expert that
is varying over time,
01:28:13.100 --> 01:28:16.030
the case of the shifting best expert.
01:28:16.030 --> 01:28:20.380
The types of algorithms for
01:28:20.380 --> 01:28:23.850
these sorts of scenarios
have already been analyzed.
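One classic algorithm for this shifting-expert setting, not spelled out on the slide, is the Fixed-Share update of Herbster and Warmuth: the exponential update followed by redistributing a small fraction of the mass uniformly, so that no expert's weight ever collapses. A minimal sketch, with illustrative values for the learning rate eta and switching rate alpha:

```python
import math

def fixed_share_update(weights, losses, eta, alpha):
    """One round of the Fixed-Share update for tracking a shifting
    best expert: exponential loss update, then mix a fraction alpha
    of the probability mass uniformly across the n experts."""
    n = len(weights)
    # exponential update, then renormalize
    v = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    s = sum(v)
    v = [vi / s for vi in v]
    # share step: redistribute alpha of the mass uniformly
    return [(1 - alpha) * vi + alpha / n for vi in v]

# illustrative run: four experts, two rounds of losses in [0, 1]
p = [0.25, 0.25, 0.25, 0.25]
for losses in [[0.0, 1.0, 1.0, 1.0], [1.0, 0.0, 1.0, 1.0]]:
    p = fixed_share_update(p, losses, eta=0.5, alpha=0.05)
```

The share step guarantees every expert keeps weight at least alpha over n, which is what allows the algorithm to switch quickly when the best expert changes.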
01:28:23.850 --> 01:28:27.550
These are a rich family of
algorithms that you can pick from.
01:28:27.550 --> 01:28:30.920
They're all simple to implement.
01:28:30.920 --> 01:28:33.790
They're based on
simple updates such as
01:28:33.790 --> 01:28:37.380
the exponential update
that I indicated before.
01:28:37.380 --> 01:28:41.520
So they form a really easy
toolbox that you may wish to use.
01:28:41.520 --> 01:28:44.740
So can we use these algorithms to
come up with good solutions in
01:28:44.740 --> 01:28:47.140
this stochastic case for
time series.
01:28:47.140 --> 01:28:52.040
Another crucial, and it turns out
very useful,
01:28:52.040 --> 01:28:55.930
question that comes up in this
scenario is that of: can we tackle
01:28:55.930 --> 01:29:00.230
some really, really difficult time
series problems that up to now
01:29:00.230 --> 01:29:03.629
we have in a way been sweeping
under the carpet?
01:29:04.750 --> 01:29:09.570
Which are for example,
model selection in time series or
01:29:09.570 --> 01:29:12.920
learning ensembles in time
series prediction, which
01:29:12.920 --> 01:29:15.660
is a crucial problem
again in time series.
01:29:15.660 --> 01:29:19.140
So let me say a few words
about model selection and
01:29:19.140 --> 01:29:20.260
learning ensembles and
01:29:20.260 --> 01:29:23.980
then see how we could gradually
tackle these sort of things.
01:29:23.980 --> 01:29:28.480
Model selection is this
general problem of course that
01:29:28.480 --> 01:29:31.840
we have in machine learning,
which consists of, say:
01:29:31.840 --> 01:29:37.370
you have capital N models,
your time series models.
01:29:37.370 --> 01:29:42.871
The question is,
how can we use a sample Z1, ...,
01:29:42.871 --> 01:29:47.388
ZT to find the best model
to choose among these models?
01:29:47.388 --> 01:29:51.577
In the i.i.d. case,
this is a familiar problem,
01:29:51.577 --> 01:29:55.467
where we have theory such
as structural risk
01:29:55.467 --> 01:29:59.381
minimization theory
about model selection.
01:29:59.381 --> 01:30:04.828
And we know that those
guarantees can be approximated
01:30:04.828 --> 01:30:09.192
by using cross-validation
techniques.
01:30:09.192 --> 01:30:13.546
Reserving some part of
the data for validation and using
01:30:13.546 --> 01:30:18.500
that to pick the best parameter,
for example, but how do we do that?
01:30:18.500 --> 01:30:20.320
How do we do validation sets?
01:30:20.320 --> 01:30:22.210
Come up with a validation set for
01:30:22.210 --> 01:30:24.850
general stochastic
processes in time series?
01:30:26.160 --> 01:30:27.860
How should the validation
set be chosen?
01:30:29.510 --> 01:30:33.965
If we choose the last few points, we
are discarding everything else that
01:30:33.965 --> 01:30:38.136
we have seen before that could
have been really, really relevant.
01:30:38.136 --> 01:30:39.860
If we choose the first few points,
01:30:39.860 --> 01:30:42.657
then we may be losing more
critical information because
01:30:42.657 --> 01:30:46.240
that's really information that would
be close to the predicted time.
01:30:47.600 --> 01:30:51.570
What else should we do, split the
data in various different ways and
01:30:51.570 --> 01:30:53.770
choose those as validation set?
01:30:53.770 --> 01:30:55.940
It's just really not clear
what we should be doing.
01:30:57.460 --> 01:31:00.240
By the way, if you hear people
talking about time series and
01:31:00.240 --> 01:31:04.260
saying, we're doing model selection
in some way, cross-validation or
01:31:04.260 --> 01:31:08.600
something, ask them,
what is it that you do?
01:31:08.600 --> 01:31:13.020
Because I bet you there is no
real principle behind it, okay?
01:31:15.400 --> 01:31:18.420
Notice that the problem can be
complicated in the sense that,
01:31:18.420 --> 01:31:21.350
of course, these models have
probably been pre-trained
01:31:21.350 --> 01:31:23.770
on that same training sample, okay?
01:31:25.980 --> 01:31:30.001
It's another interesting problem
that comes up if you wish to do
01:31:30.001 --> 01:31:31.498
accurate prediction.
01:31:31.498 --> 01:31:35.935
You want accurate prediction in any
kind of context, whether that is for
01:31:35.935 --> 01:31:39.682
earthquakes as I heard a little
bit from some people here,
01:31:39.682 --> 01:31:44.041
whether it's for predicting the
level of battery for your system,
01:31:44.041 --> 01:31:47.735
or, I don't know,
making predictions in Wall Street.
01:31:47.735 --> 01:31:52.860
That is the problem where you imagine
that you already have a family of
01:31:52.860 --> 01:31:57.810
good predictors,
a family of good forecasters.
01:31:57.810 --> 01:32:02.850
Can you use them to
come up with a good,
01:32:02.850 --> 01:32:06.740
perhaps a better and more accurate
solution by combining them?
01:32:06.740 --> 01:32:09.610
By taking a convex
combination of them,
01:32:09.610 --> 01:32:14.150
a sum of q_t h_t, if the h_t's
are your forecasters.
01:32:15.490 --> 01:32:18.380
Again here it's not a trivial
problem because these
01:32:18.380 --> 01:32:21.600
forecasters, the h_t's, may
have even been pre-trained, or
01:32:21.600 --> 01:32:24.010
they could be the result
of training on that sample.
01:32:25.220 --> 01:32:29.030
It turns out that both this model
selection problem that I mentioned
01:32:29.030 --> 01:32:34.070
and the problem of learning
ensembles can both be
01:32:34.070 --> 01:32:38.880
tackled and addressed by
coming up with a solution to
01:32:38.880 --> 01:32:44.091
the so-called problem of
online-to-batch conversion.
01:32:44.091 --> 01:32:48.885
Online-to-batch is
a familiar term and problem for
01:32:48.885 --> 01:32:51.945
those of you who are familiar
with online learning.
01:32:51.945 --> 01:32:54.815
But here it's a much
more difficult problem,
01:32:54.815 --> 01:32:59.825
because we're dealing with
non-stationary, non-mixing processes
01:32:59.825 --> 01:33:02.755
that are quite different
from the i.i.d. case.
01:33:02.755 --> 01:33:05.900
So let me just first define
what an online-to-batch
01:33:05.900 --> 01:33:09.150
problem consists of and
how it can be tackled.
01:33:09.150 --> 01:33:14.340
But before even doing so, and
just to make sure that I'm not gonna
01:33:14.340 --> 01:33:17.990
run out of time for saying that,
there is going to be a time series
01:33:17.990 --> 01:33:23.300
workshop here at NIPS, and
I invite you to attend.
01:33:23.300 --> 01:33:29.150
This is also co-organized by Vitaly
and several other people, and
01:33:29.150 --> 01:33:33.270
I will be describing in
much more detail some of
01:33:33.270 --> 01:33:37.940
the solutions here and connections
between online learning and
01:33:37.940 --> 01:33:40.560
the stochastic case
in that workshop.
01:33:41.605 --> 01:33:44.006
So, let me go back to
the online-to-batch problem.
01:33:44.006 --> 01:33:47.405
The online-to-batch problem
is a very natural problem.
01:33:49.195 --> 01:33:52.025
Let's denote by bold h
01:33:52.025 --> 01:33:58.045
the sequence of hypotheses that an
online learning algorithm has been
01:33:58.045 --> 01:34:02.729
returning or outputting over
the course of capital T rounds.
01:34:05.340 --> 01:34:15.520
So the problem is to come
up with a hypothesis.
01:34:15.520 --> 01:34:17.210
Okay, so I'm a little bit
confused here, okay, so
01:34:17.210 --> 01:34:18.960
this is what the input
is to the problem.
01:34:18.960 --> 01:34:21.330
That is,
that's the input to the problem.
01:34:21.330 --> 01:34:25.330
And the problem is to come
up with a hypothesis.
01:34:25.330 --> 01:34:32.484
Little h in capital H,
that has a small path-dependent
01:34:32.484 --> 01:34:37.900
expected loss,
using these hypotheses h1,
01:34:37.900 --> 01:34:42.810
h2, ..., hT, that the online learning
algorithm has been producing.
01:34:42.810 --> 01:34:45.810
How do we use those
01:34:45.810 --> 01:34:48.990
hypotheses that the online
learning algorithm has produced?
01:34:48.990 --> 01:34:52.850
Use them to come up with
the hypothesis that now has a good
01:34:52.850 --> 01:34:54.740
path dependent expected loss.
01:34:54.740 --> 01:34:57.368
This is no longer
a matter of regret, okay.
01:34:57.368 --> 01:35:00.880
In the i.i.d. setup,
01:35:00.880 --> 01:35:05.420
this problem is well known and
has good solutions.
01:35:05.420 --> 01:35:08.640
But again,
how do we design solutions here for
01:35:08.640 --> 01:35:11.120
this general time series scenario?
01:35:12.470 --> 01:35:16.685
One question is: is online-to-batch
with general non-stationary,
01:35:16.685 --> 01:35:20.130
non-mixing stochastic
processes even possible?
01:35:20.130 --> 01:35:22.780
Can we design algorithms
with guarantees?
01:35:22.780 --> 01:35:25.070
And here again we need a new tool;
01:35:25.070 --> 01:35:29.680
that tool is, not surprisingly,
similar to what we discussed before.
01:35:29.680 --> 01:35:33.410
It is, as it should be,
a discrepancy tool.
01:35:33.410 --> 01:35:37.060
But there is a new twist to it,
because here
01:35:37.060 --> 01:35:39.630
we're not dealing with
arbitrary hypotheses.
01:35:39.630 --> 01:35:42.140
We're dealing with hypotheses that
01:35:42.140 --> 01:35:45.800
are output by the online
learning algorithm.
01:35:45.800 --> 01:35:49.470
So if you look at the same key
difference, that key difference now
01:35:49.470 --> 01:35:50.290
is for
01:35:50.290 --> 01:35:55.250
a particular h sub t that is output
by the online learning algorithm.
01:35:55.250 --> 01:35:58.560
Okay, and so now when we
are taking the average,
01:35:58.560 --> 01:36:01.060
we're not gonna be taking
the average over
01:36:01.060 --> 01:36:03.480
a fixed h sub t,
01:36:03.480 --> 01:36:07.450
But the h sub t that the online
learning algorithm has been
01:36:07.450 --> 01:36:10.710
producing at each
timestamp little t.
01:36:10.710 --> 01:36:14.020
And that's what the quantity that
is in the bottom of the slide
01:36:14.020 --> 01:36:14.820
indicates.
01:36:14.820 --> 01:36:16.219
Now that's the average
difference here.
01:36:17.600 --> 01:36:18.370
That's the difference.
01:36:18.370 --> 01:36:20.520
You see the H of T varies
with the time here.
01:36:22.000 --> 01:36:26.150
That leads to a notion of
discrepancy, again starting from
01:36:26.150 --> 01:36:27.180
first principles, right?
01:36:28.330 --> 01:36:30.830
That is similar to
what we saw before.
01:36:32.360 --> 01:36:35.930
It's a supremum now over
the choice of some hypothesis.
01:36:35.930 --> 01:36:38.110
But it's a supremum
over the bold h.
01:36:38.110 --> 01:36:41.300
That means the supremum
over that sequence, h1, h2,
01:36:41.300 --> 01:36:45.670
h sub capital T,
of this weighted difference.
01:36:45.670 --> 01:36:49.220
Again here,
the weights q sub t are crucial.
01:36:49.220 --> 01:36:53.260
They're going to serve, in
the same way as I indicated before,
01:36:53.260 --> 01:36:59.080
to, in short, make use of
the fact that some hypotheses,
01:36:59.080 --> 01:37:03.010
h sub t at time t, are much
more useful to my prediction
01:37:03.010 --> 01:37:05.880
at time t plus
one than others, and
01:37:05.880 --> 01:37:08.720
the weights are gonna help me
emphasize or deemphasize them.
01:37:10.260 --> 01:37:13.050
So this is a very natural
measure of non-stationarity or
01:37:13.050 --> 01:37:15.910
dependency as in
the stochastic case.
01:37:15.910 --> 01:37:17.170
As in this stochastic case,
01:37:17.170 --> 01:37:20.840
it captures the hypothesis set and
loss function.
01:37:20.840 --> 01:37:23.180
And as in the stochastic case,
01:37:23.180 --> 01:37:26.330
it can be estimated under
some mild assumptions.
01:37:26.330 --> 01:37:29.980
In fact, pretty much doing exactly
the same as what I indicated before.
01:37:29.980 --> 01:37:33.060
But you'll see that there's
even another way of
01:37:33.060 --> 01:37:37.520
estimating the online discrepancy.
01:37:37.520 --> 01:37:39.180
That is quite interesting.
01:37:39.180 --> 01:37:41.740
So, I will briefly touch on that.
01:37:41.740 --> 01:37:46.030
And this discrepancy definition is
a generalization of the type of
01:37:46.030 --> 01:37:47.980
discrepancy that I
introduced before.
01:37:47.980 --> 01:37:52.920
So quickly speaking,
we could use the same methods for
01:37:52.920 --> 01:37:56.160
estimating the discrepancy
as indicated before
01:37:56.160 --> 01:37:58.220
for the stochastic case.
01:37:58.220 --> 01:38:00.810
But here there is
an alternative method
01:38:00.810 --> 01:38:04.340
that is very close to
the online learning view.
01:38:04.340 --> 01:38:07.820
Suppose that the loss
function is mu-Lipschitz.
01:38:07.820 --> 01:38:11.870
And suppose that there exists
actually a hypothesis h star
01:38:11.870 --> 01:38:14.689
that has a small expected loss.
01:38:15.750 --> 01:38:17.330
At time t plus 1.
01:38:17.330 --> 01:38:20.030
That's the conditional expectation.
01:38:21.210 --> 01:38:24.970
Let's denote by eta that expression.
01:38:24.970 --> 01:38:26.600
Disregard the little q here.
01:38:26.600 --> 01:38:27.660
It should not be here, by the way.
01:38:27.660 --> 01:38:29.090
It's not eta of q, it's just eta.
01:38:30.470 --> 01:38:34.490
So we can assume that there is a
very good solution at time t plus 1.
01:38:34.490 --> 01:38:39.790
That means that that eta, the best
expected loss, is actually small.
01:38:41.400 --> 01:38:46.690
If that is the case then you can
actually show that the discrepancy,
01:38:46.690 --> 01:38:51.910
the left hand side in this bound,
can be upper bounded by a term,
01:38:51.910 --> 01:38:57.310
the hat term, the discrepancy hat
term, the empirical one, plus
01:38:57.310 --> 01:39:02.140
mu times that eta,
plus a term that is
01:39:02.140 --> 01:39:05.690
small, as I indicated before, tied to
alpha, and a complexity term.
01:39:05.690 --> 01:39:09.870
So definitely the discrepancy
can be estimated here
01:39:09.870 --> 01:39:13.940
essentially by making use of
the fact that there would exist
01:39:13.940 --> 01:39:17.730
a good hypothesis; it exists,
but I don't need to know it,
01:39:17.730 --> 01:39:21.840
that good hypothesis that
has a small error here.
01:39:21.840 --> 01:39:26.850
Okay, so showing this turns
out to be actually simple.
01:39:26.850 --> 01:39:29.230
You start with the definition
of discrepancy.
01:39:29.230 --> 01:39:34.410
You break it into two parts to make
the discrepancy expression appear;
01:39:34.410 --> 01:39:35.850
you use the
subadditivity of the supremum
01:39:35.850 --> 01:39:38.570
to break it into two.
01:39:38.570 --> 01:39:41.972
You use the Lipschitzness of the loss, and
01:39:41.972 --> 01:39:46.747
then just the fact that
the q t's sum to 1, okay?
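To make the estimation idea concrete, here is a rough sketch in Python of how one might form an empirical proxy for the online discrepancy over a finite set of candidate hypotheses. This is a simplified, hypothetical construction, not the talk's exact estimator: it uses a single hypothesis in place of the supremum over sequences bold h, and a trailing window of observations standing in for the unobservable distribution at time T plus 1.

```python
def empirical_online_discrepancy(loss_matrix, q, window):
    """Simplified empirical proxy for the online discrepancy.

    loss_matrix[t][h]: loss of candidate hypothesis h on observation Z_t.
    q: weights q_1..q_T over the rounds, summing to 1.
    window: number of trailing observations standing in for the
    distribution at time T + 1, which we cannot observe."""
    T = len(loss_matrix)
    n_hyp = len(loss_matrix[0])
    best = float("-inf")
    for h in range(n_hyp):
        # average loss on the most recent `window` points ...
        recent = sum(loss_matrix[t][h]
                     for t in range(T - window, T)) / window
        # ... minus the q-weighted average loss over the whole sample
        weighted = sum(q[t] * loss_matrix[t][h] for t in range(T))
        best = max(best, recent - weighted)
    return best

T = 10
# two toy candidates: one with constant loss,
# one whose loss jumps upward halfway through (non-stationarity)
loss_matrix = [[0.5, 0.0 if t < 5 else 1.0] for t in range(T)]
q = [1.0 / T] * T                      # uniform weights
d = empirical_online_discrepancy(loss_matrix, q, window=2)
```

The drifting candidate makes the proxy positive here, illustrating how the quantity picks up non-stationarity that a stationary sample would not show.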
01:39:49.089 --> 01:39:52.554
Now, it turns out that
the online-to-batch
01:39:52.554 --> 01:39:56.873
problem here admits a very
nice solution as well.
01:39:56.873 --> 01:40:00.118
If the loss function L
is a convex loss and
01:40:00.118 --> 01:40:04.318
is upper bounded by capital M,
then simply choosing
01:40:04.318 --> 01:40:09.282
the hypothesis that is the average
of the hypotheses returned
01:40:09.282 --> 01:40:13.015
by the online learning
algorithm, sum of q_t h_t.
01:40:13.015 --> 01:40:19.640
That sum admits a very good
guarantee in the stochastic setting.
01:40:19.640 --> 01:40:26.670
That means, with high probability,
its path-dependent expected loss is
01:40:26.670 --> 01:40:32.300
upper bounded by an average of
the training sample losses,
01:40:32.300 --> 01:40:38.720
the sum of qt L(ht,
Zt) plus the online discrepancy,
01:40:38.720 --> 01:40:43.490
plus a term that depends on
the upper bound of the loss and
01:40:43.490 --> 01:40:47.240
the norm of the weight vector, which
again, you can think of as being
01:40:47.240 --> 01:40:49.090
essentially close to 1
over square root of T.
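The conversion step itself is simple to sketch. Here is a minimal illustration in Python, with hypotheses represented as weight vectors of linear predictors (an illustrative choice, not from the talk); the check at the end is Jensen's inequality, which is what makes the weighted average work for a convex loss.

```python
def online_to_batch(hypotheses, q):
    """Combine the hypotheses h_1..h_T produced by an online learner
    into the single weighted-average hypothesis h = sum_t q_t * h_t.
    Hypotheses are weight vectors of linear predictors here."""
    dim = len(hypotheses[0])
    return [sum(qt * h[d] for qt, h in zip(q, hypotheses))
            for d in range(dim)]

def sq_loss(w, x, y):
    """Squared loss of linear predictor w on example (x, y)."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - y) ** 2

# toy sequence of hypotheses, as an online learner might produce
hyps = [[0.0, 0.0], [0.5, 0.1], [0.9, 0.2], [1.0, 0.3]]
q = [0.1, 0.2, 0.3, 0.4]                 # weights summing to 1
h_bar = online_to_batch(hyps, q)

# For a convex loss, Jensen's inequality gives
#   L(h_bar, z) <= sum_t q_t L(h_t, z)
x, y = [1.0, -2.0], 0.7
lhs = sq_loss(h_bar, x, y)
rhs = sum(qt * sq_loss(h, x, y) for qt, h in zip(q, hyps))
```

Jensen's inequality is the reason convexity of the loss appears in the statement: the averaged hypothesis can only do as well as or better than the weighted average of the individual losses, pointwise.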
01:40:51.210 --> 01:40:53.780
So this is a strong
learning guarantee for
01:40:53.780 --> 01:40:57.800
online-to-batch conversion because it tells
you that it is possible even in
01:40:57.800 --> 01:41:02.670
this scenario to combine hypotheses
to come up with a good accurate one.
01:41:02.670 --> 01:41:07.646
I'm not gonna go through the proof,
but it's just really two,
01:41:07.646 --> 01:41:08.763
three lines.
01:41:08.763 --> 01:41:13.249
What is even stronger is
the following guarantee,
01:41:13.249 --> 01:41:17.316
again in the case of
the convex loss function,
01:41:17.316 --> 01:41:21.071
which says that for
the same hypothesis h,
01:41:21.071 --> 01:41:26.185
the path-dependent expected
loss can be upper bounded by
01:41:26.185 --> 01:41:31.189
the loss of the best expert,
or the best sequence, H*,
01:41:31.189 --> 01:41:36.613
plus the discrepancy plus a term
that is the regret per round,
01:41:36.613 --> 01:41:41.724
which we showed can be upper
bounded in many cases in online
01:41:41.724 --> 01:41:46.480
learning by a term of the form
1 over square root of T.
01:41:46.480 --> 01:41:50.580
And other terms that you
saw appearing already in
01:41:50.580 --> 01:41:53.480
the analysis that Vitaly showed,
01:41:53.480 --> 01:41:58.780
in terms of the weights that
are used in the stochastic case.
01:41:58.780 --> 01:41:59.550
Okay.
01:41:59.550 --> 01:42:04.210
So the high level idea here
is that online learning
01:42:04.210 --> 01:42:08.920
can be used to produce a whole
sequence of experts, h1, h2, ..., hT.
01:42:08.920 --> 01:42:12.500
Those experts can be combined
using some weights, Q1,
01:42:12.500 --> 01:42:17.920
QT, to come up with a very strong
hypothesis in the stochastic case.
01:42:17.920 --> 01:42:21.870
The guarantee for
that hypothesis is given here,
01:42:21.870 --> 01:42:25.150
same story as before,
we can choose those weights Q1,
01:42:25.150 --> 01:42:29.470
QT in the best way to
minimize that upper bound.
01:42:29.470 --> 01:42:33.390
These are things I will be able to
describe in more detail if you
01:42:33.390 --> 01:42:39.120
attend the talk I will be
giving in the workshop.
01:42:39.120 --> 01:42:40.980
I'm going to stop here and
01:42:40.980 --> 01:42:44.460
conclude with the following
remarks about time series.
01:42:44.460 --> 01:42:47.750
Time series forecasting
is a key learning problem
01:42:47.750 --> 01:42:49.630
in a variety of important tasks,
01:42:49.630 --> 01:42:52.830
many of which I mentioned
in the beginning.
01:42:53.830 --> 01:42:57.550
Most of the data that
we're facing today
01:42:57.550 --> 01:42:59.760
are in fact sequential in nature.
01:42:59.760 --> 01:43:05.410
So time series would be the natural
way of formulating the problems.
01:43:05.410 --> 01:43:09.240
At the same time, time series prediction
in the most general case
01:43:09.240 --> 01:43:13.130
is a challenging problem, both from
the theoretical point of view and
01:43:13.130 --> 01:43:14.650
the algorithmic point of view,
01:43:14.650 --> 01:43:18.310
if you want to come up with good
solutions, accurate solutions and
01:43:18.310 --> 01:43:21.740
solutions for which you
can give guarantees.
01:43:21.740 --> 01:43:22.240
Okay.
01:43:24.060 --> 01:43:25.160
This is, of course,
01:43:25.160 --> 01:43:28.650
a very challenging problem
also in applications.
01:43:28.650 --> 01:43:31.134
And you know that in
a variety of cases,
01:43:31.134 --> 01:43:34.524
people would want to have
very accurate solutions for
01:43:34.524 --> 01:43:37.856
a variety of tasks of this kind for
sequential data.
01:43:37.856 --> 01:43:43.028
What Vitaly and I presented is
a series of learning guarantees,
01:43:43.028 --> 01:43:46.287
data dependent learning
guarantees for
01:43:46.287 --> 01:43:50.310
the very general case of
time series prediction,
01:43:50.310 --> 01:43:54.557
non-stationary, non-mixing,
no assumption.
01:43:54.557 --> 01:43:58.907
The notions of discrepancy that you
heard multiple times in this talk
01:43:58.907 --> 01:44:02.957
are really really crucial for
tackling this problem, both from
01:44:02.957 --> 01:44:07.170
the theoretical point of view and
the algorithmic point of view.
01:44:07.170 --> 01:44:10.530
We discussed algorithms,
guarantees, and
01:44:10.530 --> 01:44:14.310
showed also that these algorithms
in practice can be effective.
01:44:15.470 --> 01:44:22.770
The combination of online learning
and batch, or stochastic prediction,
01:44:22.770 --> 01:44:28.620
leads to a whole other set of very,
very interesting solutions.
01:44:28.620 --> 01:44:33.270
I tried to say a few words about
them to sort of tell you how
01:44:33.270 --> 01:44:34.660
this could be done.
01:44:34.660 --> 01:44:35.620
In particular,
01:44:35.620 --> 01:44:40.050
this can be used to solve problems
that are really difficult.
01:44:40.050 --> 01:44:42.200
We don't really know how to
tackle them otherwise:
01:44:42.200 --> 01:44:45.210
model selection or
learning ensembles.
01:44:45.210 --> 01:44:48.350
Again, these are things that I
could talk about a lot more.
01:44:48.350 --> 01:44:53.680
And let me finish by further
advertising this workshop.
01:44:53.680 --> 01:44:55.660
It's Friday, and it's in Room 117. Okay.
01:44:55.660 --> 01:44:56.575
Thank you.
01:44:56.575 --> 01:45:04.730
[APPLAUSE]