WEBVTT
00:00:04.353 --> 00:00:08.541
[INAUDIBLE] done this work
in the last few years,
00:00:08.542 --> 00:00:12.625
we made reinforcement
learning more mature,
00:00:12.626 --> 00:00:17.451
closer to the real world and
[INAUDIBLE] applications.
00:00:17.452 --> 00:00:20.296
He has worked on many topics
related to that: on deep
00:00:20.297 --> 00:00:23.008
reinforcement learning,
on motor planning,
00:00:23.009 --> 00:00:25.989
on planning of trajectories
as rewards, and also on
00:00:25.990 --> 00:00:29.465
the fundamental algorithms
which [INAUDIBLE] talked about.
00:00:29.466 --> 00:00:33.569
And he's going to speak today
on all of these algorithms,
00:00:33.570 --> 00:00:38.275
on the policy gradient methods,
so thanks, John for [INAUDIBLE].
00:00:38.276 --> 00:00:40.560
>> Thanks for the very nice
introduction, Sebastian.
00:00:42.380 --> 00:00:46.060
So I wish I could be out there
in Cambridge with you and
00:00:46.061 --> 00:00:47.490
participate in
the summer school, but
00:00:47.491 --> 00:00:50.100
unfortunately, I'm not
very good with jet lag.
00:00:50.101 --> 00:00:52.500
So I decided to stay home for
this one.
00:00:54.330 --> 00:00:56.558
So this is gonna be
00:00:56.559 --> 00:01:00.660
a policy gradient
tutorial, which is
00:01:02.000 --> 00:01:05.630
closely related to the tutorial
that Pieter Abbeel and
00:01:05.631 --> 00:01:09.820
I gave at NIPS this year but
with some modifications.
00:01:10.860 --> 00:01:14.510
And it's actually gonna
be fairly basic and
00:01:14.511 --> 00:01:18.630
just introduce the basics
of policy gradients,
00:01:18.631 --> 00:01:21.090
evolution strategies,
natural policy gradient.
00:01:22.180 --> 00:01:27.071
But actually, I think the basics
go a really long way for
00:01:27.072 --> 00:01:32.495
these algorithms, so I hope
it's beneficial for all of you.
00:01:36.317 --> 00:01:39.037
So before I get into
the details of some of
00:01:39.038 --> 00:01:42.157
these algorithms that
I'm gonna talk about,
00:01:42.158 --> 00:01:45.517
I wanna just put them into
context a little bit with
00:01:45.518 --> 00:01:49.440
the other algorithms from
deep reinforcement learning.
00:01:52.480 --> 00:01:55.070
So deep reinforcement learning
is just reinforcement learning
00:01:55.071 --> 00:01:57.800
where you're using neural
networks as your function
00:01:57.801 --> 00:01:59.160
approximators.
00:01:59.161 --> 00:02:05.020
And one interesting thing is
that in reinforcement learning,
00:02:05.021 --> 00:02:07.340
there's several different
choices of what to approximate.
00:02:08.980 --> 00:02:11.800
This contrasts with, say,
supervised learning,
00:02:11.801 --> 00:02:14.760
where you just wanna
approximate your conditional
00:02:14.761 --> 00:02:18.320
probability distribution or
something like that.
00:02:18.321 --> 00:02:20.380
So in reinforcement learning,
00:02:20.381 --> 00:02:23.050
there's several
reasonable choices.
00:02:23.051 --> 00:02:25.530
You can try to approximate
a policy which is your function
00:02:25.531 --> 00:02:27.650
that chooses the next action.
00:02:27.651 --> 00:02:31.110
You could try to approximate
a value function,
00:02:31.111 --> 00:02:36.630
which is a function that tells
you how good a certain state is,
00:02:36.631 --> 00:02:38.728
or a certain
state-action pair is.
00:02:38.729 --> 00:02:41.070
Or you could try to learn
the dynamics model,
00:02:41.071 --> 00:02:42.620
which predicts
the next state and
00:02:42.621 --> 00:02:45.830
the reward, which you would then
plug into a planning algorithm.
00:02:47.630 --> 00:02:54.740
So here in this talk, I'm gonna
focus on learning policies.
00:02:54.741 --> 00:02:59.660
But I would say that there's
sort of a taxonomy of
00:02:59.661 --> 00:03:05.080
different methods that sort
of live in this space.
00:03:05.081 --> 00:03:10.430
So on the one hand, you can
directly optimize policies.
00:03:10.431 --> 00:03:16.050
And one way to do that is with
derivative-free optimization,
00:03:16.051 --> 00:03:17.820
also known as
evolutionary algorithms,
00:03:17.821 --> 00:03:22.790
where you just try to optimize
over the parameter vector
00:03:22.791 --> 00:03:26.010
without using anything
about the actual
00:03:26.011 --> 00:03:28.450
trajectories generated from
that parameter vector.
00:03:28.451 --> 00:03:32.190
So you're just treating the
whole environment as a black box
00:03:32.191 --> 00:03:35.490
where you put in a parameter
vector and you get a return.
00:03:36.820 --> 00:03:40.100
Then there are these policy
gradient methods, which actually
00:03:40.101 --> 00:03:42.730
take advantage of the temporal
structure of the problem and
00:03:42.731 --> 00:03:45.405
try to figure out which
are the good actions and
00:03:45.406 --> 00:03:46.735
which are the bad actions.
00:03:46.736 --> 00:03:49.190
Let's make the good
actions more probable.
00:03:49.191 --> 00:03:52.540
Let's adjust our actions so
that they're better.
00:03:52.541 --> 00:03:55.940
So that's sort of what we've got
on the policy optimization side.
00:03:55.941 --> 00:04:00.010
And dynamic programming is
another class of algorithms that
00:04:00.011 --> 00:04:06.460
are based on learning value
functions. So there's
00:04:06.461 --> 00:04:11.770
a set of classic algorithms
that will exactly solve an MDP
00:04:11.771 --> 00:04:16.180
where you have access to all
of the information about
00:04:16.181 --> 00:04:18.926
the transition probabilities and
the reward function.
00:04:18.927 --> 00:04:22.670
So there's these algorithms
that you might have learned
00:04:22.671 --> 00:04:26.258
about in an introductory course,
policy iteration and
00:04:26.259 --> 00:04:29.222
value iteration, which
will exactly solve for
00:04:29.223 --> 00:04:32.740
the optimal policy and
the value function for an MDP.
00:04:34.145 --> 00:04:38.099
And it's possible to come up
with approximate versions of
00:04:38.100 --> 00:04:39.476
these algorithms.
00:04:39.477 --> 00:04:44.098
So that's another class of
methods there, and Q-learning,
00:04:44.099 --> 00:04:48.627
or deep Q-learning,
is sort of
00:04:48.628 --> 00:04:52.087
an approximate version
of value iteration.
00:04:52.088 --> 00:04:56.093
But we're going to focus,
here in this talk,
00:04:56.094 --> 00:05:00.838
I'm gonna focus on the policy
optimization methods.
00:05:00.839 --> 00:05:05.505
Well first,
what's the difference between them,
00:05:05.506 --> 00:05:10.180
empirically and
also sort of conceptually?
00:05:11.510 --> 00:05:15.860
Well, one big
advantage of the policy
00:05:15.861 --> 00:05:18.740
optimization method is you're
optimizing the thing you
00:05:18.741 --> 00:05:21.700
care about which is the expected
return of the policy.
00:05:23.480 --> 00:05:25.000
In contrast,
00:05:25.001 --> 00:05:28.300
the dynamic programming
methods are more indirect.
00:05:28.301 --> 00:05:31.060
They try to exploit
the problem structure
00:05:31.061 --> 00:05:33.570
to learn a value function using
00:05:33.571 --> 00:05:38.350
a self-consistency property that
the value function should have.
00:05:38.351 --> 00:05:40.891
Specifically, that the value
function now should be
00:05:40.892 --> 00:05:43.661
consistent with the value
function of the next time step.
00:05:45.325 --> 00:05:49.988
So there's
00:05:49.989 --> 00:05:53.853
a slight mismatch between what's
in your loss function which is
00:05:53.854 --> 00:05:56.217
the [INAUDIBLE]
consistency condition, and
00:05:56.218 --> 00:06:00.350
what you actually care about
which is the expected return.
00:06:00.351 --> 00:06:04.160
But empirically what happens is
the policy optimization methods
00:06:04.161 --> 00:06:05.380
are a bit more versatile,
00:06:05.381 --> 00:06:09.260
meaning they tend to
work in more situations.
00:06:09.261 --> 00:06:10.630
Whereas the dynamic programming
00:06:10.631 --> 00:06:13.100
methods are more sample
efficient when they do work.
00:06:13.101 --> 00:06:18.108
This is just a very rough
assessment of the empirical
00:06:18.109 --> 00:06:21.580
results and the jury is
still out on which methods
00:06:21.581 --> 00:06:23.600
are going to work the best
in the long run, but
00:06:23.601 --> 00:06:26.160
this is just the current
state of things.
00:06:26.161 --> 00:06:29.880
And also, the policy
optimization methods are more
00:06:29.881 --> 00:06:34.040
compatible with richer
function approximators,
00:06:34.041 --> 00:06:36.896
such as recurrent
neural networks,
00:06:36.897 --> 00:06:40.690
and also architectures that
add auxiliary objectives,
00:06:40.691 --> 00:06:43.530
such as making predictions,
in addition to control.
00:06:44.810 --> 00:06:48.720
So that's the advantage of
the policy optimization methods,
00:06:48.721 --> 00:06:52.850
whereas dynamic programming
methods are more compatible with
00:06:52.851 --> 00:06:54.920
off policy learning.
00:06:54.921 --> 00:06:59.082
And in turn, that makes them
more compatible with exploration
00:06:59.083 --> 00:07:03.323
because to do exploration, you
often want to collect data
00:07:03.324 --> 00:07:07.091
from a policy other than
the one that's your current
00:07:07.092 --> 00:07:09.383
best guess at
the optimal policy.
00:07:12.391 --> 00:07:16.266
Okay, so that's just an overview
of where policy optimization
00:07:16.267 --> 00:07:18.930
methods sit relative
to the other methods.
00:07:20.500 --> 00:07:23.960
So now I'm gonna get
more into the details and
00:07:23.961 --> 00:07:26.380
set up what the problem
is more precisely.
00:07:27.430 --> 00:07:32.740
So we're gonna be concerned
with parametrized policies,
00:07:32.741 --> 00:07:37.230
and what that means is you
have a family of policies
00:07:37.231 --> 00:07:39.720
that's indexed with
some parameter vector,
00:07:39.721 --> 00:07:42.640
which in this case is just
a vector of real numbers.
00:07:42.641 --> 00:07:46.170
And this could either be a
deterministic policy, where the
00:07:47.280 --> 00:07:49.740
action is a deterministic
function of the state and
00:07:49.741 --> 00:07:52.860
the parameter vector or you
could have a stochastic policy
00:07:52.861 --> 00:07:56.810
where the action is
a conditional probability
00:07:56.811 --> 00:08:00.930
distribution that depends on the
state and the parameter vector.
00:08:03.060 --> 00:08:07.620
And this is totally analogous
to classification or regression,
00:08:07.621 --> 00:08:10.870
where the state is your input
and the action is your output.
00:08:10.871 --> 00:08:13.570
And you can use the exact same
kind of function approximators
00:08:13.571 --> 00:08:16.560
that you would use in
classification or regression.
00:08:16.561 --> 00:08:19.550
For example, if your
reinforcement learning problem
00:08:19.551 --> 00:08:23.410
has a discrete action space,
you can take a network
00:08:23.411 --> 00:08:26.160
architecture that would have
been used for classification,
00:08:26.161 --> 00:08:28.600
where your network outputs
action probabilities.
00:08:30.700 --> 00:08:33.611
Whereas if you have
a continuous action space,
00:08:33.612 --> 00:08:36.447
you can use a network that
you would have used for
00:08:36.448 --> 00:08:40.307
regression where the network can
be interpreted as outputting
00:08:40.308 --> 00:08:43.681
a mean and a diagonal
covariance matrix for a Gaussian.
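[Editor's note: the two cases just described can be sketched in code. This is a minimal illustration with a single linear layer standing in for the deep networks the talk has in mind; all names and shapes here are made up for the example.]

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())   # shift for numerical stability
    return z / z.sum()

def discrete_policy(theta, state, num_actions):
    """Discrete action space: a 'classification-style' net whose outputs
    are action probabilities. theta is the flattened weight matrix."""
    W = theta.reshape(num_actions, state.size)
    return softmax(W @ state)

def gaussian_policy(theta, state, action_dim, log_std):
    """Continuous action space: a 'regression-style' net whose output is
    the mean of a Gaussian; log_std gives the diagonal covariance."""
    W = theta.reshape(action_dim, state.size)
    mean = W @ state
    std = np.exp(log_std) * np.ones(action_dim)
    return mean, std
```

Sampling an action is then just drawing from the categorical or Gaussian distribution these functions return.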
00:08:49.025 --> 00:08:51.771
Also for the purposes of
this talk, we're going to be
00:08:51.772 --> 00:08:55.920
concerned with the episodic
reinforcement learning setting.
00:08:55.921 --> 00:08:57.890
There are actually several
different ways you can set
00:08:57.891 --> 00:08:59.500
up the optimization problem.
00:08:59.501 --> 00:09:02.212
When it comes down
to the algorithms,
00:09:02.213 --> 00:09:05.095
it actually doesn't
matter that much,
00:09:05.096 --> 00:09:08.500
since the algorithms
look basically the same.
00:09:08.501 --> 00:09:11.560
But we're gonna be
mostly concerned with
00:09:11.561 --> 00:09:15.710
what's called the episodic
setting where we have an agent
00:09:15.711 --> 00:09:19.200
whose experience is broken up
into a series of episodes.
00:09:19.201 --> 00:09:21.270
Each of which has a finite
number of time steps.
00:09:21.271 --> 00:09:24.890
So in each episode you sample
an initial state from some
00:09:24.891 --> 00:09:26.380
distribution mu.
00:09:26.381 --> 00:09:29.300
And the agent acts until
it reaches a certain
00:09:29.301 --> 00:09:30.100
terminal state.
00:09:31.380 --> 00:09:33.560
At which point the episode ends.
00:09:33.561 --> 00:09:36.220
Just to give you a few examples
of what an episode might look
00:09:36.221 --> 00:09:39.880
like, let's say you
have a taxi robot and
00:09:40.920 --> 00:09:43.500
the episode ends when it
reaches its destination.
00:09:43.501 --> 00:09:45.455
So then termination is good and
00:09:45.456 --> 00:09:48.847
we want the episode to
terminate as fast as possible.
00:09:48.848 --> 00:09:52.067
You might also have a situation
where the episode has a fixed
00:09:52.068 --> 00:09:55.549
length, and you're just trying
to accumulate as much reward as
00:09:55.550 --> 00:09:57.946
possible during that
fixed amount of time.
00:09:57.947 --> 00:10:00.989
So that would be like a waiter
robot, where
00:10:00.990 --> 00:10:03.890
the episode ends when it
finishes its shift, or
00:10:03.891 --> 00:10:06.295
you could have the reverse,
where
00:10:06.296 --> 00:10:09.550
you want the episode to
be as long as possible.
00:10:09.551 --> 00:10:10.890
Like you have
a walking robot and
00:10:10.891 --> 00:10:15.630
you want it to be able to
stay up as long as possible, or
00:10:15.631 --> 00:10:17.690
stay upright for as long as
possible, before falling over.
00:10:19.240 --> 00:10:23.840
So the goal in this setting
is to maximize the expected
00:10:23.841 --> 00:10:24.950
reward per episode.
00:10:24.951 --> 00:10:29.750
So we're gonna use capital
R to denote the return,
00:10:29.751 --> 00:10:33.450
which is to be distinguished
from the reward.
00:10:34.900 --> 00:10:39.030
Cuz the reward corresponds to
a single time step, while the return
00:10:39.031 --> 00:10:41.140
corresponds to a whole episode.
00:10:41.141 --> 00:10:45.060
So reward will be lowercase
r and return is capital R.
00:10:45.061 --> 00:10:49.572
And we're concerned with
maximizing the expected return,
00:10:49.573 --> 00:10:53.116
which is the sum of rewards
over the whole episode.
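[Editor's note: in symbols, the episodic objective just described is the following, restating the transcript's definitions; T here denotes the episode length, a notational assumption.]

```latex
% Maximize expected return; lowercase r_t is the per-step reward,
% capital R is the return of a whole episode.
\[
\operatorname*{maximize}_{\pi} \; \mathbb{E}[R],
\qquad
R \;=\; \sum_{t=0}^{T-1} r_t
\]
```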
00:10:55.851 --> 00:11:00.836
So there are various ways of
optimizing this expected return.
00:11:00.837 --> 00:11:01.573
And first,
00:11:01.574 --> 00:11:05.779
I'm gonna talk about what's sort
of conceptually the simplest way
00:11:05.780 --> 00:11:09.660
to do it which is with
Derivative Free Optimization.
00:11:09.661 --> 00:11:14.050
And the idea here is you have
your parameter vector which
00:11:14.051 --> 00:11:15.040
defines your policy.
00:11:16.140 --> 00:11:19.350
And we treat everything
as if it's a black box.
00:11:19.351 --> 00:11:23.400
So actually what's happening
inside the black box is
00:11:23.401 --> 00:11:25.290
there's a series of time steps.
00:11:25.291 --> 00:11:27.320
At each time step you're
sampling an action and
00:11:27.321 --> 00:11:29.648
then generating the next
state and reward.
00:11:29.649 --> 00:11:31.000
But we're gonna
ignore all that and
00:11:31.001 --> 00:11:33.650
just say we put the parameter
into that black box,
00:11:33.651 --> 00:11:36.120
get a return which
is stochastic.
00:11:36.121 --> 00:11:39.965
And use that information to try
to adjust our parameter vector.
00:11:39.966 --> 00:11:43.854
So first I'm just gonna
introduce an algorithm called
00:11:43.855 --> 00:11:45.574
the cross entropy method,
00:11:45.575 --> 00:11:48.840
which is simple to
understand intuitively.
00:11:48.841 --> 00:11:51.630
And that'll give you
an intuition for what's kind of
00:11:51.631 --> 00:11:54.490
going on in these derivative
free optimization methods, and
00:11:54.491 --> 00:11:58.230
then I'll talk a little bit
more about how to justify it.
00:11:59.380 --> 00:12:06.180
First, let me just introduce
the algorithm. The main state
00:12:06.181 --> 00:12:11.401
here is
00:12:11.402 --> 00:12:15.555
our current estimate of
the parameter vector.
00:12:15.556 --> 00:12:21.152
This estimate is
stored in two vectors,
00:12:21.153 --> 00:12:27.167
a mean vector and a standard
deviation vector.
00:12:27.168 --> 00:12:30.350
So that means we have
a Gaussian with diagonal
00:12:30.351 --> 00:12:34.157
covariance over parameters, and
what we're gonna do is
00:12:34.158 --> 00:12:36.870
improve this distribution
over parameters.
00:12:38.480 --> 00:12:40.850
So each iteration,
what we do is,
00:12:40.851 --> 00:12:45.570
first we collect n samples, which
are random parameter vectors, by sampling
00:12:45.571 --> 00:12:47.000
from this Gaussian distribution.
00:12:48.520 --> 00:12:52.890
Then for each parameter vector,
we perform one episode with it,
00:12:54.850 --> 00:12:56.980
that means we just
run our policy for
00:12:56.981 --> 00:13:00.030
one episode until
the episode terminates.
00:13:00.031 --> 00:13:02.430
And that gives us a return,
00:13:02.431 --> 00:13:06.493
which I'll just call the return,
capital R.
00:13:06.494 --> 00:13:10.387
Then after we've done
these n episodes,
00:13:10.388 --> 00:13:15.961
we take the top p% where p is
a parameter of this algorithm.
00:13:15.962 --> 00:13:22.222
We take the top p%,
like the top 20%. Now we
00:13:22.223 --> 00:13:26.537
fit a new Gaussian distribution
to the elite set and
00:13:26.538 --> 00:13:28.350
update mu and sigma.
00:13:30.170 --> 00:13:33.219
So fitting it, that just means
usually we do maximum likelihood
00:13:33.220 --> 00:13:34.129
on that elite set.
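[Editor's note: the loop just described can be written as a short sketch. This is an illustrative implementation, not the speaker's code; the small noise floor added to sigma is the trick from the "noisy cross-entropy method" Tetris paper mentioned later, included here so the toy example doesn't collapse prematurely.]

```python
import numpy as np

def cross_entropy_method(f, dim, n_iters=60, batch_size=100,
                         elite_frac=0.2, extra_std=0.05, rng=None):
    """Maximize a noisy black-box function f(theta) -> scalar return.

    Maintains a diagonal Gaussian (mu, sigma) over parameter vectors.
    Each iteration: sample a population, run one episode per sample,
    keep the top elite_frac fraction by return, and refit mu/sigma to
    that elite set by maximum likelihood.
    """
    rng = rng or np.random.default_rng(0)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = int(batch_size * elite_frac)
    for _ in range(n_iters):
        thetas = mu + sigma * rng.standard_normal((batch_size, dim))
        returns = np.array([f(th) for th in thetas])
        elite = thetas[np.argsort(returns)[-n_elite:]]   # top p% by return
        mu = elite.mean(axis=0)                          # MLE of the mean
        sigma = elite.std(axis=0) + extra_std            # MLE of std + noise floor
    return mu   # final mean = best estimate of the optimal parameter
```

Here `f` stands in for "run one episode with this parameter vector and report the return".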
00:13:38.428 --> 00:13:42.334
I mean if you are familiar
with evolutionary algorithms,
00:13:42.335 --> 00:13:46.001
this will be a kind of familiar
structure where you have
00:13:46.002 --> 00:13:49.428
a certain population which
is defined by a Gaussian
00:13:49.429 --> 00:13:50.960
distribution.
00:13:50.961 --> 00:13:54.590
And then you pick the members
of the population that have
00:13:54.591 --> 00:13:59.530
highest fitness and you fit
a new distribution to those
00:13:59.531 --> 00:14:01.600
highly fit members of
the population, and
00:14:01.601 --> 00:14:04.450
then you repeat
the process again.
00:14:04.451 --> 00:14:08.220
So it sort of makes
sense that you're just
00:14:08.221 --> 00:14:10.870
moving your distribution towards
the good parts of parameter
00:14:10.871 --> 00:14:12.260
space over time.
00:14:12.261 --> 00:14:15.590
And so you're doing this for
a while, and then at the very end,
00:14:15.591 --> 00:14:17.960
after you've done all
of your iterations,
00:14:17.961 --> 00:14:23.280
you return the final mean
parameter which is your best
00:14:23.281 --> 00:14:25.610
estimate for
the optimal parameter.
00:14:27.415 --> 00:14:32.710
So this simple algorithm is
called the Cross Entropy Method
00:14:32.711 --> 00:14:38.600
for certain historical reasons
related to how it was derived.
00:14:38.601 --> 00:14:43.089
It's actually a method that was
derived for something else, for
00:14:43.090 --> 00:14:47.748
integral estimation, but
I won't go into the details of
00:14:47.749 --> 00:14:48.268
that.
00:14:48.269 --> 00:14:52.230
But the funny thing is
sometimes this algorithm works
00:14:52.231 --> 00:14:55.313
embarrassingly well.
In particular,
00:14:55.314 --> 00:14:59.891
the game of Tetris has, for a
long time, been a benchmark task
00:14:59.892 --> 00:15:05.001
in reinforcement learning or in
approximate dynamic programming,
00:15:05.002 --> 00:15:09.895
which sort of means the same
thing as reinforcement learning.
00:15:09.896 --> 00:15:17.540
And so this table is
taken from the paper,
00:15:17.541 --> 00:15:21.060
Learning Tetris using the noisy
cross entropy method, and
00:15:21.061 --> 00:15:25.180
it shows that there are several
previous papers that
00:15:25.181 --> 00:15:29.470
used reinforcement learning
techniques to play Tetris and
00:15:29.471 --> 00:15:33.175
they got certain scores that
are in the low thousands.
00:15:33.176 --> 00:15:40.840
So it's not the best
performance.
00:15:40.841 --> 00:15:45.160
The best performance is attained
by this hand-coded policy or
00:15:45.161 --> 00:15:51.010
by genetic algorithms, and
there's
00:15:51.011 --> 00:15:53.320
a lot of hand engineering
in both of these cases.
00:15:53.321 --> 00:15:57.643
And the paper that this table's
taken from showed that if you
00:15:57.644 --> 00:16:01.626
just use the cross entropy
method, you get a really high
00:16:01.627 --> 00:16:06.289
score which is competitive with
the best performance obtained by
00:16:06.290 --> 00:16:09.028
these kinds of
hand-crafted methods.
00:16:09.029 --> 00:16:12.846
And you get much better
results than any
00:16:12.847 --> 00:16:16.899
of the reinforcement
learning methods.
00:16:18.900 --> 00:16:22.673
So in particular methods that
are based on policy iteration
00:16:22.674 --> 00:16:26.522
and policy gradients, so when
I say reinforcement learning
00:16:26.523 --> 00:16:30.294
methods, I mean methods that
exploit the fact that you're
00:16:30.295 --> 00:16:33.538
trying to solve an MDP and
try to take advantage
00:16:33.539 --> 00:16:37.180
of the temporal
structure of the problem.
00:16:37.181 --> 00:16:42.210
So I think the reason why
this happens is because
00:16:42.211 --> 00:16:46.090
the cross entropy method
is optimizing, well,
00:16:46.091 --> 00:16:48.010
mostly optimizing
the right thing,
00:16:48.011 --> 00:16:51.250
which is the performance of
the parameter vector, over many
00:16:51.251 --> 00:16:54.650
thousands of time steps that
it takes in an episode.
00:16:55.710 --> 00:16:59.020
Whereas, the reinforcement
learning methods have to
00:16:59.021 --> 00:17:00.910
make some approximations.
00:17:00.911 --> 00:17:05.070
So even though they're
exploiting the structure of
00:17:05.071 --> 00:17:08.740
the problem, they end up
optimizing the wrong thing.
00:17:08.741 --> 00:17:11.320
In particular, they have
to introduce what's called
00:17:11.321 --> 00:17:16.151
a discount factor where
you sort of ignore long
00:17:16.152 --> 00:17:19.190
term dependencies and I think
that's probably what kills them.
00:17:20.450 --> 00:17:22.990
But you don't need to worry
about a discount factor with
00:17:22.991 --> 00:17:24.790
the cross entropy method.
00:17:24.791 --> 00:17:26.991
I'm gonna talk about
the discount factor shortly.
00:17:31.590 --> 00:17:34.429
So, anyway, a few years ago,
00:17:34.430 --> 00:17:39.891
there was a paper showing that
by being really careful about
00:17:39.892 --> 00:17:44.808
how you proximate your value
function, you actually
00:17:44.809 --> 00:17:49.630
can get good results on
Tetris using the [INAUDIBLE].
00:17:49.631 --> 00:17:51.729
But I still think it's
sort of a win for
00:17:51.730 --> 00:17:54.815
cross entropy because of how
simple the algorithm is and
00:17:54.816 --> 00:17:57.652
how it works pretty well
without that much tuning.
00:17:59.449 --> 00:18:03.467
So I didn't really justify
the cross entropy method,
00:18:03.468 --> 00:18:07.110
though it sort of makes
sense intuitively.
00:18:07.111 --> 00:18:12.240
But now I'm going to talk about
a very closely related approach
00:18:12.241 --> 00:18:15.809
which can be mathematically
derived pretty
00:18:15.810 --> 00:18:21.390
straightforwardly, which
is to do stochastic
00:18:21.391 --> 00:18:24.900
gradient ascent on the
parameters of your distribution.
00:18:24.901 --> 00:18:27.600
And I'll show how the cross-
entropy method is closely
00:18:27.601 --> 00:18:28.610
related to this shortly.
00:18:31.230 --> 00:18:35.320
So the idea here is in cross
entropy method, we had this
00:18:35.321 --> 00:18:38.680
Gaussian distribution and
we're iteratively moving this
00:18:38.681 --> 00:18:42.150
Gaussian distribution to give
better and better returns.
00:18:42.151 --> 00:18:45.159
So can we just do Stochastic
gradient descent or
00:18:45.160 --> 00:18:49.310
Stochastic gradient ascent on
these distribution parameters?
00:18:51.080 --> 00:18:54.157
So that's what we're
gonna do here.
00:18:54.158 --> 00:19:00.078
So let's say that we
use this distribution
00:19:00.079 --> 00:19:05.673
P sub mu over our
policy parameters.
00:19:07.845 --> 00:19:11.364
Well, mu could be
the mean of a Gaussian or
00:19:11.365 --> 00:19:15.840
it could be the mean and
the standard deviation.
00:19:15.841 --> 00:19:18.980
Let's pretend for now we're just
optimizing over the mean and
00:19:18.981 --> 00:19:21.180
we're gonna fix
standard deviation.
00:19:21.181 --> 00:19:25.020
And the same method can be used
to adjust the standard deviation
00:19:25.021 --> 00:19:27.460
but usually that's not
very useful anyway.
00:19:27.461 --> 00:19:30.127
So let's just say we're
optimizing over the mean of
00:19:30.128 --> 00:19:31.530
a Gaussian distribution.
00:19:33.661 --> 00:19:38.869
So the return depends on the
policy parameter and some noise.
00:19:38.870 --> 00:19:43.572
So let's just say we've defined
everything so that the return is
00:19:43.573 --> 00:19:47.505
its deterministic function
of the parameter vector,
00:19:47.506 --> 00:19:49.655
theta, and some noise zeta.
00:19:49.656 --> 00:19:55.605
So now our optimization problem
can be written as follows:
00:19:55.606 --> 00:20:00.994
we're maximizing, over mu,
the expectation of our return
00:20:00.995 --> 00:20:07.192
where the parameter theta is sampled
according to this distribution.
00:20:07.193 --> 00:20:13.023
So, well this is a little bit
not straightforward, because
00:20:13.024 --> 00:20:19.330
our return is unknown and
it's possibly nondifferentiable.
00:20:19.331 --> 00:20:24.815
So we don't know this function R
and it's probably not a nice and
00:20:24.816 --> 00:20:29.310
smooth function of
the parameter vector theta.
00:20:29.311 --> 00:20:30.750
So I mean,
00:20:30.751 --> 00:20:34.650
especially let's say we have
the discrete action space.
00:20:34.651 --> 00:20:39.046
Then clearly,
R isn't gonna be a differentiable
00:20:39.047 --> 00:20:42.269
function of our
policy parameter.
00:20:42.270 --> 00:20:46.473
So it's kind of not obvious
that we can even do this or
00:20:46.474 --> 00:20:49.540
that we can even
differentiate this.
00:20:49.541 --> 00:20:51.760
But as it turns out we can.
00:20:51.761 --> 00:20:56.260
And it's a very basic and
important
00:20:56.261 --> 00:20:59.100
technique called the Score
Function Gradient Estimator.
00:20:59.101 --> 00:21:01.750
Also known as the Likelihood
Ratio Gradient Estimator.
00:21:03.020 --> 00:21:08.030
The way it looks is the gradient
of this expectation
00:21:08.031 --> 00:21:13.740
equals the expectation of grad
log probability times return.
00:21:13.741 --> 00:21:17.703
So what we've done here,
I mean,
00:21:17.704 --> 00:21:22.717
the idea of the gradient
estimator is you're
00:21:22.718 --> 00:21:28.199
trying to estimate the gradient
of this expectation and
00:21:28.200 --> 00:21:32.047
we wanna write that
as an expectation of
00:21:32.048 --> 00:21:35.804
something which we
can then compute.
00:21:35.805 --> 00:21:37.695
We can get a Monte Carlo
estimate of it.
00:21:38.850 --> 00:21:42.476
So here we can get our
Monte Carlo estimate by
00:21:42.477 --> 00:21:46.887
just collecting a bunch of
samples and then taking for
00:21:46.888 --> 00:21:51.606
each sample we take grad log
probability times return.
00:21:51.607 --> 00:21:55.870
And we take the empirical
average of those.
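[Editor's note: the Monte Carlo estimate just described can be sketched as follows, for the Gaussian case discussed later in the talk. This is an illustrative sketch under the assumption of a diagonal Gaussian with fixed sigma; all names are made up.]

```python
import numpy as np

def score_function_gradient(R, mu, sigma, n_samples=10000, rng=None):
    """Monte Carlo estimate of grad_mu E_{theta ~ N(mu, sigma^2 I)}[R(theta)].

    For a Gaussian, grad_mu log p(theta) = (theta - mu) / sigma**2, so each
    sample contributes its grad-log-probability times its scalar return,
    and we take the empirical average over all samples.
    """
    rng = rng or np.random.default_rng(0)
    thetas = mu + sigma * rng.standard_normal((n_samples, mu.size))
    grad_log_p = (thetas - mu) / sigma ** 2            # per-sample score
    returns = np.array([R(th) for th in thetas])
    return (grad_log_p * returns[:, None]).mean(axis=0)
```

In practice each call to `R` would be one episode rollout with that parameter vector.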
00:21:55.871 --> 00:21:59.720
I'll derive this in a sec, but
before I derive that, let's just
00:21:59.721 --> 00:22:03.890
look at the relationship between
this score function gradient
00:22:03.891 --> 00:22:07.560
estimator and what we were doing
in the cross-entropy method.
00:22:07.561 --> 00:22:09.420
Here, in the score function
gradient estimator,
00:22:09.421 --> 00:22:12.100
we've got grad log
probability times return.
00:22:13.270 --> 00:22:16.170
Whereas in the cross-entropy
method we have an objective
00:22:16.171 --> 00:22:18.750
that we optimize each iteration.
00:22:18.751 --> 00:22:21.990
I actually didn't write this in
the algorithm, the pseudocode.
00:22:21.991 --> 00:22:25.907
But, the way we fit the
distribution to our data points
00:22:25.908 --> 00:22:29.690
is we do maximum likelihood, or
we maximize the log probability
00:22:29.691 --> 00:22:32.790
of our elite set under
the parameterized distribution.
00:22:33.893 --> 00:22:39.310
So what that means is, what
we have here is, if we look at
00:22:39.311 --> 00:22:43.480
all of our samples, not just the
elite set, what we have is log
00:22:43.481 --> 00:22:48.550
probability of the sample times
some function of the return.
00:22:48.551 --> 00:22:52.564
And in this case, what we did in
the cross-entropy method is we
00:22:52.565 --> 00:22:56.601
just took the indicator that
the return R was above a threshold.
00:22:56.602 --> 00:23:00.286
So f here is zero or one.
00:23:00.287 --> 00:23:01.210
And so
00:23:01.211 --> 00:23:06.420
you can see that if we got rid
of this thresholding function f,
00:23:06.421 --> 00:23:11.230
then the objective we're
maximizing in each batch, in each
00:23:11.231 --> 00:23:16.150
iteration of cross-entropy, is
the same as the objective we've
00:23:16.151 --> 00:23:19.440
differentiated with the score
function gradient estimator.
00:23:23.320 --> 00:23:26.178
So you can see that they're
gonna do similar updates, but
00:23:26.179 --> 00:23:28.394
maybe with the cross-entropy
method,
00:23:28.395 --> 00:23:31.426
we'll be able to take a bigger
step each iteration because you
00:23:31.427 --> 00:23:33.849
don't have to just take
a single gradient step.
00:23:37.409 --> 00:23:41.438
Actually another illuminating
connection is between the score
00:23:41.439 --> 00:23:43.528
function gradient estimator and
00:23:43.529 --> 00:23:45.850
between doing
finite differences.
00:23:45.851 --> 00:23:50.158
So it turns out that the score
function gradient estimator is
00:23:50.159 --> 00:23:54.208
very similar to
another kind of intuitive
00:23:54.209 --> 00:23:58.173
algorithm that involves
taking a random direction and
00:23:58.174 --> 00:24:02.224
doing a directional derivative
in that direction, and
00:24:02.225 --> 00:24:06.728
then doing a gradient step using
that direction derivative.
00:24:06.729 --> 00:24:10.846
So I'll show you that this
score function gradient
00:24:10.847 --> 00:24:15.552
estimator comes out to be almost
the same as a certain kind
00:24:15.553 --> 00:24:18.570
of directional [INAUDIBLE].
00:24:18.571 --> 00:24:23.410
So to see that, let's just compute
explicitly what happens with our
00:24:23.411 --> 00:24:26.690
score function gradient
estimator in the Gaussian case
00:24:26.691 --> 00:24:29.080
where we're optimizing over
the mean of the Gaussian.
00:24:30.430 --> 00:24:33.381
So here's our log
probability and
00:24:33.382 --> 00:24:38.071
then when we take the gradient
of the log probability.
00:24:38.072 --> 00:24:43.070
So, it's just parameter minus
mean over standard deviation
00:24:43.071 --> 00:24:46.626
squared, and then
here's our estimator:
00:24:46.627 --> 00:24:50.292
the return R times the parameter
minus the mean.
00:24:50.293 --> 00:24:53.159
Okay, so it doesn't look so
illuminating yet.
00:24:53.160 --> 00:24:58.150
But now let's say we do
antithetic sampling.
00:24:58.151 --> 00:25:03.403
So if you have a symmetric
distribution such as a Gaussian,
00:25:03.404 --> 00:25:06.562
you can compute
your expectation.
00:25:06.563 --> 00:25:10.320
You can get the same
expectation by collecting
00:25:10.321 --> 00:25:14.269
two samples at a time,
where you collect a random
00:25:14.270 --> 00:25:17.063
sample from your
distribution and
00:25:17.064 --> 00:25:21.493
then the [INAUDIBLE] of that
sample, or the mirror image
00:25:21.494 --> 00:25:25.660
of that sample around
the center of the distribution.
00:25:25.661 --> 00:25:29.120
So this is just a general
kind of Monte Carlo
00:25:29.121 --> 00:25:32.010
estimation technique called
antithetic sampling.
00:25:32.011 --> 00:25:36.100
So here, what we do, each step
we take a pair of samples,
00:25:36.101 --> 00:25:41.620
mean plus noise times
standard deviation,
00:25:41.621 --> 00:25:44.300
and mean minus noise
times standard deviation.
00:25:44.301 --> 00:25:49.890
Here, z is from normal zero one,
meaning
00:25:49.891 --> 00:25:55.250
mean zero and covariance
equal to the identity.
00:25:55.251 --> 00:26:01.252
And sigma is, this is our
vector of standard deviations.
00:26:03.373 --> 00:26:06.190
So this is
an element-wise product.
00:26:06.191 --> 00:26:08.840
Or you can just consider
the scalar case for
00:26:08.841 --> 00:26:12.110
now if you don't wanna
worry about vectors.
00:26:12.111 --> 00:26:16.289
So anyway, what we're gonna do
here is write down the score
00:26:16.290 --> 00:26:19.801
function gradient estimator,
but where you used
00:26:19.802 --> 00:26:23.749
antithetic sampling, so
you got two samples at a time.
00:26:23.750 --> 00:26:28.457
So writing down the
average of our two samples,
00:26:28.458 --> 00:26:34.061
we have return at one sample
minus return at the other sample
00:26:34.062 --> 00:26:40.130
times our z, which is our noise,
which is just normal zero one.
00:26:41.740 --> 00:26:45.400
So what we have here is,
00:26:45.401 --> 00:26:48.430
we have z is effectively
just a random direction.
00:26:49.750 --> 00:26:51.904
Because we just sampled
a random vector.
00:26:51.905 --> 00:26:56.713
And then we look at
the return measured
00:26:56.714 --> 00:27:01.672
at the mean plus z and
mean minus z,
00:27:01.673 --> 00:27:04.844
scaled appropriately.
00:27:04.845 --> 00:27:09.330
So, this is just our
finite difference.
00:27:09.331 --> 00:27:15.670
And we're dividing by the size
of the perturbation, sigma.
00:27:17.970 --> 00:27:20.878
So here, this is just our
directional derivative.
00:27:20.879 --> 00:27:24.837
And then we're multiplying it
by the random direction that
00:27:24.838 --> 00:27:26.825
the derivative is taken in.
00:27:26.826 --> 00:27:30.520
So this is the kind of intuitive
algorithm, sometimes called
00:27:30.521 --> 00:27:34.011
simultaneous perturbation
stochastic approximation.
00:27:34.012 --> 00:27:35.593
It's not exactly, I don't know,
00:27:35.594 --> 00:27:37.580
I don't wanna get into
specifics of that.
00:27:37.581 --> 00:27:39.941
But this is a kind of
intuitive algorithm.
00:27:43.988 --> 00:27:47.483
And you can see that it's
exactly what you get by doing
00:27:47.484 --> 00:27:51.214
the score function gradient
estimator with antithetic
00:27:51.215 --> 00:27:52.000
sampling.
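This equivalence is easy to verify numerically. Below is a minimal numpy sketch, not from the talk; the objective `F` is a made-up stand-in for the return. It shows that the score function estimator, averaged over an antithetic pair, matches the random-direction finite difference exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

def F(theta):
    # Deterministic objective over parameters (stand-in for the return).
    return -np.sum((theta - 1.0) ** 2)

mu = np.zeros(3)   # mean of the Gaussian search distribution
sigma = 0.1        # shared standard deviation

z = rng.standard_normal(3)  # random direction
theta_plus, theta_minus = mu + sigma * z, mu - sigma * z

# Score-function estimator averaged over the antithetic pair:
# 0.5 * [F(theta+) * (theta+ - mu) + F(theta-) * (theta- - mu)] / sigma^2
score_est = 0.5 * (F(theta_plus) * (sigma * z) + F(theta_minus) * (-sigma * z)) / sigma**2

# Random-direction finite difference: directional derivative times the direction.
fd_est = (F(theta_plus) - F(theta_minus)) / (2 * sigma) * z

print(np.allclose(score_est, fd_est))  # identical, by the algebra above
```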
00:27:54.040 --> 00:27:58.781
And yeah, so if you use
different noise samples to do
00:27:58.782 --> 00:28:01.359
both function evaluations,
00:28:01.360 --> 00:28:05.807
then you could potentially
have a lot of variance.
00:28:05.808 --> 00:28:10.847
Because it could be that
most of the difference
00:28:10.848 --> 00:28:16.146
between these returns is
coming from the noise,
00:28:16.147 --> 00:28:22.235
and it's not coming from
the difference in parameter.
00:28:22.236 --> 00:28:28.470
So you end up having an error
term that goes like 1/sigma.
00:28:28.471 --> 00:28:32.340
So if you have a very small
sigma, and you use two different
00:28:32.341 --> 00:28:37.680
noise samples, then you
get a pretty bad estimate.
00:28:37.681 --> 00:28:42.548
But if it's the same noise
sample, so it's zeta equals zeta
00:28:42.549 --> 00:28:46.567
prime here, then you get
a much better estimate.
00:28:46.568 --> 00:28:52.565
Because what happens is, well,
under almost all conditions,
00:28:52.566 --> 00:28:57.573
these two terms are gonna
be positively correlated.
00:28:59.706 --> 00:29:04.897
So meaning when you take
an expectation over zeta,
00:29:04.898 --> 00:29:10.460
these terms are gonna be
very positively correlated.
00:29:10.461 --> 00:29:14.210
So if we take the difference of
them, we're gonna cancel out
00:29:14.211 --> 00:29:19.400
some of the noise
that comes from zeta.
00:29:19.401 --> 00:29:20.210
So anyway,
00:29:20.211 --> 00:29:26.300
we get much lower variance if
we're able to freeze the noise.
00:29:26.301 --> 00:29:32.870
And so if we have
access to a simulator,
00:29:32.871 --> 00:29:36.050
for example, then we can use
the same random seed and
00:29:36.051 --> 00:29:39.280
evaluate two different parameter
vectors for that random seed.
00:29:39.281 --> 00:29:44.016
Of course, in the real world,
we can't do that.
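The effect of freezing the simulator noise can be sketched with a toy stochastic objective. This is an illustrative example, not from the talk, and `noisy_return` is a made-up stand-in for a simulator: evaluating two nearby parameters under the same seed makes the simulation noise cancel in the difference.

```python
import numpy as np

def noisy_return(theta, rng):
    # Toy stochastic "simulator": true objective plus simulation noise.
    return -(theta - 1.0) ** 2 + rng.normal(0.0, 1.0)

master = np.random.default_rng(0)
theta_a, theta_b = 0.50, 0.51  # two nearby parameter values

same_seed, diff_seed = [], []
for i in range(2000):
    # Same seed: both evaluations see identical simulator noise, so it cancels.
    d = noisy_return(theta_b, np.random.default_rng(i)) - noisy_return(theta_a, np.random.default_rng(i))
    same_seed.append(d)
    # Different draws: the noise does not cancel.
    d = noisy_return(theta_b, master) - noisy_return(theta_a, master)
    diff_seed.append(d)

print(np.var(same_seed), np.var(diff_seed))  # frozen-seed variance is far smaller
```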
00:29:44.017 --> 00:29:47.170
Okay, so basically,
00:29:47.171 --> 00:29:52.482
that's sort of three
different views
00:29:52.483 --> 00:29:56.971
on derivative-free optimization.
00:29:56.972 --> 00:30:03.871
And we have these kinda similar
algorithms that we get.
00:30:03.872 --> 00:30:06.883
I'll just quickly show how
you derive the score function
00:30:06.884 --> 00:30:10.156
gradient estimator, in case
you haven't seen this before.
00:30:10.157 --> 00:30:14.071
So this is the estimator that
I wrote on the previous slide.
00:30:14.072 --> 00:30:18.907
So you can just derive it
by writing the expectation
00:30:18.908 --> 00:30:23.855
as an integral and then you
just swap the integral and
00:30:23.856 --> 00:30:27.678
the gradient and
you can do that whenever
00:30:27.679 --> 00:30:32.191
the probability
distribution is continuous.
00:30:32.192 --> 00:30:34.250
Actually it doesn't even
have to be differentiable,
00:30:34.251 --> 00:30:36.277
it just has to be almost
everywhere differentiable.
00:30:36.278 --> 00:30:44.603
So, as long as it's almost
everywhere differentiable you can
00:30:46.072 --> 00:30:50.817
swap the integral and derivative and
then you get this.
00:30:50.818 --> 00:30:54.567
Then what you have to do is,
so by the chain rule,
00:30:54.568 --> 00:30:57.403
when you take grad
log probability,
00:30:57.404 --> 00:31:02.078
that's equal to grad probability
divided by probability.
00:31:02.079 --> 00:31:06.794
So what we've done here
going from equation two to
00:31:06.795 --> 00:31:11.512
equation three here is
we've just multiplied and
00:31:11.513 --> 00:31:15.119
divided by
the probability density.
00:31:15.120 --> 00:31:18.412
Okay, so, then the reason
we did that is so
00:31:18.413 --> 00:31:22.357
we can turn the integral
back into an expectation.
00:31:22.358 --> 00:31:26.382
Because we need to have
some probability measure
00:31:26.383 --> 00:31:28.940
multiplying the integrands.
00:31:28.941 --> 00:31:32.975
So we turn it back into an
expectation and now we got this
00:31:32.976 --> 00:31:36.854
expectation of grad log
probability times return.
00:31:39.385 --> 00:31:40.782
So, that's, okay,
00:31:40.783 --> 00:31:44.539
that's my overview on
derivative-free optimization.
00:31:44.540 --> 00:31:50.616
There are several different
versions, which are all similar.
00:31:50.617 --> 00:31:55.670
Actually the algorithm where
you do random direction
00:31:55.671 --> 00:32:00.950
finite differences and
you use stochastic gradient and
00:32:00.951 --> 00:32:05.554
that is usually called
evolution strategies and
00:32:05.555 --> 00:32:07.811
this is from the 70s.
00:32:07.812 --> 00:32:13.331
Then it is also related
to this SPSA algorithm.
00:32:13.332 --> 00:32:17.966
So cross entropy method has
become pretty popular for
00:32:17.967 --> 00:32:22.498
applications where you have
a fairly small number of
00:32:22.499 --> 00:32:23.736
parameters.
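For reference, the cross-entropy method itself fits in a few lines. This is a generic sketch with a made-up quadratic objective, not the exact variant being discussed: sample from a diagonal Gaussian, keep an elite fraction of the samples, and refit the Gaussian to the elites.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta):
    # Objective to maximize (stand-in for an episode return).
    return -np.sum((theta - np.array([1.0, -2.0])) ** 2)

mu, sigma = np.zeros(2), np.ones(2)  # diagonal-Gaussian search distribution
n_samples, n_elite = 100, 10

for _ in range(50):
    thetas = mu + sigma * rng.standard_normal((n_samples, 2))
    scores = np.array([f(t) for t in thetas])
    elite = thetas[np.argsort(scores)[-n_elite:]]  # top-scoring samples
    # Refit the distribution to the elite set (small floor keeps sigma > 0).
    mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3

print(mu)  # converges near the optimum [1, -2]
```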
00:32:23.737 --> 00:32:28.495
And there's actually another
version of it which is even more
00:32:28.496 --> 00:32:33.531
popular called covariance matrix
adaptation, where instead of
00:32:33.532 --> 00:32:38.577
using a diagonal covariance, you
use a full covariance matrix.
00:32:38.578 --> 00:32:41.832
But there are also some other
tricks which help get a better
00:32:41.833 --> 00:32:44.624
estimate of this full
matrix because otherwise,
00:32:44.625 --> 00:32:47.824
you wouldn't be able to estimate
it from a single batch.
00:32:47.825 --> 00:32:52.032
So CMA, covariance matrix
adaptation is a way of
00:32:52.033 --> 00:32:56.045
adaptively updating
your covariance matrix.
00:32:56.046 --> 00:33:00.182
And this is very effective in
a lot of problems where you have
00:33:00.183 --> 00:33:03.096
this fairly small
number of parameters.
00:33:03.097 --> 00:33:07.996
And these are also pretty
closely related to some
00:33:07.997 --> 00:33:13.734
algorithms that have been more
focused on the robotics or
00:33:13.735 --> 00:33:17.322
reinforcement learning setting,
00:33:17.323 --> 00:33:22.237
reward-weighted regression and
the PoWER algorithm.
00:33:22.238 --> 00:33:26.620
So, CMA has actually
been very popular and
00:33:26.621 --> 00:33:32.344
successful for optimizing
locomotion controllers for
00:33:32.345 --> 00:33:36.605
simulated robots and
also a little bit for
00:33:36.606 --> 00:33:40.763
real robots, but
I won't go into that.
00:33:40.764 --> 00:33:44.996
So, for example,
there's this long running
00:33:44.997 --> 00:33:49.229
challenge of playing soccer,
robot soccer,
00:33:49.230 --> 00:33:53.799
which is done in simulation and
in the real world.
00:33:53.800 --> 00:33:58.651
But the leading teams in
the simulation league,
00:33:58.652 --> 00:34:04.092
what they do is they optimize
a bunch of controllers for
00:34:04.093 --> 00:34:08.366
walking and for
getting up off the ground.
00:34:08.367 --> 00:34:12.404
First, they parameterize
this controller, so
00:34:12.405 --> 00:34:16.836
they have some kind of 20
parameter thing, which is
00:34:16.837 --> 00:34:21.955
a trajectory which is gonna be
followed with PID where you have
00:34:21.956 --> 00:34:26.698
to tune a bunch of targets and
gains for PID controllers.
00:34:26.699 --> 00:34:31.396
So they parameterize this thing,
some kind of trajectory, and
00:34:31.397 --> 00:34:34.792
then use CMA to optimize
over the parameters.
00:34:34.793 --> 00:34:40.327
And they do this for walking and
getting up, and so on.
00:34:40.328 --> 00:34:46.655
So and then they use CMA
to optimize this parameter.
00:34:46.656 --> 00:34:48.143
And it's also used for
00:34:48.144 --> 00:34:51.597
optimizing locomotion
controllers, more in the
00:34:51.598 --> 00:34:54.424
context of computer graphics
where you're trying to get
00:34:54.425 --> 00:34:57.410
physically realistic
walking controllers.
00:34:57.411 --> 00:35:01.950
So, it's been quite successful
in this kind of application.
00:35:01.951 --> 00:35:05.977
There's a recent paper showing
that Evolution Strategies
00:35:05.978 --> 00:35:09.117
performs pretty well on
the Atari benchmark,
00:35:09.118 --> 00:35:12.834
which has been used for
deep reinforcement learning.
00:35:13.841 --> 00:35:16.656
So previously, Q learning and
00:35:16.657 --> 00:35:21.989
policy gradient methods have
been shown to do well on Atari.
00:35:21.990 --> 00:35:26.110
And Salimans et al. showed that
if you just used the evolution
00:35:26.111 --> 00:35:30.840
strategy using the same kind
of network which was used for
00:35:30.841 --> 00:35:35.500
policy gradients, it actually
achieves competitive performance.
00:35:35.501 --> 00:35:39.198
They might be a little bit
worse, in terms of sample
00:35:39.199 --> 00:35:44.899
complexity, but actually, well,
it's much easier to parallelize.
00:35:44.900 --> 00:35:51.300
So it might have its advantages,
in some cases.
00:35:51.301 --> 00:35:55.050
And it's also not as
sensitive to certain,
00:35:55.051 --> 00:36:00.027
it doesn't require as many sort
of approximations as policy
00:36:00.028 --> 00:36:04.830
gradient methods, where
you have to worry about
00:36:04.831 --> 00:36:08.620
things like frame skip, where you
have to repeat actions for
00:36:08.621 --> 00:36:11.230
multiple frames, otherwise
it doesn't learn very well.
00:36:12.370 --> 00:36:15.430
And in policy gradient methods
you also have to worry about,
00:36:15.431 --> 00:36:17.880
you have to introduce
discount factors.
00:36:17.881 --> 00:36:21.460
So the nice thing about
evolution strategies is,
00:36:21.461 --> 00:36:24.615
or the black box
optimization approach is,
00:36:24.616 --> 00:36:28.640
you don't have to worry about
the time horizon as much.
00:36:28.641 --> 00:36:31.751
So you don't have to worry about
what's called credit assignment,
00:36:31.752 --> 00:36:34.165
where you're trying to figure
out which actions were
00:36:34.166 --> 00:36:35.506
responsible for the rewards,
00:36:35.507 --> 00:36:38.310
since you're just directly
optimizing in parameter space.
00:36:40.460 --> 00:36:42.991
Okay, so that's it for
evolution strategies.
00:36:42.992 --> 00:36:48.124
I would stop for questions, but
I'm not sure if it's going to
00:36:48.125 --> 00:36:53.552
work very well, because I can't
see everyone, and I'm not sure
00:36:53.553 --> 00:36:58.903
if there are microphones, I'm not
sure if I'd be able to hear you.
00:36:58.904 --> 00:37:01.060
Okay, yeah.
00:37:01.061 --> 00:37:03.766
Yeah, then why don't I just,
I'll just open it up, if anyone
00:37:03.767 --> 00:37:06.081
has any questions before we
go on to the next section.
00:37:11.138 --> 00:37:12.291
[LAUGH] All right,
I think we are ready to go to
00:37:12.292 --> 00:37:12.800
the next section.
00:37:12.801 --> 00:37:17.760
[LAUGH] Okay,
to the next section, great.
00:37:17.761 --> 00:37:18.260
So, Policy Gradient Methods.
00:37:20.094 --> 00:37:23.684
Okay, so the problem we're
trying to solve again,
00:37:23.685 --> 00:37:26.377
is to optimize the
expected reward,
00:37:26.378 --> 00:37:29.162
expected return of
a parameterized policy.
00:37:29.163 --> 00:37:32.224
And what we're going
to do here for
00:37:32.225 --> 00:37:37.822
policy gradient method is we're
gonna have a policy parameter
00:37:37.823 --> 00:37:42.902
theta which is fixed at any
given point in the algorithm.
00:37:42.903 --> 00:37:45.791
And we update it over time,
but instead of keeping
00:37:45.792 --> 00:37:48.944
track of the distribution
over policy parameters we're
00:37:48.945 --> 00:37:52.043
gonna have a single policy
parameter at any given time.
00:37:52.044 --> 00:37:55.110
And we're gonna estimate
the gradient of this expectation
00:37:55.111 --> 00:37:57.870
with respect to
the policy parameter.
00:37:57.871 --> 00:38:00.110
So, one way to think
about this is,
00:38:00.111 --> 00:38:04.082
now our noise is in action space
instead of in parameter space.
00:38:07.528 --> 00:38:11.116
So the intuition of what we're
going to do is we collected
00:38:11.117 --> 00:38:15.750
a bunch of trajectories
using our parameterized policy.
00:38:15.751 --> 00:38:20.110
And, well, there's three
different ways of thinking of it
00:38:20.111 --> 00:38:21.228
which are all correct.
00:38:21.229 --> 00:38:24.800
And they correspond to
different kinds of methods.
00:38:24.801 --> 00:38:27.570
One is we try to make the good
trajectories more probable.
00:38:29.800 --> 00:38:32.340
So, if we do that,
00:38:32.341 --> 00:38:35.160
we don't have to worry about
the individual actions.
00:38:35.161 --> 00:38:36.420
We just have a whole
trajectory and
00:38:36.421 --> 00:38:38.490
we're trying to make the good
ones more probable and
00:38:38.491 --> 00:38:39.570
the bad ones less probable.
00:38:41.050 --> 00:38:45.340
Okay, then what we can try to
do to get better policy grade
00:38:45.341 --> 00:38:48.250
estimates, is figure out
which actions were good and
00:38:48.251 --> 00:38:49.360
which ones were bad.
00:38:49.361 --> 00:38:51.980
And then we try to make the good
actions more probable and
00:38:51.981 --> 00:38:53.110
the bad actions less probable.
00:38:55.090 --> 00:38:57.920
And then there's a third
class of methods which
00:38:59.690 --> 00:39:02.190
actually try to.
00:39:02.191 --> 00:39:04.580
These only really make sense if
you have a continuous action
00:39:04.581 --> 00:39:08.990
space, and
these methods will try to
00:39:08.991 --> 00:39:11.970
push your actions towards
slightly better actions.
00:39:15.020 --> 00:39:18.256
So you try to estimate.
00:39:18.257 --> 00:39:21.045
These require some kind of value
function which will tell you
00:39:21.046 --> 00:39:22.465
how good a given action is, and
00:39:22.466 --> 00:39:24.390
it'll push that toward
a better action.
00:39:24.391 --> 00:39:27.010
So I don't think I'm gonna
have time to get into those,
00:39:27.011 --> 00:39:30.760
but that's another
interesting class of methods.
00:39:30.761 --> 00:39:37.982
So now I'll get into how you
actually do that mathematically.
00:39:37.983 --> 00:39:42.432
And the way it works, we're
gonna use the same kind of score
00:39:42.433 --> 00:39:47.332
function grading estimator that
we used in the derivative-free
00:39:47.333 --> 00:39:49.215
optimization setting.
00:39:49.216 --> 00:39:51.044
Except now our random variable,
00:39:51.045 --> 00:39:53.140
instead of being
a policy parameter,
00:39:53.141 --> 00:39:54.471
is the whole trajectory.
00:39:54.472 --> 00:39:57.866
Remember how I said that the
noise now is in action space or
00:39:57.867 --> 00:40:00.810
trajectory space rather
than parameter space.
00:40:00.811 --> 00:40:05.640
So now our random variable
is the whole trajectory.
00:40:05.641 --> 00:40:09.685
If we just go through
the motions to derive the score
00:40:09.686 --> 00:40:14.295
function estimator, we get
a perfectly valid estimator here,
00:40:14.296 --> 00:40:18.528
which says that the gradient
of the expected return is
00:40:18.529 --> 00:40:23.232
just the expectation of grad log
probability of the trajectory,
00:40:23.233 --> 00:40:27.559
times the return of the trajectory,
which, just to recall,
00:40:27.560 --> 00:40:30.408
the return is just
the sum of rewards.
00:40:30.409 --> 00:40:35.955
So, that's not so
useful by itself because we've
00:40:35.956 --> 00:40:41.502
got this probability of
the whole trajectory and,
00:40:41.503 --> 00:40:45.243
by the way, it's worth talking,
00:40:45.244 --> 00:40:49.629
at this point,
about which variables or
00:40:49.630 --> 00:40:54.274
which quantities
are assumed to be known.
00:40:54.275 --> 00:40:57.036
And which ones are not known,
00:40:57.037 --> 00:41:00.912
when we're trying to
run our algorithm.
00:41:00.913 --> 00:41:05.940
So for example,
I mean you have to assume.
00:41:09.780 --> 00:41:12.180
I've just written down
this optimization problem.
00:41:12.181 --> 00:41:13.940
But I haven't said
anything about
00:41:16.020 --> 00:41:20.420
whether we know what the MDP
parameters are and so on.
00:41:20.421 --> 00:41:23.380
So do we know what the
transition probabilities are?
00:41:23.381 --> 00:41:26.000
And do we know what
the reward functioning is?
00:41:26.001 --> 00:41:32.710
Or are these things quantities
that we have to estimate?
00:41:33.800 --> 00:41:36.970
We're going to be interested
here in the setting that,
00:41:36.971 --> 00:41:39.470
we don't know what the
transition probabilities are,
00:41:39.471 --> 00:41:42.790
and we don't know what
the reward function is.
00:41:42.791 --> 00:41:45.910
But we have to learn anyway.
00:41:45.911 --> 00:41:50.000
So, our learning
algorithm has to either
00:41:50.001 --> 00:41:53.730
learn these quantities or somehow
just learn the policy without
00:41:53.731 --> 00:41:55.180
having access to
those quantities.
00:41:56.310 --> 00:42:02.000
So actually, it's worth
appreciating that well, this
00:42:02.001 --> 00:42:05.190
is kinda nontrivial, but even if
you have this problem with all
00:42:05.191 --> 00:42:08.810
sorts of unknown quantities that
we can still optimize the policy
00:42:08.811 --> 00:42:13.060
for it, even though we don't
know what the dynamics model is.
00:42:13.061 --> 00:42:17.700
Which exists inside
of this expectation
00:42:17.701 --> 00:42:21.050
because it determines what our
trajectories are gonna be.
00:42:21.051 --> 00:42:24.557
So it's just worth noting
that this is quite different
00:42:24.558 --> 00:42:28.141
from the regime in
other kinds of machine
00:42:28.142 --> 00:42:31.575
learning, most other machine
learning problems not
00:42:31.576 --> 00:42:33.042
or bandit problems.
00:42:33.043 --> 00:42:36.241
So in bandit problems you
assume you have this unknown
00:42:36.242 --> 00:42:39.740
distribution of returns
from the different arms.
00:42:39.741 --> 00:42:43.150
Something like supervised
learning usually you
00:42:43.151 --> 00:42:46.490
have a finite set
of training data.
00:42:46.491 --> 00:42:49.958
But given your training data,
you have analytic access to
00:42:49.959 --> 00:42:52.870
the full objective you're
trying to maximize.
00:42:52.871 --> 00:42:55.359
So you can compute gradients
of your objective.
00:42:55.360 --> 00:42:57.630
And even second derivatives,
and so on.
00:42:59.140 --> 00:43:03.460
Like here, we don't have
the analytic access to this
00:43:03.461 --> 00:43:07.206
expectation that we're
trying to optimize.
00:43:07.207 --> 00:43:10.970
It has quantities in
it that we don't know.
00:43:10.971 --> 00:43:14.625
So, when we write out the
probability of the trajectory,
00:43:14.626 --> 00:43:18.205
we can just use the chain rule
probabilities to write it as
00:43:18.206 --> 00:43:18.962
a product.
00:43:18.963 --> 00:43:23.732
The probability of the initial
state,
00:43:23.733 --> 00:43:28.184
times the product over time of
the probability of the action
00:43:28.185 --> 00:43:34.020
times the probability of
the next state, and so on.
00:43:35.480 --> 00:43:38.590
Okay so, we know that
probability, we have access to
00:43:38.591 --> 00:43:41.230
this part, the probability of the
action given the state and
00:43:41.231 --> 00:43:46.970
parameter vector because
the policy is part of our agent.
00:43:46.971 --> 00:43:49.478
This is what we're learning.
00:43:49.479 --> 00:43:53.640
So we can take derivatives and
so forth of this.
00:43:53.641 --> 00:43:56.464
But we don't know what the
transition probabilities are.
00:43:56.465 --> 00:44:01.298
We don't know what
the initial distribution is.
00:44:02.360 --> 00:44:07.061
So that's gonna cause
trouble if we wanted to
00:44:07.062 --> 00:44:12.515
somehow analytically
evaluate this expectation.
00:44:12.516 --> 00:44:19.000
But we could of course do
Monte Carlo estimates of it.
00:44:19.001 --> 00:44:20.210
And that's what we have to do.
00:44:22.260 --> 00:44:25.410
Okay, so when we take
the log of this probability
00:44:25.411 --> 00:44:27.340
it turns the product into a sum.
00:44:27.341 --> 00:44:31.499
And then we take the gradient
of this log probability.
00:44:31.500 --> 00:44:36.207
And now all of the terms that
don't depend on theta drop out
00:44:36.208 --> 00:44:37.946
of this expression.
00:44:37.947 --> 00:44:45.029
So that's nice because all
those terms have disappeared.
00:44:45.030 --> 00:44:47.324
And then we're left with, and
00:44:47.325 --> 00:44:52.275
you're then just plugging that
back into our estimator formula.
00:44:52.276 --> 00:44:53.764
We get the following.
00:44:53.765 --> 00:44:55.191
We have.
00:44:55.192 --> 00:44:59.840
Expectation over
trajectories of return times
00:44:59.841 --> 00:45:04.264
the gradient of
the probabilities.
00:45:04.265 --> 00:45:12.650
So this formula, let's just look
at this formula for a second.
00:45:12.651 --> 00:45:16.437
This formula means that you,
so you can get a Monte Carlo
00:45:16.438 --> 00:45:19.903
estimate of the gradient of the expected
return by taking a bunch of
00:45:19.904 --> 00:45:23.626
trajectories, generating
a bunch of trajectories.
00:45:23.627 --> 00:45:27.263
And then for each one, we take
the return of that trajectory
00:45:27.264 --> 00:45:29.882
and then we take the grad
log probability of
00:45:29.883 --> 00:45:32.431
the actions along
those trajectories and
00:45:32.432 --> 00:45:35.443
we multiply return times
grad log probability.
00:45:37.580 --> 00:45:41.860
And that gives us our empirical
estimate of the policy gradient.
00:45:41.861 --> 00:45:46.254
And the interpretation is,
sometimes we're gonna get
00:45:46.255 --> 00:45:51.210
a high return, sometimes
we're gonna get a low return.
00:45:51.211 --> 00:45:54.120
We want to maximize
the log probability
00:45:54.121 --> 00:45:56.280
of the samples that
have a high return.
00:45:56.281 --> 00:46:00.941
So we're gonna move our
distribution over to put
00:46:00.942 --> 00:46:04.920
more probability mass
on those samples or
00:46:04.921 --> 00:46:08.794
probability density
on those samples.
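As a concrete sketch of this estimator, here is a one-step (bandit) example, not from the talk, where the Monte Carlo policy gradient for a softmax policy can be compared against the analytic gradient of the expected return:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

theta = np.array([0.2, -0.1])    # policy logits for a 2-armed bandit
rewards = np.array([1.0, 0.0])   # deterministic reward per arm

probs = softmax(theta)
n = 100_000
actions = rng.choice(2, size=n, p=probs)
R = rewards[actions]

# Score-function (policy gradient) estimate: mean of R * grad log pi(a).
# For a softmax policy, grad_theta log pi(a) = onehot(a) - probs.
grad_log = np.eye(2)[actions] - probs
pg_estimate = (R[:, None] * grad_log).mean(axis=0)

# Analytic gradient of E[R] = probs . rewards, for comparison.
analytic = probs * (rewards - probs @ rewards)

print(pg_estimate, analytic)  # should agree to roughly two decimal places
```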
00:46:08.795 --> 00:46:16.070
Okay, so that's the most basic kind of
gradient estimator we can get.
00:46:16.071 --> 00:46:20.520
So what we can do
now is try to derive
00:46:20.521 --> 00:46:22.500
lower variance
gradient estimators.
00:46:22.501 --> 00:46:26.260
Meaning an estimator that
has the same mean but,
00:46:26.261 --> 00:46:29.019
it's gonna have
lower variance.
00:46:29.020 --> 00:46:31.585
So it's gonna be
a better estimator and
00:46:31.586 --> 00:46:35.410
it's gonna require fewer
samples to get a good estimate.
00:46:37.230 --> 00:46:42.739
The way we do that is we
can run through the exact
00:46:42.740 --> 00:46:48.250
same argument to get
the gradient estimator for
00:46:48.251 --> 00:46:50.947
a single reward term.
00:46:50.948 --> 00:46:54.191
In the previous slide,
we were trying to get the grad
00:46:54.192 --> 00:46:56.570
expectation of
the sum of rewards.
00:46:56.571 --> 00:46:59.940
Now we're just going to get grad
expectation of a single reward.
00:46:59.941 --> 00:47:03.453
So grad expectation of
this r sub t prime.
00:47:03.454 --> 00:47:05.822
So just going through the motions,
00:47:05.823 --> 00:47:10.354
we get that it's equal to r
sub t prime times the sum of
00:47:10.355 --> 00:47:14.580
grad log probability of
the preceding actions.
00:47:17.270 --> 00:47:21.559
Because the subsequent actions
aren't going to affect
00:47:23.040 --> 00:47:24.240
this reward at all.
00:47:24.241 --> 00:47:29.683
So we don't need to worry
about the subsequent actions.
00:47:29.684 --> 00:47:34.149
So, then we wanna get the grad
expectation of the sum of
00:47:34.150 --> 00:47:35.959
rewards, capital R.
00:47:35.960 --> 00:47:39.790
So we just sum this
formula over T prime.
00:47:39.791 --> 00:47:43.519
Sorry, I said sum it over T,
I meant sum it over T prime.
00:47:43.520 --> 00:47:48.982
So okay, that leaves us, so
you just do the bookkeeping
00:47:48.983 --> 00:47:54.448
there and what we get is the sum
of reward sub T prime times
00:47:54.449 --> 00:47:59.935
the sum of grad log probability
of preceding actions.
00:47:59.936 --> 00:48:05.227
Or you can rearrange it to
the grad log probability
00:48:05.228 --> 00:48:10.267
times the sum of the rewards
after that action,
00:48:10.268 --> 00:48:13.970
so sum of following rewards.
00:48:13.971 --> 00:48:17.990
So the interpretation of this
formula is if you get a high
00:48:17.991 --> 00:48:21.510
rewards, that means that your
preceding actions were good,
00:48:21.511 --> 00:48:23.099
so you should make
them more probable.
00:48:24.780 --> 00:48:28.366
This formula says that if,
for a given action,
00:48:28.367 --> 00:48:31.594
if the sum of following
rewards were high,
00:48:31.595 --> 00:48:35.911
then we should increase
the probability of that action.
00:48:35.912 --> 00:48:37.714
And these formulas,
00:48:37.715 --> 00:48:41.727
it's just a trivial
rearrangement of the sum.
00:48:41.728 --> 00:48:45.462
It's just bookkeeping to
switch between these two
00:48:45.463 --> 00:48:46.281
formulas.
00:48:47.845 --> 00:48:51.724
So this is a better estimate
of the policy gradient which
00:48:51.725 --> 00:48:53.299
respects causality.
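The rearrangement between those two formulas is, as said, pure bookkeeping over a double sum, which a short sketch with random stand-in values can confirm:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10
r = rng.normal(size=T)       # rewards r_0 ... r_{T-1}
g = rng.normal(size=(T, 3))  # grad log pi(a_t | s_t), one vector per step

# Form 1: each reward times the sum of grad-log-probs of preceding actions.
form1 = sum(r[tp] * g[: tp + 1].sum(axis=0) for tp in range(T))

# Form 2: each grad-log-prob times the sum of rewards from that step onward.
form2 = sum(g[t] * r[t:].sum() for t in range(T))

print(np.allclose(form1, form2))  # True: just a rearrangement of the double sum
```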
00:48:56.082 --> 00:49:00.144
Now we can reduce the variance.
00:49:00.145 --> 00:49:03.752
So the name of the game is
variance reduction because
00:49:03.753 --> 00:49:06.712
the policy gradient is
what we care about.
00:49:06.713 --> 00:49:11.680
And now we just wanna get
estimated as well as possible,
00:49:11.681 --> 00:49:16.649
as fast as possible, which
means reducing the variance
00:49:16.650 --> 00:49:20.050
as much as possible
of our estimator.
00:49:22.580 --> 00:49:25.450
One really good variance
reduction technique is to
00:49:25.451 --> 00:49:26.630
introduce what's
called a baseline.
00:49:28.380 --> 00:49:31.750
So the intuition behind
the baseline is,
00:49:33.750 --> 00:49:38.408
in our previous formula,
let's look at the last formula.
00:49:38.409 --> 00:49:42.124
We've got grad log probability
of the actions times the sum
00:49:42.125 --> 00:49:43.616
of following rewards.
00:49:43.617 --> 00:49:48.248
So if we make
an empirical estimate of this,
00:49:48.249 --> 00:49:52.560
we're going to adjust
the probability of every
00:49:52.561 --> 00:49:55.707
single action we've performed.
00:49:55.708 --> 00:49:58.890
In our sample trajectories
to try to increase it.
00:50:00.210 --> 00:50:01.660
Let me say that again.
00:50:01.661 --> 00:50:06.402
So we collect a bunch of
trajectories then we plug
00:50:06.403 --> 00:50:10.343
those trajectories
into this formula.
00:50:10.344 --> 00:50:14.423
And so that means we would take
grad log probability of each
00:50:14.424 --> 00:50:16.783
action times the sum
of our rewards.
00:50:16.784 --> 00:50:19.643
And we take a gradient
descent step.
00:50:19.644 --> 00:50:22.908
So that means we're increasing
the probability of, sorry,
00:50:22.909 --> 00:50:25.481
let's say all our
rewards are positive, but
00:50:25.482 --> 00:50:29.338
let's just suppose they're
positive in a certain problem.
00:50:29.339 --> 00:50:30.102
That then,
00:50:30.103 --> 00:50:34.080
then our gradient descent
step is gonna try to increase
00:50:34.081 --> 00:50:37.280
the probability of every
single action we just took.
00:50:37.281 --> 00:50:40.490
But that seems like a bad idea,
00:50:40.491 --> 00:50:42.750
because some of the actions were
good, and some of them were bad.
00:50:42.751 --> 00:50:45.300
In fact, half of them were good,
and half of them were bad.
00:50:45.301 --> 00:50:50.050
So the idea behind the baseline
is, we want to only increase
00:50:50.051 --> 00:50:54.903
the probability of an action if
it was better than expected.
00:50:54.904 --> 00:50:59.213
So, only about half of
the actions were better than
00:50:59.214 --> 00:51:00.223
expected.
00:51:00.224 --> 00:51:03.720
So we should only increase the
probability of the good ones.
00:51:03.721 --> 00:51:07.989
So now instead of just
multiplying grad log
00:51:07.990 --> 00:51:13.024
probability by this sum of following
rewards, we use that sum
00:51:13.025 --> 00:51:18.277
minus some baseline, where
the baseline is an estimate, and
00:51:18.278 --> 00:51:23.202
what will turn out to be
a great choice for the baseline
00:51:23.203 --> 00:51:27.922
is the expected return
before taking that action.
00:51:27.923 --> 00:51:32.787
So okay, so one nice fact
is that for any choice of
00:51:32.788 --> 00:51:38.140
baseline the gradient
estimator we get is unbiased.
00:51:42.680 --> 00:51:47.612
I won't show the derivation
in this slide show,
00:51:47.613 --> 00:51:52.554
but it's very easy to
derive by just looking at.
00:51:52.555 --> 00:51:55.764
Well, it turns out if you
take the expectation of
00:51:55.765 --> 00:51:59.525
the grad log probability where
we're sampling from pi,
00:51:59.526 --> 00:52:01.814
this expectation comes out to 0.
00:52:01.815 --> 00:52:03.816
And you can see that, yeah,
00:52:03.817 --> 00:52:06.994
that's sort of a well
known fact.
00:52:06.995 --> 00:52:11.947
So any choice of baseline
that we plug in here is
00:52:11.948 --> 00:52:14.874
gonna not introduce bias.
00:52:14.875 --> 00:52:18.185
And it turns out that a near
optimal choice to plug in here
00:52:18.186 --> 00:52:19.955
is just the expected return.
00:52:19.956 --> 00:52:23.970
Conditioned on
the state s sub t.
00:52:23.971 --> 00:52:26.570
And intuitively that's because
00:52:26.571 --> 00:52:30.050
we only want to increase
the probability of action t
00:52:30.051 --> 00:52:34.100
if the returns that you got
here were better than expected.
00:52:34.101 --> 00:52:39.345
So that means we should plug
in the expected return here.
00:52:39.346 --> 00:52:43.884
So actually it turns out this
isn't the optimal choice that
00:52:43.885 --> 00:52:46.230
minimizes variance.
00:52:46.231 --> 00:52:49.070
The optimal choice actually
involves some kind of weighted
00:52:49.071 --> 00:52:49.970
estimate of the return.
00:52:51.110 --> 00:52:56.401
As it turns out, it's near
optimal, and,
00:52:56.402 --> 00:53:02.510
for all practical purposes,
it's the best you're gonna do.
00:53:02.511 --> 00:53:07.411
You can also, another way to
derive this is to just write out
00:53:07.412 --> 00:53:10.460
the formula for the variance.
00:53:10.461 --> 00:53:14.873
So the variance of
some random variable x is just the
00:53:14.874 --> 00:53:19.985
expectation of x squared minus
the square of the expectation of x.
00:53:19.986 --> 00:53:24.762
So if you write that out, and
you make some approximations
00:53:24.763 --> 00:53:29.931
about things being independent
then you'll find that you just
00:53:29.932 --> 00:53:34.515
want to minimize the expectation
of this term squared,
00:53:34.516 --> 00:53:38.670
of the sum of rewards
minus the baseline, squared.
00:53:38.671 --> 00:53:42.599
So that means you should
just take this baseline to
00:53:42.600 --> 00:53:45.219
be the mean of this quantity,
for
00:53:45.220 --> 00:53:48.697
the best possible least-squares
fit to this quantity.
00:53:51.716 --> 00:53:53.927
So that's a very useful and
00:53:53.928 --> 00:53:57.396
powerful variance
reduction technique.
00:53:57.397 --> 00:54:01.800
And that'll get you a lot
of mileage in practice.
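To make the baseline trick concrete, here is a small numerical sketch (not from the talk: the toy one-step problem, the Gaussian policy, and all names below are my own illustration). It checks both claims above: any baseline leaves the gradient estimator unbiased, and subtracting the expected return sharply reduces variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step problem (illustrative): Gaussian policy over a scalar
# action with mean theta, reward R(a) = 10 + a.  The large constant
# makes the no-baseline estimator noisy.
theta = 0.0

def grad_estimates(baseline, n=100_000):
    a = rng.normal(theta, 1.0, size=n)   # sample actions from the policy
    rewards = 10.0 + a
    score = a - theta                    # grad-log probability of a unit Gaussian
    return score * (rewards - baseline)  # per-sample policy gradient estimates

g_plain = grad_estimates(baseline=0.0)
g_base = grad_estimates(baseline=10.0)   # baseline = expected return

# Both means approach the true gradient (1.0 here), so either choice is
# unbiased; the expected-return baseline slashes the variance.
print(g_plain.mean(), g_plain.var())
print(g_base.mean(), g_base.var())
```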
00:54:03.540 --> 00:54:06.130
The other technique that
gets you a lot of mileage is
00:54:06.131 --> 00:54:07.390
to introduce discounts.
00:54:09.130 --> 00:54:10.980
How am I doing on time,
by the way?
00:54:10.981 --> 00:54:14.776
Does this session
actually end now?
00:54:14.777 --> 00:54:15.397
>> Ten minutes.
00:54:15.398 --> 00:54:17.757
>> Okay, okay good.
00:54:17.758 --> 00:54:23.257
I'll finish up in the next
ten minutes, okay.
00:54:23.258 --> 00:54:26.234
So another method that's gonna give
you a lot of mileage is to
00:54:26.235 --> 00:54:28.460
introduce a discount factor.
00:54:28.461 --> 00:54:34.495
So, so far I haven't
added any discount factors.
00:54:34.496 --> 00:54:38.487
And,
as I said before,
00:54:38.488 --> 00:54:41.750
I've been talking about
the episodic setting where you
00:54:41.751 --> 00:54:45.025
just care about the sum of
rewards over a whole episode.
00:54:45.026 --> 00:54:49.274
Sometimes, if you read
reinforcement learning books,
00:54:49.275 --> 00:54:53.701
you'll see a different kind of
problem setting where
00:54:53.702 --> 00:54:58.138
you're actually optimizing
a discounted sum of rewards.
00:54:58.139 --> 00:55:02.581
And I think this is
the reasonable thing to do.
00:55:02.582 --> 00:55:09.610
But often the time horizon
you care about is very long.
00:55:09.611 --> 00:55:12.920
So for example,
you might care about, well,
00:55:12.921 --> 00:55:16.870
in that Tetris benchmark I
was talking about earlier,
00:55:18.730 --> 00:55:21.100
the game might take
a million time steps.
00:55:21.101 --> 00:55:26.750
So that means if you want
a discount factor that actually
00:55:26.751 --> 00:55:30.230
includes all those [INAUDIBLE]
steps and counts all of them,
00:55:30.231 --> 00:55:40.084
it would need to be 0.99999,
right?
00:55:40.085 --> 00:55:43.354
So anyway, what the discount
factor is,
00:55:47.909 --> 00:55:51.632
is it downgrades
the effect of future rewards.
00:55:51.633 --> 00:55:53.803
Sorry, I didn't explain.
00:55:53.804 --> 00:55:57.883
In reinforcement learning
textbooks what you
00:55:57.884 --> 00:56:02.591
sometimes see is you care
about the sum of rewards,
00:56:02.592 --> 00:56:05.430
this r sub t times
gamma to the t.
00:56:05.431 --> 00:56:12.672
So the problem with that is
that that's basically
00:56:12.673 --> 00:56:16.690
going to ignore all rewards
that are far into the future.
00:56:16.691 --> 00:56:19.550
So I think that this
problem setting doesn't
00:56:19.551 --> 00:56:20.880
usually make sense.
00:56:20.881 --> 00:56:23.680
And I think it makes
more sense to think about
00:56:23.681 --> 00:56:26.650
the discount factor as
a variance reduction parameter.
00:56:26.651 --> 00:56:31.438
Or at least in the policy
gradient context,
00:56:31.439 --> 00:56:35.960
what it does is it gives
you a [INAUDIBLE].
00:56:35.961 --> 00:56:40.664
So the way we're gonna introduce
the discount factor here
00:56:40.665 --> 00:56:44.792
is we're going to modify
our previous formula for
00:56:44.793 --> 00:56:50.020
the policy gradient by
introducing the discount factor.
00:56:50.021 --> 00:56:53.760
So recall that our
previous formula,
00:56:53.761 --> 00:56:58.740
what we have is sum of
rewards minus baseline.
00:56:58.741 --> 00:57:03.580
So here, we're just gonna
modify it to be sum of rewards
00:57:03.581 --> 00:57:06.940
multiplied by this discount
factor, minus baseline.
00:57:06.941 --> 00:57:10.990
So this discount factor,
you can see that the discount
00:57:10.991 --> 00:57:16.270
factor measures the difference
in time between this reward and
00:57:16.271 --> 00:57:18.980
the action you're looking at.
00:57:18.981 --> 00:57:21.770
So it's based on
the assumption that
00:57:22.890 --> 00:57:27.130
the action only affects rewards
in the near future, but
00:57:27.131 --> 00:57:30.170
after that, the effect of
this action is washed out.
00:57:31.590 --> 00:57:32.862
And it's not going to affect
00:57:32.863 --> 00:57:36.111
anything after that. We can measure
whether this action was good or bad
00:57:36.112 --> 00:57:40.698
by looking
at some short horizon
00:57:40.699 --> 00:57:42.537
after the action.
00:57:42.538 --> 00:57:45.679
But the system kind of forgets
00:57:45.680 --> 00:57:50.200
the effect of that
action after a while.
00:57:50.201 --> 00:57:54.421
And okay, so if we use this
discount factor, then,
00:57:54.422 --> 00:57:58.126
by the same arguments
I was providing before,
00:57:58.127 --> 00:58:01.745
we want the baseline to be
the expected discounted
00:58:01.746 --> 00:58:06.246
sum of the rewards so that we
can maximally reduce variance.
00:58:06.247 --> 00:58:11.153
And you can also
think about this discount
00:58:11.154 --> 00:58:15.460
factor as doing
credit assignment by
00:58:15.461 --> 00:58:20.248
assuming that the action
deserves no credit for
00:58:20.249 --> 00:58:24.920
rewards that are far
into the future.
00:58:24.921 --> 00:58:28.595
So if your discount
factor is 0.99,
00:58:28.596 --> 00:58:32.595
then 100 time steps
later you're sort of
00:58:32.596 --> 00:58:37.583
downgrading everything by
a factor of 1/e and so on.
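The discounted return described here can be computed with a single backward pass over an episode's rewards; a minimal sketch (function name is mine):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...,
    computed in one backward pass over a single episode."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# With gamma = 0.99, a reward 100 steps in the future is weighted by
# 0.99**100, roughly 1/e, matching the credit-assignment intuition above.
rets = discounted_returns([1.0, 1.0, 1.0], gamma=0.99)
print(rets)
```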
00:58:37.584 --> 00:58:43.984
So that's the idea of using
this discount factor, and
00:58:43.985 --> 00:58:51.650
now it's just worth looking at
the formulas we've shown so far.
00:58:51.651 --> 00:58:57.503
And one way to interpret them
is as multiplying grad-log
00:58:57.504 --> 00:59:03.940
probability times an estimate
of the advantage function.
00:59:03.941 --> 00:59:06.160
So I haven't defined
the advantage function yet
00:59:06.161 --> 00:59:10.070
but you can think of
it this way.
00:59:10.071 --> 00:59:12.618
What I mean by this is that
the advantage estimate is
00:59:12.619 --> 00:59:15.239
just telling you whether
the action was good or bad.
00:59:15.240 --> 00:59:19.205
So we're just multiplying
grad-log probability times some
00:59:19.206 --> 00:59:23.430
scalar which estimated whether
the action was good or bad.
00:59:23.431 --> 00:59:24.600
And in this case,
00:59:24.601 --> 00:59:28.570
we're just taking that scalar
to be the discounted sum of
00:59:28.571 --> 00:59:33.330
following rewards minus
the baseline, okay.
00:59:33.331 --> 00:59:39.057
So putting everything together
that we've shown so far, what we
00:59:39.058 --> 00:59:44.687
get is the so-called Vanilla
Policy Gradient Algorithm.
00:59:44.688 --> 00:59:47.174
And so
the way this works is to iterate:
00:59:47.175 --> 00:59:50.626
first, collect a set of
trajectories by executing
00:59:50.627 --> 00:59:53.516
the current policy.
Then at each timestep,
00:59:53.517 --> 00:59:57.127
we compute the discounted
return, which is the sum of
00:59:57.128 --> 01:00:00.762
future rewards down-weighted
by discount factors.
01:00:00.763 --> 01:00:03.981
And we compute
the advantage estimate,
01:00:03.982 --> 01:00:08.153
which is defined as this
return minus the baseline.
01:00:08.154 --> 01:00:11.342
Then, we refit
the baseline by doing,
01:00:11.343 --> 01:00:15.119
this is just a nonlinear
regression problem.
01:00:15.120 --> 01:00:18.474
So we can set up a least
squares optimization,
01:00:18.475 --> 01:00:21.910
where we fit the baseline
to the returns.
01:00:21.911 --> 01:00:25.260
And then we update the policy by
01:00:25.261 --> 01:00:30.240
computing a policy gradient
estimate which is the sum
01:00:30.241 --> 01:00:34.297
of terms of the form
grad-log probability times advantage.
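The whole vanilla policy gradient loop just described can be sketched end to end on a toy problem (everything here is my own illustration: the stateless two-action environment, the scalar baseline, and all names; real implementations use neural networks and an optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy stateless environment: each episode lasts T steps; action 1
# gives reward 1, action 0 gives reward 0.
T, gamma, lr = 5, 0.99, 0.1
theta = np.zeros(2)   # logits of a 2-action softmax policy
baseline = 0.0        # scalar baseline, refit toward observed returns

for _ in range(300):
    # 1. Collect a trajectory by executing the current policy.
    probs = softmax(theta)
    actions = rng.choice(2, size=T, p=probs)
    rewards = actions.astype(float)

    # 2. Compute the discounted return at each timestep.
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running

    # 3. Advantage estimate = return minus baseline; refit the baseline.
    adv = returns - baseline
    baseline += 0.1 * (returns.mean() - baseline)

    # 4. Policy gradient step: sum of grad-log-prob times advantage.
    grad = np.zeros(2)
    for t in range(T):
        glp = -probs.copy()      # grad log pi(a) wrt softmax logits...
        glp[actions[t]] += 1.0   # ...is one-hot(a) minus probs
        grad += glp * adv[t]
    theta += lr * grad

final_probs = softmax(theta)
print(final_probs)  # the rewarding action should dominate
```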
01:00:34.298 --> 01:00:38.908
And actually there's
a slightly modified version of
01:00:38.909 --> 01:00:43.211
this, which is sort of a more
modern version that's
01:00:43.212 --> 01:00:47.720
compatible with very
efficient implementations:
01:00:47.721 --> 01:00:52.445
what we call the Advantage
Actor-Critic algorithm.
01:00:52.446 --> 01:00:56.105
So there's this paper,
Asynchronous Methods For
01:00:56.106 --> 01:01:00.788
Deep Reinforcement Learning,
which introduced what's called
01:01:00.789 --> 01:01:05.228
the Asynchronous Advantage
Actor-Critic algorithm A3C.
01:01:05.229 --> 01:01:10.970
So the asynchronous part
doesn't really matter here.
01:01:10.971 --> 01:01:13.629
It's just sort of for
performance reasons.
01:01:13.630 --> 01:01:17.224
What I'm presenting here
is what could be called
01:01:17.225 --> 01:01:21.960
the Advantage Actor-Critic which
is sort of the vanilla policy
01:01:21.961 --> 01:01:26.172
gradient algorithm with some
modifications that make it
01:01:26.173 --> 01:01:29.956
more compatible with
efficient implementation.
01:01:29.957 --> 01:01:35.328
So the way it works is
now instead of collecting
01:01:35.329 --> 01:01:40.289
full trajectories,
you collect a certain
01:01:40.290 --> 01:01:45.403
fixed number of timesteps for
each update.
01:01:45.404 --> 01:01:49.689
So you have your agent, and
it acts for
01:01:49.690 --> 01:01:53.986
T timesteps, such
as 20 timesteps.
01:01:53.987 --> 01:01:58.899
And then for each timestep
along that trajectory segment,
01:01:58.900 --> 01:02:04.108
you compute the return, where
you cap it off with the estimate
01:02:04.109 --> 01:02:08.431
of the value function at
the end of the trajectory and
01:02:08.432 --> 01:02:13.360
you compute the advantage
function also at each timestep.
01:02:14.740 --> 01:02:19.120
And so we're going to use
the return as the target for
01:02:19.121 --> 01:02:21.480
the value function in
your regression problem.
01:02:21.481 --> 01:02:24.900
And value function is
synonymous with baseline,
01:02:24.901 --> 01:02:26.620
if you're not that
familiar with it.
01:02:26.621 --> 01:02:29.417
So v here can also be thought
of as the baseline v that
01:02:29.418 --> 01:02:31.446
we were using in
the previous slide.
01:02:31.447 --> 01:02:35.657
So here we're gonna use R as
the target for the value function,
01:02:35.658 --> 01:02:39.090
and A as the estimated
advantage function.
01:02:39.091 --> 01:02:42.929
And we just set up this loss
function which has our policy
01:02:42.930 --> 01:02:47.422
gradient loss, which is grad log
probability times advantage,
01:02:47.423 --> 01:02:50.866
and then also a value
function regression error.
01:02:50.867 --> 01:02:55.834
And this can be plugged into
a stochastic gradient algorithm
01:02:55.835 --> 01:02:57.033
such as Adam.
01:02:57.034 --> 01:03:00.421
But this is exactly the same
thing we were doing before;
01:03:00.422 --> 01:03:03.807
it's just
a version that uses a fixed
01:03:03.808 --> 01:03:07.125
number of timesteps at once
instead of collecting
01:03:07.126 --> 01:03:08.415
full trajectories.
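The segment-based returns just described, where a value estimate caps off the end of the fixed-length segment, can be sketched like this (function and variable names are mine; real A2C fits a neural network value function):

```python
import numpy as np

def nstep_targets(rewards, v_end, gamma):
    """Discounted return over a fixed-length segment, bootstrapped with
    the value estimate v_end at the state after the last timestep:
    R_t = r_t + gamma*r_{t+1} + ... + gamma^(T-t)*v_end."""
    T = len(rewards)
    targets = np.zeros(T)
    running = v_end
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets

# A 3-step segment whose final state the critic values at 2.0.
targets = nstep_targets([0.0, 0.0, 1.0], v_end=2.0, gamma=0.9)
# Advantage estimates: targets minus the value baseline at each state.
values = np.array([1.0, 1.0, 1.0])
advantages = targets - values
print(targets, advantages)
```

The targets then serve double duty, exactly as in the loss above: regression targets for the value function and, minus the values, advantage weights for the policy gradient term.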
01:03:08.416 --> 01:03:14.746
Okay, and even though this
is very simple and basic,
01:03:14.747 --> 01:03:21.217
when you run it for a really
large number of timesteps,
01:03:21.218 --> 01:03:26.001
you can get pretty
impressive results,
01:03:26.002 --> 01:03:29.678
such as navigating a 3D maze.
01:03:29.679 --> 01:03:33.849
Just by learning a policy,
which is a convolutional neural
01:03:33.850 --> 01:03:38.940
network, which navigates a 3D
maze using only pixels as input.
01:03:38.941 --> 01:03:45.565
So this was from DeepMind,
about last year, from Vlad and
01:03:45.566 --> 01:03:51.050
collaborators, and
I have a video actually of that.
01:03:51.051 --> 01:03:53.670
So actually I'm out of time now.
01:03:53.671 --> 01:03:56.910
So I just want to show a couple
of videos just to sort of
01:03:58.530 --> 01:04:00.600
show what you can do
with these methods.
01:04:00.601 --> 01:04:05.500
So I'll show the video from
the [INAUDIBLE] player
01:04:05.501 --> 01:04:09.210
which is navigating a 3D
maze using pixels as input,
01:04:09.211 --> 01:04:10.860
using exactly this algorithm.
01:04:10.861 --> 01:04:14.937
And I'll also show a video of a,
01:04:14.938 --> 01:04:20.977
I didn't get to this
part of the presentation,
01:04:20.978 --> 01:04:26.413
but there's also some
methods you can do to
01:04:26.414 --> 01:04:32.610
get more information from
the data you collect.
01:04:32.611 --> 01:04:35.903
So you can do a better update
where instead of just doing
01:04:35.904 --> 01:04:39.268
a gradient update, you set up
a trust region problem and
01:04:39.269 --> 01:04:40.282
you solve that.
01:04:40.283 --> 01:04:44.996
So you set up a little
constraint optimization problem
01:04:44.997 --> 01:04:47.762
and you do an update on that,
and
01:04:47.763 --> 01:04:51.771
you end up getting better
sample efficiency.
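As a rough illustration of the constrained-update idea just mentioned, here is a crude backtracking sketch: take the largest step whose KL divergence from the current policy stays within the trust region. (This is my own toy version, not the actual TRPO algorithm, which uses a natural-gradient direction computed with a conjugate-gradient solver; all names are illustrative.)

```python
import numpy as np

def kl_constrained_step(theta, direction, kl_fn, delta, max_backtracks=10):
    """Largest step along `direction` whose KL from the current
    parameters stays within the trust region radius `delta`."""
    step = 1.0
    for _ in range(max_backtracks):
        candidate = theta + step * direction
        if kl_fn(theta, candidate) <= delta:
            return candidate
        step *= 0.5            # shrink the step until the constraint holds
    return theta               # give up: keep the old parameters

# Toy KL between unit-variance Gaussian policies with mean vectors theta:
kl = lambda a, b: 0.5 * float(np.sum((a - b) ** 2))
theta = np.zeros(2)
theta_new = kl_constrained_step(theta, np.array([2.0, 0.0]), kl, delta=0.01)
print(theta_new, kl(theta, theta_new))  # KL stays within 0.01
```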
01:04:51.772 --> 01:04:54.431
And also,
sort of more reliability.
01:04:54.432 --> 01:04:56.733
So I'm not gonna-
01:04:56.734 --> 01:04:58.518
>> [INAUDIBLE]
>> What?
01:04:58.519 --> 01:04:59.932
>> [INAUDIBLE]
>> Yeah, so
01:04:59.933 --> 01:05:05.422
I'll show the first video, and
then, so that's labelled 1-A3C.
01:05:05.423 --> 01:05:10.565
And then, the last thing that I
also wanna show the video that's
01:05:10.566 --> 01:05:15.041
E-PPO, which is sort of,
or actually show 2-TRPO,
01:05:15.042 --> 01:05:18.279
which is learning to
run in simulation,
01:05:18.280 --> 01:05:22.659
which is based on this trust
region kind of algorithm,
01:05:22.660 --> 01:05:25.930
which I didn't have
time to talk about.
01:05:25.931 --> 01:05:28.251
So Ian,
please show the first video.
01:05:32.147 --> 01:05:33.890
Okay, good.
01:05:33.891 --> 01:05:38.160
And then, once that's done,
start the 2-TRPO video.
01:05:38.161 --> 01:05:39.610
I can't see what you're showing,
01:05:39.611 --> 01:05:41.528
so [LAUGH] just tell
me when it's dark.
01:05:41.529 --> 01:05:42.671
>> Okay.
01:05:42.672 --> 01:05:43.690
>> Okay, cool.
01:05:56.571 --> 01:06:01.411
Okay, so this is a video
that collaborators and
01:06:01.412 --> 01:06:03.348
I were working on.
01:06:03.349 --> 01:06:08.708
Learning locomotion controllers
from scratch where you
01:06:08.709 --> 01:06:14.415
just have this 3D [INAUDIBLE]
simulator getting [INAUDIBLE].
01:06:14.416 --> 01:06:17.325
The input is just the joint
angles and joint velocities and
01:06:17.326 --> 01:06:20.359
so forth, and
the output is the joint torques.
01:06:20.360 --> 01:06:23.377
And so it's a pretty low level
representation of everything.
01:06:23.378 --> 01:06:27.927
And it starts off just falling
over instantly, but eventually,
01:06:27.928 --> 01:06:32.075
it starts falling over a little
bit more slowly, and so on.
01:06:32.076 --> 01:06:34.961
And so finally,
it's actually running and
01:06:34.962 --> 01:06:39.429
it can actually run indefinitely
long after learning is complete.
01:06:39.430 --> 01:06:43.920
And then,
01:06:43.921 --> 01:06:46.610
there's nothing in that
algorithm specific to running.
01:06:46.611 --> 01:06:51.102
So you can also use it on
a four-legged creature.
01:06:53.402 --> 01:06:55.848
And you can also use it to learn
how to get up off the ground.
01:06:55.849 --> 01:06:58.590
So we defined, I don't know if
we've gotten up to that part
01:06:58.591 --> 01:07:01.360
yet, but we defined
the reward function which is
01:07:01.361 --> 01:07:05.300
something like distance of your
head from some target height.
01:07:05.301 --> 01:07:09.907
And then, it'll learn how to do
some kind of ballistic motion
01:07:09.908 --> 01:07:12.839
with its arms and
get up off the ground.
01:07:12.840 --> 01:07:21.475
And so people have previously
done learned locomotion controllers.
01:07:21.476 --> 01:07:24.254
So we weren't the first
people to do that, but
01:07:24.255 --> 01:07:27.749
previous attempts have usually
used much more hand-coded
01:07:27.750 --> 01:07:30.887
representations where you
specify, for example,
01:07:30.888 --> 01:07:35.042
that you're trying to place your
feet left, right, left, right.
01:07:35.043 --> 01:07:38.921
And your network is just
determining where to put your
01:07:38.922 --> 01:07:40.480
foot, for example.
01:07:41.480 --> 01:07:44.410
Whereas we just had
a feedforward network
01:07:44.411 --> 01:07:47.134
mapping from observations
to actions directly.
01:07:47.135 --> 01:07:51.710
So yeah, so this was a video
01:07:51.711 --> 01:07:56.500
produced by my collaborator
recently where it's just using
01:07:59.020 --> 01:08:02.230
So basically you've got
this guy running around and
01:08:02.231 --> 01:08:04.280
being hit in the head by bricks.
01:08:04.281 --> 01:08:07.000
And he still manages to
get up off the ground.
01:08:07.001 --> 01:08:08.940
And he's running
towards the red ball.
01:08:08.941 --> 01:08:11.000
So you see these red
balls randomly appearing.
01:08:11.001 --> 01:08:13.950
So here your reward function
is not just to run forward.
01:08:13.951 --> 01:08:16.010
But it's to run
towards the ball.
01:08:16.011 --> 01:08:18.090
And the balls randomly appear.
01:08:18.091 --> 01:08:22.050
So I mean, when you see this I
think it becomes more possible
01:08:22.051 --> 01:08:27.270
you could actually get
to football or something if
01:08:27.271 --> 01:08:33.268
you really put these
skills together.
01:08:33.269 --> 01:08:36.550
So I'm pretty excited about
what we can do with that.
01:08:36.551 --> 01:08:39.900
That's using something called,
01:08:39.901 --> 01:08:42.780
we call it proximal
policy optimization.
01:08:42.781 --> 01:08:48.550
It's basically, it's like trust
region policy optimization,
01:08:48.551 --> 01:08:51.050
but you use a soft constraint
instead of a hard constraint,
01:08:51.051 --> 01:08:51.980
and actually,
01:08:51.981 --> 01:08:54.940
I'm gonna put out a paper on
that in the next week or so.
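A minimal sketch of the soft-constraint idea: penalize the surrogate objective with an estimated KL divergence instead of enforcing a hard bound. (The objective in the actual paper differs; everything here, including names, is my own illustration.)

```python
import numpy as np

def penalized_surrogate(logp_new, logp_old, advantages, beta):
    """Importance-weighted policy objective minus a KL penalty, both
    estimated from samples drawn under the old policy."""
    ratio = np.exp(logp_new - logp_old)    # pi_new / pi_old per sample
    surrogate = np.mean(ratio * advantages)
    kl_est = np.mean(logp_old - logp_new)  # sample estimate of KL(old||new)
    return surrogate - beta * kl_est

adv = np.array([1.0, -1.0, 0.5])
logp = np.log(np.array([0.5, 0.5, 0.5]))
# When the new policy equals the old one, the penalty vanishes and the
# objective reduces to the mean advantage.
print(penalized_surrogate(logp, logp, adv, beta=0.01))
```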
01:08:54.941 --> 01:09:02.100
So you can check it out, it'll
be on the arXiv in a few days.
01:09:02.101 --> 01:09:05.070
So yeah, I wish I could tell
you about more of this stuff,
01:09:05.071 --> 01:09:08.500
but we're out of time,
and it's late over there.
01:09:08.501 --> 01:09:11.865
So hope you all
enjoy your evening.
01:09:11.866 --> 01:09:19.130
>> [APPLAUSE]