WEBVTT

00:00:00.000 --> 00:00:01.680
>> Now, it's time to split our data

00:00:01.680 --> 00:00:03.780
into training data and testing data.

00:00:03.780 --> 00:00:06.375
I like to think about
this like an exam.

00:00:06.375 --> 00:00:08.700
Training data is like
a practice exam.

00:00:08.700 --> 00:00:10.260
The questions aren't exactly the

00:00:10.260 --> 00:00:12.090
same as they are going
to be on the test,

00:00:12.090 --> 00:00:14.805
but if you practice and
you get those right,

00:00:14.805 --> 00:00:17.520
you're more likely to get
test questions right.

00:00:17.520 --> 00:00:21.210
Similarly, the model will use
the training data to learn,

00:00:21.210 --> 00:00:23.970
and then we'll send the
testing data to the model,

00:00:23.970 --> 00:00:25.395
and it'll try to predict.

00:00:25.395 --> 00:00:28.275
We will compare the
predictions from the model

00:00:28.275 --> 00:00:31.020
with the actual values in
the testing data to score it,

00:00:31.020 --> 00:00:32.940
just like an instructor might use

00:00:32.940 --> 00:00:35.760
an exam key to see how
well you did on an exam.

00:00:35.760 --> 00:00:37.740
So let's get started splitting

00:00:37.740 --> 00:00:40.260
our data between training
data and testing data.

00:00:40.260 --> 00:00:42.530
As always, additional resources are

00:00:42.530 --> 00:00:45.690
linked on the screen and down
in the description below.

00:00:45.920 --> 00:00:50.310
This is actually fairly simple
inside a Jupyter Notebook.

00:00:50.310 --> 00:00:53.900
We're going to create a
local variable called train,

00:00:53.900 --> 00:00:58.430
and this will be all of our
data up through August 31st,

00:00:58.430 --> 00:01:01.885
2012, inclusive of that date.

00:01:01.885 --> 00:01:03.860
To make it easy to work with,

00:01:03.860 --> 00:01:06.965
we're going to convert that
into a pandas DataFrame.

00:01:06.965 --> 00:01:10.190
Now, why did we choose August 31st?

00:01:10.190 --> 00:01:12.035
Well, if we look back at our data,

00:01:12.035 --> 00:01:15.680
we can see that it starts
on January 1st, 2011,

00:01:15.680 --> 00:01:22.520
and we have data from every single
day until December 31st, 2012.

00:01:22.520 --> 00:01:26.240
So by choosing August 31st, 2012,

00:01:26.240 --> 00:01:28.220
we're choosing 75 percent of

00:01:28.220 --> 00:01:30.625
our data to be used
as our training set.

00:01:30.625 --> 00:01:33.585
Basically, we just want to
take the other 25 percent,

00:01:33.585 --> 00:01:35.925
and save that for our testing set.

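NOTE
A quick way to check what share of the days falls in the training range.
This is only a sketch, assuming exactly one row per calendar day with no gaps.
The names all_days, cutoff, train_days and train_share are illustrative and not from the video.
Under that assumption the count is 609 of 731 days, about 83 percent, so the 75 percent figure is best read as a rough target.
```python
import pandas as pd
# Assumption: exactly one row per calendar day, 2011-01-01 through 2012-12-31.
all_days = pd.date_range("2011-01-01", "2012-12-31", freq="D")
cutoff = pd.Timestamp("2012-08-31")
# Count the days on or before the cutoff and divide by the total number of days.
train_days = int((all_days <= cutoff).sum())
train_share = train_days / len(all_days)
print(f"{train_days} of {len(all_days)} days ({train_share:.1%}) fall in the training range")
```
If the real data has missing days, the share should be computed from the row counts instead.
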
00:01:35.925 --> 00:01:39.470
Let's run this cell to verify
that the last five rows of

00:01:39.470 --> 00:01:42.020
this DataFrame are
the last five days of

00:01:42.020 --> 00:01:45.025
August 2012. This looks right.

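NOTE
The training split described above could be sketched in plain pandas like this.
The video does not show the exact call, so this is an assumption and not necessarily the notebook's own code.
The column names date and rentals, and the generated stand-in data, are hypothetical.
```python
import pandas as pd
# Hypothetical stand-in for the course data: one row per day for 2011 and 2012.
data = pd.DataFrame({"date": pd.date_range("2011-01-01", "2012-12-31", freq="D")})
data["rentals"] = range(len(data))  # placeholder values, not real measurements
# Keep every row on or before the cutoff, so the boundary date itself is included.
cutoff = pd.Timestamp("2012-08-31")
train = data[data["date"] <= cutoff]
# The last five rows should be the last five days of August 2012.
print(train.tail())
```
Using <= rather than < is what makes the split inclusive of August 31st.
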
00:01:45.025 --> 00:01:46.685
Now for our testing data,

00:01:46.685 --> 00:01:48.815
we're essentially going
to do the same thing,

00:01:48.815 --> 00:01:52.265
but this time we're going
to start at September 1st,

00:01:52.265 --> 00:01:57.250
2012, and we want to be inclusive
of that boundary as well.

00:01:57.250 --> 00:01:59.735
Notice that we've changed
a couple of things here.

00:01:59.735 --> 00:02:04.620
One is that we want all of the
rows on or after September 1st,

00:02:04.620 --> 00:02:07.190
2012, whereas in our training data,

00:02:07.190 --> 00:02:09.820
we wanted all of the dates before,

00:02:09.820 --> 00:02:11.690
and we're also going to print

00:02:11.690 --> 00:02:14.195
the first five rows of this DataFrame

00:02:14.195 --> 00:02:16.190
rather than the last five to make

00:02:16.190 --> 00:02:18.650
sure that we're getting the
first five days of September.

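NOTE
The testing split can be sketched the same way, again as an assumption about how the filter might be written in plain pandas.
The column names date and rentals, and the generated stand-in data, are hypothetical.
```python
import pandas as pd
# Hypothetical stand-in for the course data: one row per day for 2011 and 2012.
data = pd.DataFrame({"date": pd.date_range("2011-01-01", "2012-12-31", freq="D")})
data["rentals"] = range(len(data))  # placeholder values, not real measurements
# Keep every row on or after the start date, so the boundary date itself is included.
start = pd.Timestamp("2012-09-01")
test = data[data["date"] >= start]
# The first five rows should be the first five days of September 2012.
print(test.head())
```
Because the training filter stops at August 31st and this one starts at September 1st, no row lands in both sets.
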
00:02:18.650 --> 00:02:20.900
This is looking good.
It looks like we've got

00:02:20.900 --> 00:02:24.480
our training data and our
testing data ready to go.

