WEBVTT
00:00:01.490 --> 00:00:05.010
>> So we've seen how after
we've trained a model,
00:00:05.010 --> 00:00:07.475
we can compare our test data
to our predicted data.
00:00:07.475 --> 00:00:09.540
But the first time we did it,
00:00:09.540 --> 00:00:12.240
when we looked at it, we
just saw a list of values,
00:00:12.240 --> 00:00:15.735
and it's not really practical
to visually look at
00:00:15.735 --> 00:00:19.440
all the results from our
test data,
00:00:19.440 --> 00:00:21.060
the actual values
00:00:21.060 --> 00:00:23.010
versus the values
predicted by our model,
00:00:23.010 --> 00:00:24.690
and compare them row by row by row.
00:00:24.690 --> 00:00:27.720
Visually, it's just not a
practical way to look at things.
00:00:27.720 --> 00:00:29.540
So what we'll typically do is,
00:00:29.540 --> 00:00:31.860
we take advantage of the
fact that we're programmers,
00:00:31.860 --> 00:00:35.070
and we use some code to do
calculations to get a sense of
00:00:35.070 --> 00:00:37.140
how accurate and
00:00:37.140 --> 00:00:39.385
effective our model is
in making predictions.
00:00:39.385 --> 00:00:41.330
But it still comes down to comparing
00:00:41.330 --> 00:00:44.315
the predicted values
and the actual values.
00:00:44.315 --> 00:00:46.070
But the difference is,
what we want to do is use
00:00:46.070 --> 00:00:48.685
some calculations to compare the two.
00:00:48.685 --> 00:00:50.840
One of the most common methods we
00:00:50.840 --> 00:00:52.760
use is something called
Mean Squared Error.
00:00:52.760 --> 00:00:54.050
So the Mean Squared Error,
00:00:54.050 --> 00:00:55.520
if you actually look
at the formula for it,
00:00:55.520 --> 00:00:58.910
is just the mean of
the squared differences
00:00:58.910 --> 00:01:01.375
between the actual and predicted values.
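
Written out, the formula described here is commonly given as below. The symbols are assumptions for illustration and are not named in the talk: n is the number of rows in the test data, y_i is the actual value for row i, and \hat{y}_i is the value the model predicted for that row.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```
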
00:01:01.375 --> 00:01:04.385
Now, I could write some
Python code, in fact,
00:01:04.385 --> 00:01:07.250
you learned enough in the
introduction to Python course to
00:01:07.250 --> 00:01:10.460
write a loop that would go
through all the actual values,
00:01:10.460 --> 00:01:11.870
go through all the predicted values,
00:01:11.870 --> 00:01:13.580
subtract one from the other,
00:01:13.580 --> 00:01:15.150
calculate the value squared,
00:01:15.150 --> 00:01:16.710
and perform this calculation.
00:01:16.710 --> 00:01:18.030
You could do that,
00:01:18.030 --> 00:01:20.575
but there's a better way.
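
For reference, the hand-written loop mentioned here might look roughly like the sketch below. The function name `mean_squared_error_loop` and the sample numbers are hypothetical, made up purely for illustration.

```python
# Hypothetical sample data; these values are invented for illustration.
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]


def mean_squared_error_loop(actual_values, predicted_values):
    """Return the average of the squared differences between two lists."""
    if len(actual_values) != len(predicted_values):
        raise ValueError("both lists must have the same length")
    if len(actual_values) == 0:
        raise ValueError("at least one value is required")

    total = 0.0
    # Walk through the actual and predicted values side by side.
    for actual_value, predicted_value in zip(actual_values, predicted_values):
        difference = actual_value - predicted_value  # subtract one from the other
        total += difference ** 2                     # square it and accumulate

    # Divide by the number of rows to get the mean.
    return total / len(actual_values)


mse = mean_squared_error_loop(actual, predicted)
print(mse)
```

It works, but it is code you would have to write and maintain yourself, which is the motivation for reaching for a library instead.
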
00:01:20.575 --> 00:01:23.030
Luckily, there's a whole
bunch of great libraries
00:01:23.030 --> 00:01:25.475
out there that will help you
when you're doing data science.
00:01:25.475 --> 00:01:27.830
The scikit-learn library
has all sorts of
00:01:27.830 --> 00:01:30.604
great functions for
scientific calculations,
00:01:30.604 --> 00:01:33.115
including one called
Mean Squared Error.
00:01:33.115 --> 00:01:37.135
So all I really have to do is
import the scikit-learn library,
00:01:37.135 --> 00:01:39.350
in particular, when you're doing
these types of calculations,
00:01:39.350 --> 00:01:40.790
you'll probably want the metrics,
00:01:40.790 --> 00:01:43.940
and then I just say calculate
the Mean Squared Error of
00:01:43.940 --> 00:01:47.120
my actual results versus
my predicted results,
00:01:47.120 --> 00:01:51.350
and now I can get a sense of the
overall accuracy of my model.
00:01:51.350 --> 00:01:53.705
Generally speaking, just from
a data science perspective,
00:01:53.705 --> 00:01:55.495
a lower value is going to be better.
00:01:55.495 --> 00:01:57.480
Lower error is good.
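
A minimal sketch of that call, assuming scikit-learn is installed. The variable names `y_test` and `y_pred` and the numbers in them are hypothetical stand-ins for your actual test values and your model's predictions.

```python
from sklearn.metrics import mean_squared_error

# Hypothetical values; in practice these come from your test data
# and from the model's predictions on that same data.
y_test = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Mean of the squared differences between actual and predicted values.
mse = mean_squared_error(y_test, y_pred)
print(mse)
```

The result is in squared units of whatever you are predicting, so it is most useful for comparing one model against another on the same data.
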
00:01:57.480 --> 00:02:00.710
Now, sometimes there's a whole bunch
00:02:00.710 --> 00:02:02.825
of different numbers and
metrics you can look at,
00:02:02.825 --> 00:02:04.835
another one is the Root
Mean Squared Error,
00:02:04.835 --> 00:02:06.605
which is just the square root of
00:02:06.605 --> 00:02:09.010
Mean Squared Error, which
we just calculated.
00:02:09.010 --> 00:02:13.325
But older versions of scikit-learn don't have
a function we can use to calculate it directly.
00:02:13.325 --> 00:02:15.365
So there's another library
00:02:15.365 --> 00:02:17.090
you're going to start
exploring and playing
00:02:17.090 --> 00:02:20.905
with when you do data science
as well, which is NumPy.
00:02:20.905 --> 00:02:24.050
So NumPy has
all sorts
00:02:24.050 --> 00:02:27.140
of functions designed for
straight math calculations,
00:02:27.140 --> 00:02:29.840
including one that
calculates the square root.
00:02:29.840 --> 00:02:34.280
So if I have the Mean Squared
Error method in scikit-learn,
00:02:34.280 --> 00:02:36.935
and I have the ability to
calculate square root with NumPy,
00:02:36.935 --> 00:02:41.855
then I can put those together to
get my Root Mean Squared Error.
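
Putting the two together might look like the sketch below, assuming both NumPy and scikit-learn are installed. As before, `y_test`, `y_pred`, and their values are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical values standing in for real test data and predictions.
y_test = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# scikit-learn gives the Mean Squared Error...
mse = mean_squared_error(y_test, y_pred)

# ...and NumPy's square root turns it into the Root Mean Squared Error.
rmse = np.sqrt(mse)
print(rmse)
```

Because of the square root, the result is back in the same units as the values being predicted, which tends to make it easier to interpret than the Mean Squared Error.
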
00:02:41.855 --> 00:02:44.810
So these are the two libraries
that are really going to
00:02:44.810 --> 00:02:46.055
help you when you're looking to
00:02:46.055 --> 00:02:47.810
evaluate the accuracy of your model.
00:02:47.810 --> 00:02:49.370
Different types of
models you're going to
00:02:49.370 --> 00:02:51.020
learn as you explore data science
00:02:51.020 --> 00:02:52.685
are going to have
different metrics that you
00:02:52.685 --> 00:02:55.805
evaluate to check the accuracy.
00:02:55.805 --> 00:02:58.310
But generally speaking, between
NumPy and scikit-learn,
00:02:58.310 --> 00:03:00.290
you should always
have some method out
00:03:00.290 --> 00:03:02.600
there that's going to help you
perform those calculations.
00:03:02.600 --> 00:03:05.765
So NumPy for the basic
mathematical calculations,
00:03:05.765 --> 00:03:07.955
and scikit-learn often has a lot of
00:03:07.955 --> 00:03:10.670
specific methods for predicting
00:03:10.670 --> 00:03:12.395
and measuring the accuracy
of your models.
00:03:12.395 --> 00:03:16.440
So now let's go take a look
at that in some actual code.