WEBVTT

00:00:02.000 --> 00:00:05.040
>> Hi everyone, and welcome back to

00:00:05.040 --> 00:00:07.890
Developer's Introduction
to Data Science.

00:00:07.890 --> 00:00:10.560
Data science, machine learning, and AI

00:00:10.560 --> 00:00:14.250
are all critical, but how can
you get started with them?

00:00:14.250 --> 00:00:16.110
In this video, we are going to learn

00:00:16.110 --> 00:00:18.660
what the data science lifecycle is.

00:00:18.660 --> 00:00:22.560
The lifecycle is designed for
data science projects that are

00:00:22.560 --> 00:00:27.480
intended to ship as part of
your intelligent applications.

00:00:27.480 --> 00:00:30.150
The data science lifecycle is

00:00:30.150 --> 00:00:32.850
composed of five major
stages:

00:00:32.850 --> 00:00:36.540
business understanding, data
acquisition and understanding,

00:00:36.540 --> 00:00:40.110
modeling, deployment,
and customer acceptance.

00:00:40.110 --> 00:00:43.245
Let's start with
business understanding.

00:00:43.245 --> 00:00:46.035
Here there are two main goals.

00:00:46.035 --> 00:00:49.310
The first one is about
defining the objectives.

00:00:49.310 --> 00:00:50.570
You need to work with

00:00:50.570 --> 00:00:53.540
your customers and other
stakeholders to understand and

00:00:53.540 --> 00:00:55.625
identify the business problems.

00:00:55.625 --> 00:00:59.610
The second goal is about
identifying data sources.

00:00:59.610 --> 00:01:02.600
You need to find the relevant
data that helps you answer

00:01:02.600 --> 00:01:04.310
the question that defines

00:01:04.310 --> 00:01:07.310
the objective of your
data science project.

00:01:07.310 --> 00:01:11.030
After this, we have data
acquisition and understanding.

00:01:11.030 --> 00:01:15.800
The goals here are to produce
a clean, high-quality dataset,

00:01:15.800 --> 00:01:18.425
and to develop a
solution architecture of

00:01:18.425 --> 00:01:22.375
the data pipeline that
refreshes and scores your data.

00:01:22.375 --> 00:01:24.090
There are three main steps,

00:01:24.090 --> 00:01:26.280
as you can see. Ingest the data.

00:01:26.280 --> 00:01:28.310
Here you need to
ingest your data into

00:01:28.310 --> 00:01:31.535
the target analytic environment
that you're going to use,

00:01:31.535 --> 00:01:34.460
then you need to explore
the data to determine if

00:01:34.460 --> 00:01:37.340
the data quality is good
enough to answer the question,

00:01:37.340 --> 00:01:39.230
and finally, you need to set up

00:01:39.230 --> 00:01:43.055
a data pipeline to score
new and refreshed data.

00:01:43.055 --> 00:01:46.275
After this, there is
the modeling stage.

00:01:46.275 --> 00:01:50.250
Here the main goals are
feature engineering:

00:01:50.250 --> 00:01:52.550
you need to create the
data features from

00:01:52.550 --> 00:01:55.250
the raw data to facilitate
the model training.

00:01:55.250 --> 00:01:58.190
Model training: you need
to find the model that

00:01:58.190 --> 00:02:01.390
answers the question in
a very accurate way,

00:02:01.390 --> 00:02:03.395
and also you need to compare

00:02:03.395 --> 00:02:05.780
different success metrics in order to

00:02:05.780 --> 00:02:08.420
understand what's the best
model for your solution,

00:02:08.420 --> 00:02:10.520
and finally, you need to determine if

00:02:10.520 --> 00:02:13.190
your model is suitable
for production,

00:02:13.190 --> 00:02:15.950
and is ready to be deployed.

00:02:15.950 --> 00:02:19.470
Next, there is deployment.

00:02:19.470 --> 00:02:22.880
Here we need to deploy the
model and the pipeline to

00:02:22.880 --> 00:02:26.360
a production environment for
application consumption.

00:02:26.360 --> 00:02:27.860
To deploy your models,

00:02:27.860 --> 00:02:31.775
you need to expose them
with an open API interface.

00:02:31.775 --> 00:02:34.505
The interface enables the model to be

00:02:34.505 --> 00:02:37.670
easily consumed from different
types of applications.

00:02:37.670 --> 00:02:41.250
Some examples of these
applications are online websites,

00:02:41.250 --> 00:02:45.425
spreadsheets, dashboards,
and back-end applications.

00:02:45.425 --> 00:02:50.405
After this, you need to finalize
your project deliverables.

00:02:50.405 --> 00:02:53.480
You need to confirm that
the pipeline, the model,

00:02:53.480 --> 00:02:56.925
and their deployment in a
production environment satisfy,

00:02:56.925 --> 00:03:01.345
of course, your customers'
or stakeholders' objectives.

00:03:01.345 --> 00:03:03.800
You can learn more about

00:03:03.800 --> 00:03:08.610
the data science lifecycle at
aka.ms/datasciencelifecycle.

