Interactive Spark on Azure

Description

It is hard to believe that it has been almost 2 years since we last had Maxim on our show, but I can tell you we are extremely excited he's back. Maxim is a Senior Program Manager in the Big Data team at Microsoft and he's back to talk about Interactive Spark on Azure.

Maxim begins our discussion by walking us through the process and challenges data scientists face when processing data. He explains that data science is an iterative process, but that data scientists are often less productive than they could be because they spend a lot of time waiting for jobs to complete. One of the big factors, Maxim explains, is the size and cleanliness of the data, both of which contribute to the long wait times.

At the [05:20] mark Maxim shows us how Spark on Azure provides a solution to this problem by limiting the length of each iteration, thus helping you be more productive. Maxim walks us through how that is accomplished. He first introduces us to Apache Spark, and then discusses how Spark on Azure makes data exploration even better.

At the [08:38] mark it's DEMO TIME, where Maxim spends a few minutes showing us how to spin up a Spark HDInsight cluster, then spends the remaining 10 minutes demoing how to use Spark in HDInsight to execute jobs efficiently. I won't give anything away here, so be sure to watch to see Maxim work his Spark magic! Awesome show!

We definitely look forward to having him back!

Tags:

Azure, spark, hadoop

The Discussion

  • Amber

    I am not very familiar with Microsoft Azure, and since this is online learning, I would prefer one-on-one guidance.

  • Luis Simoes

    Tried to use this dataset but the count is about 128k and not even close to the billion...
    Am I doing something wrong?

  • stanleyjohns

    @Luis are you sure you are not counting the sample data set size?

    You may be running select count(*) from taxi_trips_full, but that could be counting the sample data set. Instead, try select count(*) from taxi_raw.

    Hope this helps, or let me know if you found the solution.
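
The check suggested above amounts to comparing row counts across the two tables. Here is a minimal sketch of that comparison. It uses Python's built-in sqlite3 module with tiny in-memory tables as a stand-in, which is an assumption made only so the example runs anywhere; it is not the Spark/HDInsight setup from the video. The table names come from the comment, while the trip_id column, the row_counts helper, and the row counts themselves are hypothetical.

```python
import sqlite3


def row_counts(conn, tables):
    """Return {table_name: row_count} for each table in `tables`."""
    counts = {}
    for name in tables:
        # Table names cannot be bound as query parameters,
        # so only pass trusted, hard-coded names here.
        row = conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()
        counts[name] = row[0]
    return counts


# Stand-in data: tiny in-memory tables, NOT the real taxi data set.
# The trip_id column and the sizes (100 vs 1000) are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxi_raw (trip_id INTEGER)")
conn.execute("CREATE TABLE taxi_trips_full (trip_id INTEGER)")
conn.executemany("INSERT INTO taxi_raw VALUES (?)",
                 [(i,) for i in range(1000)])
conn.executemany("INSERT INTO taxi_trips_full VALUES (?)",
                 [(i,) for i in range(100)])

counts = row_counts(conn, ["taxi_trips_full", "taxi_raw"])
for name, n in counts.items():
    print(f"{name}: {n} rows")
```

If one table reports far fewer rows than the other, the smaller one is probably the sample. On a cluster, the same two count queries would be run through whatever SQL interface the notebook provides.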
