Introducing the Database Experimentation Assistant
It is hard to believe that it has been almost two years since we last had Maxim on our show, and we are extremely excited to have him back. Maxim is a Senior Program Manager on the Big Data team at Microsoft, and he joins us to talk about Interactive Spark on Azure.
Maxim begins our discussion by walking us through the process data scientists follow, and the challenges they face, when processing data. He explains that data science is an iterative process, but that data scientists are typically less productive than they could be because they spend a lot of time waiting for jobs to complete. Maxim explains that the size and cleanliness of the data are big factors in those long wait times.
At the [05:20] mark Maxim shows us how Spark on Azure provides a solution to this problem by limiting the length of each iteration, thus helping you be more productive. Maxim walks us through how that is accomplished. He first introduces us to Apache Spark, and then discusses how Spark on Azure makes data exploration even better.
At the [08:38] mark it's DEMO TIME: Maxim spends a few minutes showing us how to spin up a Spark HDInsight cluster, then spends the remaining 10 minutes demoing how to use Spark in HDInsight to execute jobs efficiently. I won't give anything away here, so be sure to watch to see Maxim work his Spark magic! Awesome show!
We definitely look forward to having him back!
I am not very familiar with Microsoft Azure, since learning it means learning online, and I prefer one-on-one guidance.
Tried to use this dataset but the count is about 128k and not even close to the billion...
Am I doing something wrong?
@Luis are you sure you are not counting the sample data set size?
You may be running select count(*) from taxi_trips_full, but that could be counting the sample data set. Instead, try select count(*) from taxi_raw.
Hope this helps, or let me know if you found the solution.