Setting up SQL Server High-Availability between Windows and Linux with SQL Server 2017
This week's Data Exposed show welcomes back Maxim Lukiyanov to talk more about Spark performance tuning with Spark 2.x. Maxim is a Senior PM on the big data HDInsight team and is in the studio today to present the final part of his 4-part series.
Topics in today's video:
[00:45] - Intro
[02:15] - Advanced Partitioning and Bucketing
[10:30] - Advanced Joins: Joining Large Tables
[19:00] - Debugging and Recap
Spark 2.2 rc4 on Azure HDInsight: Script action https://github.com/hdinsight/script-actions/tree/master/install-spark2-2
The install link for the Spark 2.2 rc4 on Azure HDInsight GitHub repo is wrong. The correct link is https://github.com/hdinsight/script-actions/tree/master/install-spark2-2
@runamuk00: Thank you. The link is fixed.
Thank you for this great series! It would be great to see how bucketing is coded in PySpark. Do you have some examples?
@Juanita: Hi, sure. The Spark docs have a good set of examples. Click on the Python tab and you will see them in Python. The first example shows a good basic use case; the third example shows partitioning and bucketing used together.
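For readers who want something concrete alongside the docs, here is a minimal sketch of the two cases mentioned above: plain bucketing, and partitioning combined with bucketing. It is not taken from the video. The helper name `write_bucketed`, the table names, the column names, and the bucket count are all hypothetical, and it assumes a PySpark version whose `DataFrameWriter` exposes `bucketBy` (2.3 or later, to my knowledge).

```python
def write_bucketed(df, table_name, num_buckets, bucket_col,
                   sort_col=None, partition_col=None):
    """Save `df` as a bucketed table (hypothetical helper, not from the video).

    Bucketed output has to be written with saveAsTable, because the
    bucketing metadata is stored in the table catalog.
    """
    writer = df.write
    if partition_col is not None:
        # One directory per distinct value of the partition column.
        writer = writer.partitionBy(partition_col)
    # Rows are hashed on bucket_col into a fixed number of buckets.
    writer = writer.bucketBy(num_buckets, bucket_col)
    if sort_col is not None:
        # Optionally keep rows sorted inside each bucket.
        writer = writer.sortBy(sort_col)
    writer.mode("overwrite").saveAsTable(table_name)


def demo():
    """Run both cases against a local Spark session (requires pyspark)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bucketing-sketch").getOrCreate()
    # Made-up sample data, for illustration only.
    people = spark.createDataFrame(
        [("Alice", 34, "red"), ("Bob", 45, "blue"), ("Carol", 29, "red")],
        ["name", "age", "favorite_color"],
    )
    # Bucketing only: 4 buckets on "name", sorted by "age" within a bucket.
    write_bucketed(people, "people_bucketed", 4, "name", sort_col="age")
    # Partitioning and bucketing together.
    write_bucketed(people, "people_partitioned_bucketed", 4, "name",
                   partition_col="favorite_color")
    spark.stop()
```

Calling `demo()` in an environment with PySpark installed writes both tables to the session's default warehouse location. The bucket count of 4 is arbitrary here; a suitable value depends on the data size and on the tables you plan to join.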