Setting up and Getting Started with Power BI Embedded
This week's Data Exposed show welcomes back Maxim Lukiyanov to talk more about Spark performance tuning with Spark 2.x. Maxim is a Senior PM on the big data HDInsight team and is in the studio today to present Part 2 of his 4-part series.
Topics in today's video:
[01:40] - DataSets vs. DataFrames vs. RDDs
[10:45] - Garbage Collection Overhead and Executor Size
[18:20] - Data Formats
[22:35] - Data Partitioning
[26:25] - Caching
Be sure to follow the Data Exposed show on Twitter at @DataExposed!
If I recall correctly, DataFrames were introduced in Spark 1.3 as a "preview", but they are definitely there in "production" mode in 1.6.
Also, Spark is not all that great at handling Hive tables with hundreds or thousands of partitions, though it presumably got better with the Spark 2.1 release.
Hi Maxim,
Just to clarify: the caching segment (around 30 mins) is about techniques for caching data prior to processing, as opposed to RDD caching (aka .persist()), correct?
The reason for asking is that I've directed some of my colleagues here and I don't want there to be any confusion such as "Errr... I thought Spark had an inbuilt caching feature, why is Maxim telling me not to use it?" I think it's important to clarify that you're talking about a different kind of caching here.
@jamiet: Hi Jamiet, .persist(), .cache() and "CACHE TABLE foo" are different ways to use the same native Spark caching mechanism. Native caching in Spark can be used and is effective, especially in ETL pipelines where you need to cache intermediate results. But you need to keep in mind that native caching doesn't work well with partitioned tables. Therefore, a more generic and reliable caching technique is storage-layer caching.