Spark Performance Tuning - Part 2

Description

This week's Data Exposed show welcomes back Maxim Lukiyanov to talk more about Spark performance tuning with Spark 2.x. Maxim is a Senior PM on the HDInsight big data team and is in the studio today to present Part 2 of his four-part series.

Topics in today's video:

[01:40] - DataSets vs. DataFrames vs. RDDs

[10:45] - Garbage Collection Overhead and Executor Size

[18:20] - Data Formats  

[22:35] - Data Partitioning

[26:25] - Caching

Be sure to follow the Data Exposed show on Twitter at @DataExposed!

The Discussion

  • sokhaty

    If I recall correctly, DataFrames were introduced in Spark 1.3 as a "preview", but they are definitely there in "production" mode in 1.6.

    Also, Spark is not all that great at handling Hive tables with hundreds or thousands of partitions, though it presumably got better with the Spark 2.1 release.

  • jamiet

    Hi Maxim,

    Just to clarify: the discussion of caching (around the 30-minute mark) is about techniques for caching data prior to processing, as opposed to RDD caching (aka .persist()), correct?

    The reason for asking is that I've directed some of my colleagues here and I don't want there to be any confusion such as "errr...I thought Spark had an inbuilt caching feature, why is Maxim telling me not to use it?" I think it's important to clarify that you're talking about a different kind of caching here.

  • maxluk

    @jamiet: Hi Jamiet, .persist(), .cache() and "CACHE TABLE foo" are different ways to invoke the same native Spark caching mechanism. Native caching in Spark can be used and is effective, especially in ETL pipelines where you need to cache intermediate results. But keep in mind that native caching doesn't work well with partitioned tables, so a more generic and reliable technique is storage-layer caching.
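
    For readers following along, here is a minimal Scala sketch of those three equivalent invocations (the table name "foo" and the session setup are assumptions for illustration, not from the episode):

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.storage.StorageLevel

        val spark = SparkSession.builder().appName("caching-sketch").getOrCreate()

        // Assumes a table named "foo" is already registered in the catalog.
        val df = spark.table("foo")

        // All three of the following go through the same native Spark cache:
        df.cache()                               // default level (MEMORY_AND_DISK for Datasets in Spark 2.x)
        // df.persist(StorageLevel.MEMORY_ONLY)  // same mechanism, explicit storage level
        // spark.sql("CACHE TABLE foo")          // SQL form; eager in Spark 2.x

        df.count()      // the first action materializes the cache
        df.unpersist()  // release it once the intermediate result is no longer needed

    As noted above, this native cache is a poor fit for partitioned tables, which is why storage-layer caching is the more generic option.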
