Spark Performance Tuning - Part 2


The Discussion

  • If I recall correctly, DataFrames were introduced in Spark 1.3 as a "preview", but they are definitely there in "production" mode in 1.6.

    Also, Spark is not all that great at handling Hive tables with hundreds and thousands of partitions, though it presumably got better with the Spark 2.1 release.

  • Hi Maxim,

    Just to clarify: the part of the talk about caching (around the 30-minute mark) covers techniques for caching data prior to processing, as opposed to RDD caching (aka .persist()), correct?

    The reason I ask is that I've directed some of my colleagues here, and I don't want there to be any confusion such as "errr... I thought Spark had an inbuilt caching feature, why is Maxim telling me not to use it?" I think it's important to clarify that you're talking about a different kind of caching here.

  • @jamiet: Hi Jamiet, .persist(), .cache() and "CACHE TABLE foo" are different ways to use the same native Spark caching mechanism. Native caching in Spark can be used and is effective, especially in ETL pipelines where you need to cache intermediate results. But keep in mind that native caching doesn't work well with partitioned tables. Therefore, a more generic and reliable caching technique is storage-layer caching.
