Optimizing Hive in Azure HDInsight
This week's episode of Data Exposed welcomes Rashim Gupta back to the show to talk about how to get better performance from your Hive queries in HDInsight. Rashim is a Principal PM in the Big Data group and discusses five areas of focus that, depending on the workload, need to be tuned and give you the biggest bang for the buck in performance.
Rashim begins with Compression at the [01:50] mark, highlighting the motivation for looking at compression: Hadoop jobs are usually bottlenecked on I/O. Rashim discusses the different compression algorithms and the pros and cons of each.
Partitioning is discussed next at the [04:05] mark, and Rashim goes into detail on why partitioning is important and how it works to improve performance. Rashim also highlights a few best practices for getting the most benefit out of partitioning.
Up next is the ORC File Format at [07:00], and Rashim provides a quick overview of ORC, discusses how the ORC file format differs from other file formats and how it is suited for performance with Hive. We look briefly at the ORC file structure and how it is stored on the file system.
Join Optimization is up next at [09:50] and Rashim spends some good time on how joins work in Hive, and then we look at the different join types that Hive supports, when and how they should be used, and specific settings for each to get the best performance.
We wrap up discussing Reducers at [12:21], and Rashim provides an excellent example of how the number of reducers is calculated and what you can do to improve performance by manually modifying the number of reducers that Hive uses.
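For readers who want a feel for the reducer calculation before watching, here is a minimal Python sketch of the commonly documented estimate: the input size divided by `hive.exec.reducers.bytes.per.reducer`, capped at `hive.exec.reducers.max`. This is not the example from the video. The function name `estimate_reducers` is hypothetical, the default values (256 MB and 1009) are assumptions that vary by Hive version and cluster configuration, and Hive on Tez applies further adjustments that are not modeled here.

```python
def estimate_reducers(input_bytes,
                      bytes_per_reducer=256 * 1024**2,  # assumed value of hive.exec.reducers.bytes.per.reducer
                      max_reducers=1009):               # assumed value of hive.exec.reducers.max
    """Simplified estimate of how many reducers Hive would pick.

    Illustrative helper only, not a Hive API: it computes
    ceil(input_bytes / bytes_per_reducer) and clamps the result
    to the range [1, max_reducers].
    """
    if bytes_per_reducer <= 0 or max_reducers <= 0:
        raise ValueError("bytes_per_reducer and max_reducers must be positive")
    # Integer ceiling division avoids floating-point rounding on large inputs.
    estimate = -(-input_bytes // bytes_per_reducer)
    return max(1, min(max_reducers, estimate))


if __name__ == "__main__":
    ten_gb = 10 * 1024**3
    print(estimate_reducers(ten_gb))                                    # 256 MB per reducer
    print(estimate_reducers(ten_gb, bytes_per_reducer=128 * 1024**2))   # halving the size doubles the count
```

Under these assumptions, lowering `hive.exec.reducers.bytes.per.reducer` raises the reducer count, while setting `mapreduce.job.reduces` to a fixed value bypasses the estimate entirely.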