Tech Off Post

View Thread: Impact of changing block size in Hadoop HDFS:
  • Sven Groot

    There are a number of things this impacts. Most obviously, a file will have fewer blocks if the block size is larger. This can make it possible for a client to read or write more data without interacting with the Namenode, and it also reduces the size of the Namenode's metadata, which lowers Namenode load (an important consideration for extremely large file systems).
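    As a rough illustration (the 10 GB file size here is made up, not from the post), the block count, and with it the per-file metadata the Namenode must track, shrinks in proportion to the block size:

```python
import math

def num_blocks(file_size_bytes, block_size_bytes):
    # An HDFS file occupies ceil(size / block size) blocks;
    # the Namenode keeps metadata for every one of them.
    return math.ceil(file_size_bytes / block_size_bytes)

file_size = 10 * 1024**3  # hypothetical 10 GB file
for mb in (64, 128, 256):
    blocks = num_blocks(file_size, mb * 1024**2)
    print(f"{mb} MB blocks -> {blocks} blocks of Namenode metadata")
```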

    With fewer blocks, the file may potentially be stored on fewer nodes in total; this can reduce total throughput for parallel access, and make it more difficult for the MapReduce scheduler to schedule data-local tasks.
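    A simplified model of that effect (the replication factor and cluster size are assumptions for illustration, not from the post): the number of distinct datanodes that can hold any part of the file is capped at blocks × replication, so a file made of a few large blocks lands on only a few machines, limiting how many nodes can serve parallel reads or data-local tasks:

```python
def max_datanodes_for_file(blocks, replication=3, cluster_nodes=40):
    # Upper bound on distinct datanodes storing some part of the file;
    # parallel readers (or data-local map tasks) cannot use more nodes.
    return min(cluster_nodes, blocks * replication)

print(max_datanodes_for_file(blocks=160))  # many small blocks: whole cluster usable
print(max_datanodes_for_file(blocks=4))    # few large blocks: at most 12 nodes
```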

    When using such a file as input for MapReduce (and not constraining the maximum split size to be smaller than the block size), it will reduce the number of tasks, which can decrease overhead. But having fewer, longer tasks also means you may not achieve maximum parallelism (if there are fewer tasks than your cluster can run simultaneously), it increases the chance of stragglers, and if a task fails, more work needs to be redone. Increasing the amount of data processed per task can also cause additional read/write operations (for example, if a map task goes from having a single spill to having multiple spills that need to be merged at the end).
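    To make the parallelism point concrete, assuming one map task per input split and a split size equal to the block size (the file size and slot count below are hypothetical):

```python
import math

def map_tasks(file_size_bytes, split_size_bytes):
    # With default settings MapReduce creates one map task per input split.
    return math.ceil(file_size_bytes / split_size_bytes)

file_size = 10 * 1024**3   # hypothetical 10 GB input
slots = 100                # hypothetical simultaneous task capacity
for mb in (64, 256):
    tasks = map_tasks(file_size, mb * 1024**2)
    busy = min(tasks, slots)
    print(f"{mb} MB splits: {tasks} tasks, {busy}/{slots} slots busy in the first wave")
```

    With 64 MB splits the cluster stays fully busy; with 256 MB splits only 40 of the 100 slots ever have work, so the job cannot use the whole cluster.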

    Usually, it depends on the input data. If you want to maximize throughput for a very large input file, using very large blocks (128 MB or even 256 MB) is best. For smaller files, a smaller block size is better. Note that you can have files with different block sizes on the same file system by setting the dfs.block.size parameter when the file is written, e.g. when uploading using the command-line tools (note that generic options such as -D must come before the command's own arguments): "bin/hadoop fs -D dfs.block.size=33554432 -put localpath dfspath"
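    The value passed to dfs.block.size is a plain byte count, so a small helper (just illustrative arithmetic) shows where 33554432 comes from:

```python
def mb_to_block_size(mb):
    # dfs.block.size is specified in bytes.
    return mb * 1024 * 1024

print(mb_to_block_size(32))  # 33554432, i.e. a 32 MB block size
```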

    The default block size if none is specified in hdfs-site.xml is 64MB.