On 16/10/11 02:53, Bharath Ravi wrote:
> Hi all,
> I have a question about how HDFS load-balances requests for files/blocks:
> HDFS currently distributes data blocks randomly, for balance.
> However, if certain files/blocks are more popular than others, some nodes
> might get an "unfair" number of requests.
> Adding more replicas for these popular files might not help, unless HDFS
> explicitly distributes requests fairly among the replicas.
Have a look at the ReplicationTargetChooser class; it does take datanode
load into account, though its concern is distribution for data
availability, not read performance.
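For what it's worth, that load check is switchable; in the versions I've
looked at the flag is dfs.replication.considerLoad in hdfs-default.xml
(treat the exact property name as an assumption for your release; newer
ones rename it dfs.namenode.replication.considerLoad). A minimal sketch:

  import org.apache.hadoop.conf.Configuration;

  public class ConsiderLoadFlag {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Assumption: property name as in 0.20/1.x-era hdfs-default.xml.
      conf.setBoolean("dfs.replication.considerLoad", true);
      System.out.println("considerLoad = "
          + conf.getBoolean("dfs.replication.considerLoad", false));
    }
  }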
The standard technique for popular files (including MR job JAR files) is
to over-replicate them. One problem: how to determine what is popular
without adding more load to the namenode.
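In case it helps, here's a minimal sketch of over-replicating a hot file
through the FileSystem API; the path and replica count below are
placeholders, not recommendations:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class OverReplicate {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // Placeholder: any file that is getting hammered with reads.
      Path hot = new Path("/jobs/popular-job.jar");
      // Raise the replication factor well above the default (3) so
      // reads can spread across more datanodes.
      boolean scheduled = fs.setReplication(hot, (short) 10);
      System.out.println("re-replication scheduled: " + scheduled);
      fs.close();
    }
  }

The shell equivalent is "hadoop fs -setrep 10 /path/to/file"; either way
the namenode schedules the extra copies asynchronously.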