Hello all,

Right now HDFS still uses simple replication to increase data
reliability. It works, but it wastes disk space as well as network and
disk bandwidth. For data-intensive applications that need to write
large results to HDFS, this extra I/O limits the throughput of
MapReduce. It is also very energy-inefficient. A rough sketch of the
overhead is below.
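
To make the overhead concrete, here is a back-of-envelope comparison
(in Python) of the bytes a client has to push into the cluster per
block under 3x replication versus a (10, 4) Reed-Solomon layout. The
block size and coding parameters are just assumptions for
illustration, not anything HDFS or HDFS-RAID prescribes:

    # Rough write-amplification comparison; all numbers are assumptions
    # chosen for illustration, not measurements.
    block = 128 * 1024 * 1024        # assume a 128 MB HDFS block

    replication = 3
    replicated_bytes = replication * block                  # 3.0x the user data

    data_units, parity_units = 10, 4                        # assumed RS(10,4) code
    encoded_bytes = block * (data_units + parity_units) / float(data_units)  # 1.4x

    print("replication:  %.1fx written" % (replicated_bytes / float(block)))
    print("erasure code: %.1fx written" % (encoded_bytes / float(block)))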

Is the community looking at erasure codes to increase data
reliability? I know some people are working on HDFS-RAID, but that
only addresses disk space. In many cases, network and disk bandwidth
matter more, since they are the factors limiting the throughput of
MapReduce. Has anyone tried using an erasure code to reduce the amount
of data at the moment it is written to HDFS? I know that reducing the
replication count might hurt read performance, but I still think it is
important to cut down the amount of data written to HDFS. A quick
sketch of that distinction follows.
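
For the write path specifically, what matters is when the encoding
happens: encoding after the fact (the HDFS-RAID approach) versus
encoding while the data is being written. A small sketch of that
difference, again assuming 3x replication and a (10, 4) code purely
for illustration:

    # Sketch of when the bandwidth cost is paid; purely illustrative numbers.
    job_output_gb = 100.0

    # HDFS-RAID style: write fully replicated first, re-encode in the
    # background later, so the write path still moves 3x the data over
    # the network and disks.
    raid_write_gb = 3 * job_output_gb
    raid_steady_state_gb = job_output_gb * 14 / 10.0   # space reclaimed only afterwards

    # Encode-on-write: the client ships data plus parity stripes directly,
    # so both the write traffic and the stored bytes stay near 1.4x.
    inline_write_gb = job_output_gb * 14 / 10.0

    print("HDFS-RAID:       write %.0f GB, settle at %.0f GB"
          % (raid_write_gb, raid_steady_state_gb))
    print("encode-on-write: write %.0f GB" % inline_write_gb)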

Thanks,
Da
