Hello all, Right now HDFS still relies on simple replication to increase data reliability. It works, but it wastes disk space as well as network and disk bandwidth. For data-intensive applications that need to write large results back to HDFS, this limits the throughput of MapReduce jobs. It is also very energy-inefficient.
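To make the bandwidth point concrete, here is a rough back-of-envelope sketch in Python. It assumes a hypothetical (10,4) Reed-Solomon layout purely for illustration (the actual parameters in HDFS-RAID are configurable), and compares the total bytes a client would have to push into the cluster under 3x replication versus erasure coding applied at write time:

    # Bytes written into the cluster for 3x replication vs. a hypothetical
    # Reed-Solomon (10,4) encoding performed in the write path.

    def replication_overhead(replicas=3):
        # every byte of user data is written 'replicas' times
        return replicas

    def erasure_overhead(data_blocks=10, parity_blocks=4):
        # every 'data_blocks' bytes of user data add 'parity_blocks' bytes of parity
        return (data_blocks + parity_blocks) / float(data_blocks)

    user_data_gb = 100.0
    print("3x replication writes %.0f GB" % (user_data_gb * replication_overhead()))
    print("RS(10,4) writes %.0f GB"      % (user_data_gb * erasure_overhead()))
    # => 300 GB vs 140 GB; if the encoding happens before the data leaves the
    #    client, the saving applies to network and disk bandwidth at write time,
    #    not just to the space eventually occupied on disk.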
Is the community trying to use erasure coding to increase data reliability? I know someone is working on HDFS-RAID, but it only addresses the disk-space problem. In many cases network and disk bandwidth matter more, since they are the factors limiting MapReduce throughput. Has anyone tried to use erasure coding to reduce the amount of data written to HDFS at write time? I know that reducing the number of replicas might hurt read performance, but I still think it is important to reduce the volume of data written to HDFS. Thanks, Da