On 3/20/12 5:37 PM, "Eli Collins" <e...@cloudera.com> wrote:
>Append introduces non-trivial design and code complexity, which is not
>worth the cost if we don't have real users. Removing append means we
>have the property that HDFS blocks, when finalized, are immutable.
>This significantly simplifies the design and code, which significantly
>simplifies the implementation of other features like snapshots,
>HDFS-level caching, dedupe, etc.

The above is related to the critical design flaw in HDFS that makes it more
complicated than necessary. Immutable files on a node can be combined with
append, using copy-on-write semantics, if the blocks are small enough. But
small blocks are not going to work with this flaw.

The flaw is the definition of a block. It is conflated, being two things at
once:

# An immutable segment of data that the file system tracks.
# The segment of data that is contiguous on an individual data node.

The first, in any sane file system, is a constant length. The second need
not be. File systems like Ext4 and XFS use extents to map ranges of blocks
to contiguous regions on disk. They then need only track these extents
rather than the fine-grained detail of each block. The equivalent of a
block report is then an extent report.

HDFS does not have extents, and this causes extreme pressure to have large
blocks for two well-known reasons: it reduces filesystem state data, and it
gives Mappers larger batches of data. With extents, both of these pressures
apply to extent sizes instead of block sizes. Blocks can be small and
extents large. Blocks can be immutable, with copy-on-write used for append,
truncate, and even random write. (A rough sketch of this split is at the
end of this message.)

Others have already implemented the above in other distributed file
systems, but when it was mentioned here in the past it seemed to be ignored
or misunderstood:

http://mail-archives.apache.org/mod_mbox/hadoop-general/201110.mbox/%3C1318437111.16477.228.camel@thinkpad%3E

The response to that thread was disappointing -- the extent concept did not
seem to be understood, and none of the good ideas from the links provided
got discussed.

I personally NEED append for some of my work and had been planning on using
it in 0.23. However, I recognize that, even more than that, I can't risk
losing data in my append use case. If append is too hard and complicated to
bolt on to HDFS, perhaps a bigger re-think is required, so that such
features are not so complicated and fit the design more naturally.
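
To make the block/extent split concrete, here is a rough, hypothetical
sketch in Java. None of these classes exist in HDFS; the names
(LogicalBlock, Extent, ExtentFile) and the 1 MiB block size are invented
for illustration. It only shows the idea: the namespace tracks small,
fixed-size, immutable blocks; a data node maps contiguous runs of them to
extents; and append copy-on-writes a new tail block instead of mutating a
finalized one.

// Hypothetical sketch only -- not real HDFS code or APIs.

import java.util.ArrayList;
import java.util.List;

final class LogicalBlock {
    static final int BLOCK_SIZE = 1 << 20;   // small, constant block length (1 MiB, made up)
    final long id;                           // what the namespace would track
    final byte[] data;                       // never modified after construction

    LogicalBlock(long id, byte[] data) {
        this.id = id;
        this.data = data.clone();            // defensive copy: the block is immutable
    }
}

final class Extent {
    final long firstBlockId;                 // first logical block in this contiguous run
    final int blockCount;                    // how many consecutive blocks the run covers
    final long diskOffset;                   // where the run starts on the data node's disk

    Extent(long firstBlockId, int blockCount, long diskOffset) {
        this.firstBlockId = firstBlockId;
        this.blockCount = blockCount;
        this.diskOffset = diskOffset;
    }
}

final class ExtentFile {
    private final List<LogicalBlock> blocks = new ArrayList<>();
    private final List<Extent> extents = new ArrayList<>();
    private long nextBlockId;
    private long nextDiskOffset;

    // Copy-on-write append: the finalized tail block is never modified.
    // Its bytes are merged with the new bytes into fresh blocks; the old
    // block, and the extent entry covering it, would be trimmed and
    // reclaimed later (reclamation elided in this sketch).
    void append(byte[] bytes) {
        byte[] tailBytes = new byte[0];
        if (!blocks.isEmpty()
                && blocks.get(blocks.size() - 1).data.length < LogicalBlock.BLOCK_SIZE) {
            tailBytes = blocks.remove(blocks.size() - 1).data;   // unmap, don't mutate
        }
        byte[] merged = new byte[tailBytes.length + bytes.length];
        System.arraycopy(tailBytes, 0, merged, 0, tailBytes.length);
        System.arraycopy(bytes, 0, merged, tailBytes.length, bytes.length);

        long firstNewBlock = nextBlockId;
        int newBlocks = 0;
        for (int off = 0; off < merged.length; off += LogicalBlock.BLOCK_SIZE) {
            int len = Math.min(LogicalBlock.BLOCK_SIZE, merged.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(merged, off, chunk, 0, len);
            blocks.add(new LogicalBlock(nextBlockId++, chunk));
            newBlocks++;
        }

        // One extent describes the whole contiguous run, so the "extent
        // report" grows with the number of appends, not the number of blocks.
        extents.add(new Extent(firstNewBlock, newBlocks, nextDiskOffset));
        nextDiskOffset += merged.length;
    }

    int blockCount()  { return blocks.size(); }
    int extentCount() { return extents.size(); }
}

The point of the sketch is that appending a few bytes creates one small
block and one extent entry, so metadata volume tracks the number of
extents rather than the number of blocks, and block size stops being the
knob that controls namespace state.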