On 3/20/12 5:37 PM, "Eli Collins" <e...@cloudera.com> wrote:

>Append introduces non-trivial design and code complexity, which is not
>worth the cost if we don't have real users. Removing append means we
>have the property that HDFS blocks, when finalized, are immutable.
>This significantly simplifies the design and code, which significantly
>simplifies the implementation of other features like snapshots,
>HDFS-level caching, dedupe, etc.

The above is related to a critical design flaw in HDFS that makes it more
complicated than necessary.

Immutable files on a node can be combined with append with copy-on-write
semantics if the blocks are small enough.  But small blocks are not going
to work with this flaw.

This flaw is the definition of a block.  It conflates two things at
once:
# An immutable segment of data that the file system tracks.
# The segment of data that is contiguous on an individual data node.

The first, in any sane file system, is a constant length.
The second need not be.  File systems like Ext4 and XFS use extents to map
ranges of blocks to contiguous regions on disk.  Then, they need only
track these extents rather than all the fine-grained detail of each block.
The equivalent of a block report is then an extent report.
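
To make the extent idea concrete, here is a minimal sketch of how a data
node could collapse the small blocks it holds into an extent report.  It is
not HDFS code: the function name blocks_to_extents and the (start, count)
tuple layout are assumptions made for illustration, and blocks are
identified by their index within a file.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: (first_block_index, block_count).
Extent = Tuple[int, int]


def blocks_to_extents(block_indices: Iterable[int]) -> List[Extent]:
    """Collapse runs of consecutive block indices into extents."""
    extents: List[Extent] = []
    for idx in sorted(set(block_indices)):
        if extents and extents[-1][0] + extents[-1][1] == idx:
            # This block directly follows the last extent, so grow it.
            start, count = extents[-1]
            extents[-1] = (start, count + 1)
        else:
            # Gap in the sequence: start a new extent of one block.
            extents.append((idx, 1))
    return extents


# Seven blocks held by a node are reported as three extents.
report = blocks_to_extents([0, 1, 2, 3, 7, 8, 20])
```

The size of the report grows with the number of contiguous runs, not with
the number of blocks, which is why the state-size pressure moves from
blocks to extents.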

HDFS does not have extents, and this causes extreme pressure to have large
blocks for two well-known reasons:  reduction in filesystem state data,
and larger data batches for Mappers.
With extents, both of these pressures apply to extent sizes instead of
block sizes.  Blocks can be small, extents larger.  Blocks can be
immutable with copy-on-write for appends, truncate, and even random write.
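
A rough sketch of copy-on-write append over small immutable blocks, under
stated assumptions: BlockStore, append, read and the 4-byte BLOCK_SIZE are
all hypothetical names and values chosen for illustration, not anything
from HDFS.  A file version is just a tuple of block ids; append never
modifies a stored block, it rewrites only the partial tail block into new
blocks.

```python
from typing import Dict, Tuple

BLOCK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


class BlockStore:
    """Write-once store: a block is never changed after put()."""

    def __init__(self) -> None:
        self._blocks: Dict[int, bytes] = {}
        self._next_id = 0

    def put(self, data: bytes) -> int:
        assert 0 < len(data) <= BLOCK_SIZE
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = bytes(data)
        return block_id

    def get(self, block_id: int) -> bytes:
        return self._blocks[block_id]


def append(store: BlockStore, file_blocks: Tuple[int, ...],
           data: bytes) -> Tuple[int, ...]:
    """Return a new block list; the old list stays valid and unchanged."""
    if not data:
        return file_blocks
    blocks = list(file_blocks)
    if blocks and len(store.get(blocks[-1])) < BLOCK_SIZE:
        # Copy-on-write: fold the partial tail block into the new data
        # instead of mutating it in place.
        data = store.get(blocks.pop()) + data
    for off in range(0, len(data), BLOCK_SIZE):
        blocks.append(store.put(data[off:off + BLOCK_SIZE]))
    return tuple(blocks)


def read(store: BlockStore, file_blocks: Tuple[int, ...]) -> bytes:
    return b"".join(store.get(b) for b in file_blocks)


store = BlockStore()
v1 = append(store, (), b"abcdef")   # two blocks: "abcd", "ef"
v2 = append(store, v1, b"ghi")      # tail "ef" is copied, not modified
```

Because v1 still reads back its original contents after the append, an old
block list behaves like a snapshot for free, and the cost of the copy is
bounded by one small block.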

Others have already implemented the above in other distributed file
systems.  But when mentioned here in the past it seemed to be ignored or
misunderstood:

http://mail-archives.apache.org/mod_mbox/hadoop-general/201110.mbox/%3C1318437111.16477.228.camel@thinkpad%3E
The response to that was disappointing -- the extent concept did not seem
to be comprehended, and none of the good ideas from the links provided got
discussed.


I personally NEED append for some of my work and had been planning on
using it in 0.23.  However, I recognize that, even more than I need
append, I can't risk losing data in my append use case.  If append is too
hard and complicated to bolt on to HDFS, perhaps a bigger re-think is
required so that such features are less complicated and a better natural
fit to the design.

