As someone who has worked with hdfs-compatible distributed file systems that support append, I can vouch for its extensive usage.
I have seen how simple it becomes to create tar archives, and later append files to them, without writing special inefficient code to do so. I have seen it used in archiving cold data, reducing MR task launch overhead without having to use a different input format, so that the same code can be used for both hot and cold data. In addition, the small-files problem in HDFS forces people to write MR code, and causes rewrite of large datasets even if a small amount of data is added to it.

So, there is clearly a need for it, AFAIK.

+1 on fixing it. Please let me know if you need help.

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and do not necessarily represent the views of any organization, past or present, the author might be affiliated with.)

On 3/21/12 5:36 AM, "Dave Shine" <dave.sh...@channelintelligence.com> wrote:

>I am not a contributor to this project, so I don't know how much weight
>my opinion carries. But I have been hoping to see append become stable
>soon. We are constantly dealing with the "small file problem", and I
>have written M/R jobs to periodically roll up lots of small files into a
>few small ones. Having append would prevent me from needing to use up
>cluster resources performing these tasks.
>
>Therefore, all things being equal I +1 making append work. However, if
>the level of complexity is as bad as Eli implies below, then I can
>understand that perhaps it is not worth the effort. If it will cause too
>much technical debt, then removing it makes sense. But don't just remove
>it because you don't believe there is a need for it.
>
>Thanks,
>Dave Shine
>
>
>-----Original Message-----
>From: Eli Collins [mailto:e...@cloudera.com]
>Sent: Tuesday, March 20, 2012 8:38 PM
>To: hdfs-dev@hadoop.apache.org
>Subject: [DISCUSS] Remove append?
>
>Hey gang,
>
>I'd like to get people's thoughts on the following proposal. I think we
>should consider removing append from HDFS.
>
>Where we are today.. append was added in the 0.17-19 releases
>(HADOOP-1700) and subsequently disabled (HADOOP-5224) due to quality
>issues. It and sync were re-designed, re-implemented, and shipped in
>21.0 (HDFS-265). To my knowledge, there has been no real production use.
>Anecdotally people who worked on branch-20-append have told me they think
>the new trunk code is substantially less well-tested than the
>branch-20-append code (at least for sync, append was never well tested).
>It has certainly gotten way less pounding from HBase users.
>The design however, is much improved, and people think we can get hsync
>(and append) stabilized in trunk (mostly testing and bug fixing).
>
>Rationale follows..
>
>Append does not seem to be an important requirement, hflush was. There
>has not been much demand for append, from users or downstream projects.
>Because Hadoop 1.x does not have a working append implementation (see
>HDFS-3120, the branch-20-append work was focused on sync not getting
>append working) which is not enabled by default and downstream projects
>will want to support Hadoop 1.x releases for years, most will not
>introduce dependencies on append anyway. This is not to say demand does
>not exist, just that if it does, it's been much smaller than security,
>sync, HA, backwards compatible RPC, etc. This probably explains why, over
>5 years after the original implementation started, we don't have a stable
>release with append.
>
>Append introduces non-trivial design and code complexity, which is not
>worth the cost if we don't have real users. Removing append means we have
>the property that HDFS blocks, when finalized, are immutable.
>This significantly simplifies the design and code, which significantly
>simplifies the implementation of other features like snapshots,
>HDFS-level caching, dedupe, etc.
>
>The vast majority of the HDFS-265 effort is still leveraged w/o append.
>The new data durability and read consistency behavior was the key part.
>
>GFS, which HDFS' design is based on, has append (and atomic record
>append) so obviously a workable design does not preclude append.
>However we also should not ape the GFS feature set simply because it
>exists. I've had conversations with people who worked on GFS that regret
>adding record append (see also
>http://queue.acm.org/detail.cfm?id=1594206). In short, unless append is a
>real priority for our users I think we should focus our energy elsewhere.
>
>Thanks,
>Eli
>
>The information contained in this email message is considered
>confidential and proprietary to the sender and is intended solely for
>review and use by the named recipient. Any unauthorized review, use or
>distribution is strictly prohibited. If you have received this message in
>error, please advise the sender by reply email and delete the message.
>