Answers inline.

On 3/21/12 10:32 AM, "Eli Collins" <e...@cloudera.com> wrote:
>Why not just write new files and use Har files, because Har files are a
>pita?

Yes, and har creation is an MR job, which is totally I/O bound, and yet takes up slots/containers, reducing cluster utilization.

>Can you elaborate on the 1st one, how it's especially helpful for
>archival?

Say you have daily log files (consider many small job history files). Instead of keeping them as separate files, one appends them to a monthly file (this in itself is a complete rewrite), but appending monthly files to a year-to-date file should not require a rewrite (because after March, it becomes very inefficient). Reducing the number of files this way also makes it easy to copy them, take snapshots, etc., without having to write special parallel code to do it.

>I assume the 2nd one refers to not having to Multi*InputFormat. And
>the 3rd refers to appending to an old file instead of creating a new
>one.

Yes.

>> In addition, the small-files problem in HDFS forces people to write MR
>> code, and causes rewrite of large datasets even if a small amount of
>> data is added to it.
>
>Do people rewrite large datasets today just to add 1mb? I haven't
>heard of that from big users (Yahoo!, FB, Twitter, eBay..) or my
>customer base. If so I'd would have expected people to put energy
>into getting append working in 1.x which know was has put energy into
>(I know some people feel the 20-based design is unworkable, I don't
>know it well enough to comment there).

With HDFS, they do not rewrite large datasets just to add a small amount of data. Instead, they create new files and use a separate metadata service (or just file numbering conventions) to make the added data part of the large dataset. But with other file systems, they just ">>".

Thanks,

- milind

>---
>Milind Bhandarkar
>Greenplum Labs, EMC
>(Disclaimer: Opinions expressed in this email are those of the author,
>and do not necessarily represent the views of any organization, past or
>present, the author might be affiliated with.)