On Wed, Mar 21, 2012 at 10:47 AM, <milind.bhandar...@emc.com> wrote:
> Answers inline.
>
> On 3/21/12 10:32 AM, "Eli Collins" <e...@cloudera.com> wrote:
>
>>
>>Why not just write new files and use Har files, because Har files are a
>>pita?
>
> Yes, and har creation is an MR job, which is totally I/O bound, and yet
> takes up slots/containers, reducing cluster utilization.
>
>>Can you elaborate on the 1st one, how it's especially helpful for
>>archival?
>
> Say you have daily log files (consider many small job history files).
> Instead of keeping them as separate files, one appends them to a monthly
> file (this in itself is a complete rewrite), but appending monthly files
> to a year-to-date file should not require a rewrite (because after March,
> rewriting becomes very inefficient).
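
(For concreteness, the append-based rollup Milind describes would look
roughly like the sketch below, using the stock FileSystem.append() API.
The paths are made up and error handling is omitted; this is an
illustration, not code from either author.)

// Rough sketch only: fold a daily file into a year-to-date file with
// FileSystem.append(), so the existing ytd.log is not rewritten.
// Assumes append is enabled on the cluster; paths are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RollupAppend {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path yearToDate = new Path("/logs/2012/ytd.log");   // hypothetical
    Path daily      = new Path("/logs/2012/03/21.log"); // hypothetical

    // Only the daily bytes are written; the year-to-date file keeps its
    // existing blocks instead of being copied wholesale.
    try (FSDataInputStream in = fs.open(daily);
         FSDataOutputStream out = fs.append(yearToDate)) {
      IOUtils.copyBytes(in, out, conf, false);
    }
  }
}
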
Why not just keep the original daily files instead of continually either
rewriting (yuck) or duplicating (yuck) the data by aggregating them into
rollups?  I can think of two reasons:

1. The daily files are smaller than 1 block (seems unlikely)
2. The small files problem (a typical NN can store 100-200M files, so a
   problem for big users)

In which case maybe better to focus on #2 rather than work around it?

Thanks,
Eli

> Reducing the number of files this way also makes it easy to copy, take
> snapshots, etc., without having to write special parallel code to do it.
>
>>
>>I assume the 2nd one refers to not having to use Multi*InputFormat. And
>>the 3rd refers to appending to an old file instead of creating a new
>>one.
>
> Yes.
>
>>
>>> In addition, the small-files problem in HDFS forces people to write MR
>>> code, and causes rewrite of large datasets even if a small amount of
>>> data is added to it.
>
>
>>
>>Do people rewrite large datasets today just to add 1mb?  I haven't
>>heard of that from big users (Yahoo!, FB, Twitter, eBay..) or my
>>customer base.  If so I would have expected people to put energy into
>>getting append working in 1.x, which as far as I know no one has (I know
>>some people feel the 20-based design is unworkable; I don't know it well
>>enough to comment there).
>
> With HDFS, they do not rewrite large datasets just to add a small amount
> of data. Instead they create new files, and use a separate
> metadata service (or just file numbering conventions) to make the added
> data part of the large dataset. But with other file systems, they just
> ">>".
>
> Thanks,
>
> - milind
>
>
>>---
>>Milind Bhandarkar
>>Greenplum Labs, EMC
>>(Disclaimer: Opinions expressed in this email are those of the author,
>>and do not necessarily represent the views of any organization, past or
>>present, the author might be affiliated with.)
>
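
(Likewise, a rough sketch of the "new file plus naming convention"
workaround Milind contrasts with ">>". The directory layout and part
numbering are invented for illustration; neither author's actual code.)

// Rough sketch only: each small delta becomes a new numbered part file,
// and readers glob the directory to see the whole dataset. This is the
// bookkeeping that plain ">>" on a local file system avoids.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartFileConvention {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Adding ~1MB of data means writing a brand new part file...
    Path newPart = new Path("/dataset/part-00042");  // hypothetical numbering
    try (FSDataOutputStream out = fs.create(newPart, false)) {
      out.writeBytes("small delta of new records\n");
    }

    // ...and every reader has to know the convention and glob for all
    // the parts to reassemble the logical dataset.
    FileStatus[] parts = fs.globStatus(new Path("/dataset/part-*"));
    if (parts != null) {
      for (FileStatus part : parts) {
        System.out.println(part.getPath());
      }
    }
  }
}
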