A lot of work in Hadoop concerns splittable compression. Could this be solved by offering compression at the HDFS block (i.e. 64 MB) level, just like many OS filesystems do?
http://stackoverflow.com/questions/6511255/why-cant-hadoop-split-up-a-large-text-file-and-then-compress-the-splits-using-g?rq=1 discusses this and suggests the issue is separation of concerns. However, if the compression were done at the *HDFS block* level (with perhaps a single flag indicating as much), it would be completely transparent to readers and writers. This is exactly how NTFS compression works, for example; applications need no knowledge of the compression. HDFS, which is designed for streaming access rather than random in-place writes, seems like a natural candidate for this.

Thoughts?
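To make the idea concrete, here is a minimal, self-contained Java sketch of the behaviour I have in mind. None of these names (BlockMeta, writeBlock, openBlock) are real HDFS APIs; they are invented purely for illustration, using only java.util.zip, to show how a single per-block "compressed" flag could let the write path deflate block bytes and the read path inflate them without the client ever being aware of it:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/**
 * Toy illustration only: compression applied per block, signalled by a
 * single per-block flag, and undone transparently on read. BlockMeta,
 * writeBlock and openBlock are hypothetical, not HDFS classes.
 */
public class TransparentBlockCompressionSketch {

    /** Hypothetical per-block metadata: just a "compressed" bit. */
    static final class BlockMeta {
        final boolean compressed;
        BlockMeta(boolean compressed) { this.compressed = compressed; }
    }

    /** Write side: deflate the block bytes before storing them if the flag is set. */
    static byte[] writeBlock(byte[] raw, BlockMeta meta) throws IOException {
        if (!meta.compressed) {
            return raw;
        }
        ByteArrayOutputStream stored = new ByteArrayOutputStream();
        try (DeflaterOutputStream def = new DeflaterOutputStream(stored)) {
            def.write(raw);
        }
        return stored.toByteArray();
    }

    /** Read side: the flag decides whether to inflate; the reader sees original bytes either way. */
    static InputStream openBlock(byte[] stored, BlockMeta meta) {
        InputStream in = new ByteArrayInputStream(stored);
        return meta.compressed ? new InflaterInputStream(in) : in;
    }

    public static void main(String[] args) throws IOException {
        byte[] block = "some block contents, repeated, repeated, repeated"
                .getBytes(StandardCharsets.UTF_8);
        BlockMeta meta = new BlockMeta(true);

        byte[] stored = writeBlock(block, meta);                   // what would land on disk
        byte[] readBack = openBlock(stored, meta).readAllBytes();  // what the reader sees

        System.out.println("stored " + stored.length + " bytes for " + block.length + " raw bytes");
        System.out.println("round trip ok: "
                + new String(readBack, StandardCharsets.UTF_8)
                        .equals(new String(block, StandardCharsets.UTF_8)));
    }
}

Since the flag travels with the block, splittability stops being a property of the file format: each block decompresses independently, which is the point of doing it below the file layer.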