You are using gzip, so the files won't be splittable. You may be better off using the snappy codec with sequence files.
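As a rough sketch of what that could look like for an HDFS sink (the agent name "a1", sink name "k1", and path below are just placeholders, not taken from your config; the roll interval of 600s matches the 10-minute roll you mentioned):

# hypothetical agent/sink names for illustration only
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
# write sequence files compressed with snappy instead of a gzip CompressedStream
a1.sinks.k1.hdfs.fileType = SequenceFile
a1.sinks.k1.hdfs.codeC = snappy
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.batchSize = 100
# roll every 10 minutes, as in your setup
a1.sinks.k1.hdfs.rollInterval = 600

You'd also need the snappy native libraries available on the Flume and Hadoop nodes for this to work.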
On Thu, Jan 30, 2014 at 10:51 AM, Jimmy <jimmyj...@gmail.com> wrote:
> I am running a few tests and would like to confirm whether this is true...
>
> hdfs.codeC = gzip
> hdfs.fileType = CompressedStream
> hdfs.writeFormat = Text
> hdfs.batchSize = 100
>
> Now let's assume I have a large number of transactions and I roll the file
> every 10 minutes.
>
> It seems the tmp file stays at 0 bytes and flushes all at once after 10
> minutes, whereas if I don't use compression, the file grows as data are
> written to HDFS.
>
> Is this correct?
>
> Do you see any drawback in using CompressedStream with very large files?
> In my case a 120MB compressed file (block size) is 10x that uncompressed.