Thank you, Daniel. Follow-up question: is there any reason why bzip is processed by Pig but gzip is processed by Hadoop?
Thanks,
Tomas

On Mon, May 18, 2015 at 8:35 AM, Daniel Dai <[email protected]> wrote:

> The decompression of gzip happens on the Hadoop side (TextInputFormat), so if
> Hadoop fixed concatenated gzip, Pig should be fixed as well. Bzip, however, is
> processed by Pig code, which does not support concatenation.
>
> It seems we need to update the documentation.
>
> Daniel
>
> On 5/5/15, 3:51 AM, "Tomas Hudik" <[email protected]> wrote:
>
> >Hi,
> >I read a section:
> >https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
> >
> >according to which any concatenated bzip/gzip files will produce strange
> >results.
> >I did a test: I concatenated some files and processed them. However, all
> >the results were identical to the ones produced from non-concatenated
> >files. Why? They should be different...
> >
> >Then I saw: https://issues.apache.org/jira/i#browse/HADOOP-6835
> >
> >My questions:
> >1. Is https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
> >still correct, and will concatenation produce wrong results? Is this true
> >for all concatenated files, or does it only happen occasionally?
> >2. Is there any way to find out whether a tar.gz or tar.bz2 file is
> >concatenated?
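
On question 2 in the quoted message (how to tell whether a .gz or .bz2 file is concatenated), one possible approach is to decompress the file one stream at a time and count how many separate streams it contains. The sketch below is only an illustration in Python, not something Pig or Hadoop provides. The function names count_gzip_members and count_bzip2_streams are made up for this example, and it assumes the whole file fits in memory. A count above 1 only means the file holds several compressed streams back to back; it does not say how the file was produced.

```python
import bz2
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members in data (1 means a single, non-concatenated stream)."""
    members = 0
    while data:
        # 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        d.decompress(data)
        if not d.eof:
            raise ValueError("truncated gzip data")
        members += 1
        # Bytes left over after the end of this member belong to the next one.
        data = d.unused_data
    return members


def count_bzip2_streams(data: bytes) -> int:
    """Count the bzip2 streams in data (1 means a single, non-concatenated stream)."""
    streams = 0
    while data:
        d = bz2.BZ2Decompressor()
        d.decompress(data)
        if not d.eof:
            raise ValueError("truncated bzip2 data")
        streams += 1
        # Bytes left over after the end of this stream belong to the next one.
        data = d.unused_data
    return streams


if __name__ == "__main__":
    # Hypothetical sample data: two independently compressed pieces, concatenated.
    single = gzip.compress(b"first\n")
    joined = single + gzip.compress(b"second\n")
    print(count_gzip_members(single), count_gzip_members(joined))
```

For a real file, the bytes could be read with open(path, "rb").read() and passed to the matching function. Trailing bytes that are not a valid stream make the underlying decompressor raise an error instead of being counted.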
