Normally, Pig 0.8 just combines the small input files into larger input splits <http://pig.apache.org/docs/r0.8.0/cookbook.html#Combine+Small+Input+Files>, so you should not lose any records.
You might be filtering out or limiting some records in your script. To check, try a script that does only a LOAD and a STORE, and verify that the output is the same as the input data.

Romain

On Sat, Feb 26, 2011 at 7:25 PM, Charles Gonçalves <[email protected]> wrote:

> I tried to process a large number of small files in Pig and ran into a
> strange problem.
>
> 2011-02-27 00:00:58,746 [Thread-15] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
> to process : *43458*
> 2011-02-27 00:00:58,755 [Thread-15] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths to process : *43458*
> 2011-02-27 00:01:14,173 [Thread-15] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths (combined) to process : *329*
>
> When the script finishes processing, the result covers only a subset of
> the input files.
> These are logs from a whole month, but the results are only from day 21.
>
> Maybe I'm missing something.
> Any ideas?
>
> --
> *Charles Ferreira Gonçalves*
> http://homepages.dcc.ufmg.br/~charles/
> UFMG - ICEx - Dcc
> Cel.: 55 31 87741485
> Tel.: 55 31 34741485
> Lab.: 55 31 34095840
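
For reference, the LOAD-and-STORE check suggested in the reply could look roughly like the sketch below. Both paths are hypothetical placeholders, and the default PigStorage loader is an assumption; the original script's loader and paths are not shown in this thread, so substitute whatever it actually uses.

```pig
-- Pass-through check: no FILTER, LIMIT, or GROUP, so every input record
-- should reach the output.
-- '/logs/2011-02/*' is a hypothetical input glob covering the whole month.
raw = LOAD '/logs/2011-02/*' USING PigStorage();

-- '/tmp/passthrough_check' is a hypothetical output directory; it must not
-- already exist when the script runs.
STORE raw INTO '/tmp/passthrough_check' USING PigStorage();
```

If the stored output covers the whole month, the combined splits are being read in full and the missing days are more likely dropped by a later step in the original script.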
