Guys,

The amount of data in the source dir hdfs://hydra1:57810/user/cdh-hadoop/mscdata/201010_raw: 22567369111
What I did was: I ran with all 43458 logs, and the counters are:

                        Map             Reduce          Total
FILE_BYTES_READ         253,905,706     372,708,857     626,614,563
HDFS_BYTES_READ         2,553,123,734   0               2,553,123,734
FILE_BYTES_WRITTEN      619,877,917     372,708,857     992,586,774
HDFS_BYTES_WRITTEN      0               535             535

Then I did a manual join of the files and ran again on the 336 files (the merge of all those files). The job hasn't finished yet, and the counters are:

                        Map             Reduce          Total
FILE_BYTES_READ         21,054,970,818  0               21,054,970,818
HDFS_BYTES_READ         16,772,063,486  0               16,772,063,486
FILE_BYTES_WRITTEN      39,797,038,008  10,404,287,551  50,201,325,559

I think the problem could be in the combination of the input files. Is the combination class aware of compression? Because *all my files are compressed*. Maybe the class performs a concatenation and we fall into the HDFS limitation with concatenated gzip files.

On Mon, Feb 28, 2011 at 8:47 PM, Charles Gonçalves <[email protected]> wrote:

> On Mon, Feb 28, 2011 at 7:39 PM, Thejas M Nair <[email protected]> wrote:
>
>> Hi Charles,
>> Which load function are you using?
>
> I'm using a user-defined load function ..
>
>> Is it the default (PigStorage)?
>
> Nops ...
>
>> In the hadoop counters for the job in the jobtracker ui, do you see the
>> expected number of input records being read?
>
> Is it possible to see the counters in the history interface on the JobTracker?
> I will run the jobs again to compare the counters, but my guess is: probably
> not!
>
>> -Thejas
>>
>> On 2/28/11 10:57 AM, "Charles Gonçalves" <[email protected]> wrote:
>>
>> I'm not using any filtering in the script.
>> I just want to see the total traffic per day across all logs.
>>
>> If I combine 1000 log files into one and run the script on that file,
>> I get the correct answer for those logs.
>> But when I run with all *43458* log files I get an incorrect output.
>> The correct result would be a histogram for each day of 2010-10, but the
>> result contains only data from 2010-10-21.
>> And if I process all the logs with an awk script, I get the correct answer.
>> On Mon, Feb 28, 2011 at 3:29 PM, Daniel Dai <[email protected]> wrote:
>>
>> > Not sure if I get your question. In 0.8, Pig combines small files into
>> > one map, so it is possible you get fewer output files.
>>
>> This is not the problem.
>> But thanks anyway!
>>
>> > If that is your concern, you can try to disable split combination using
>> > "-Dpig.splitCombination=false"
>> >
>> > Daniel
>> >
>> > Charles Gonçalves wrote:
>> >
>> >> I tried to process a big number of small files with Pig and I ran into
>> >> a strange problem.
>> >>
>> >> 2011-02-27 00:00:58,746 [Thread-15] INFO
>> >>   org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
>> >>   paths to process : *43458*
>> >> 2011-02-27 00:00:58,755 [Thread-15] INFO
>> >>   org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>> >>   input paths to process : *43458*
>> >> 2011-02-27 00:01:14,173 [Thread-15] INFO
>> >>   org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>> >>   input paths (combined) to process : *329*
>> >>
>> >> When the script finishes processing, the result covers only a subgroup
>> >> of the input files.
>> >> These are logs from a whole month, but the results are just from day 21.
>> >>
>> >> Maybe I'm missing something.
>> >> Any ideas?

--
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.: 55 31 34741485
Lab.: 55 31 34095840
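P.S. The concatenated-gzip suspicion above can be sketched in a few lines. This is only an illustration of the general failure mode, not Pig's actual combine code, and the "day" strings are made-up stand-ins for the log files: a decoder that stops after the first gzip member sees only the first file's records, while a multi-member-aware reader recovers everything.

```python
import gzip
import zlib

# Simulate `cat day21.gz day22.gz > combined.gz`: two gzip members
# concatenated back to back, as a naive file concatenation would produce.
blob = gzip.compress(b"2010-10-21 records\n") + gzip.compress(b"2010-10-22 records\n")

# A decoder that stops after the first gzip member: zlib with
# wbits = 16 + MAX_WBITS decodes exactly one gzip stream, then stops.
d = zlib.decompressobj(16 + zlib.MAX_WBITS)
first_only = d.decompress(blob)
print(first_only)   # only the day-21 records; the rest is silently dropped

# A multi-member-aware reader keeps going and recovers both files.
everything = gzip.decompress(blob)
print(everything)   # day-21 and day-22 records
```

That silent truncation after the first member would match the symptom: output from only one day even though a whole month of logs was read.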
