Guys,

The amount of data in the source dir:
hdfs://hydra1:57810/user/cdh-hadoop/mscdata/201010_raw  22567369111

What I did was:
I ran with all *43458* logs, and the counters are:

Counter              Map            Reduce          Total
FILE_BYTES_READ      253,905,706    372,708,857     626,614,563
HDFS_BYTES_READ      2,553,123,734  0               2,553,123,734
FILE_BYTES_WRITTEN   619,877,917    372,708,857     992,586,774
HDFS_BYTES_WRITTEN   0              535             535


I did a manual merge of the files and ran again on the resulting 336 files
(the merge of all those small files).
The job hasn't finished yet, and the counters so far are:

Counter              Map             Reduce          Total
FILE_BYTES_READ      21,054,970,818  0               21,054,970,818
HDFS_BYTES_READ      16,772,063,486  0               16,772,063,486
FILE_BYTES_WRITTEN   39,797,038,008  10,404,287,551  50,201,325,559


I think the problem could be in the combination of the input files.
Is the combination class aware of compression?
I ask because *all my files are compressed*.
Maybe the class performs a plain concatenation, and we hit the Hadoop
limitation on reading concatenated gzip files (see the sketch below).
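
To make the concern concrete, here is a quick local illustration (plain
shell, nothing Pig-specific). A file built by concatenating two gzip
members is still valid gzip, but a reader that stops after the first
member silently drops everything that follows it:

  # Illustrative only: simulate two compressed logs merged by concatenation.
  echo "2010-10-01 traffic" | gzip > a.gz
  echo "2010-10-21 traffic" | gzip > b.gz
  cat a.gz b.gz > both.gz

  # gzip itself decodes every member, so this prints both lines...
  zcat both.gz
  # ...but a decompressor that stops at the first gzip member (as some
  # Java/Hadoop readers have historically done) would surface only the
  # 2010-10-01 record.

I will also try rerunning with Daniel's suggestion from earlier in the
thread (-Dpig.splitCombination=false) to see whether the missing days
come back.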

On Mon, Feb 28, 2011 at 8:47 PM, Charles Gonçalves <[email protected]> wrote:

>
>
> On Mon, Feb 28, 2011 at 7:39 PM, Thejas M Nair <[email protected]> wrote:
>
>>  Hi Charles,
>> Which load function are you using ?
>>
> I'm using a user-defined load function...
>
>> Is it the default (PigStorage)?
>>
> Nope...
>
>
>>  In the hadoop counters for the job in the jobtracker ui, do you see the
>> expected number of input records being read?
>>
> Is it possible to see the counters in the history interface on the
> JobTracker?
> I will run the jobs again to compare the counters, but my guess is probably
> not!
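> (If it helps: I believe the 0.20-era CLI can print the counters for a
> completed job with something like "hadoop job -history all <job-output-dir>",
> though I haven't verified the exact flags.)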
>
>  -Thejas
>>
>>
>>
>>
>> On 2/28/11 10:57 AM, "Charles Gonçalves" <[email protected]> wrote:
>>
>> I'm not using any filtering in the script.
>> I just want to see the total traffic per day across all the logs.
>>
>> If I combine 1000 log files into one and run the script on that merged
>> file, I get the correct answer for those logs.
>> But when I run with all *43458* log files, I get incorrect output.
>> The correct result would be a histogram for each day of 2010-10, but the
>> result contains only data from 2010-10-21.
>> And if I process all the logs with an awk script, I get the correct answer.
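>>
>> For reference, the script is roughly of this shape (a sketch with
>> illustrative field names and the default load syntax, not my actual
>> loader UDF):
>>
>> -- sketch: sum the traffic bytes per day over the month of logs
>> logs   = LOAD 'hdfs://hydra1:57810/user/cdh-hadoop/mscdata/201010_raw'
>>              AS (ts:chararray, bytes:long);
>> days   = FOREACH logs GENERATE SUBSTRING(ts, 0, 10) AS day, bytes;
>> by_day = GROUP days BY day;
>> hist   = FOREACH by_day GENERATE group AS day, SUM(days.bytes) AS total;
>> DUMP hist;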
>>
>>
>> On Mon, Feb 28, 2011 at 3:29 PM, Daniel Dai <[email protected]>
>> wrote:
>>
>> > Not sure if I get your question. In 0.8, Pig combines small files into
>> > one map, so it is possible you get fewer output files.
>>
>> This is not the problem.
>> But thanks anyway!
>>
>> > If that is your concern, you can try to disable split combine using
>> > "-Dpig.splitCombination=false"
>> >
>> > Daniel
>> >
>> >
>> > Charles Gonçalves wrote:
>> >
>> >> I tried to process a large number of small files in Pig and ran into a
>> >> strange problem.
>> >>
>> >> 2011-02-27 00:00:58,746 [Thread-15] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : *43458*
>> >> 2011-02-27 00:00:58,755 [Thread-15] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : *43458*
>> >> 2011-02-27 00:01:14,173 [Thread-15] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : *329*
>> >>
>> >> When the script finishes, the result covers only a subset of the input
>> >> files.
>> >> These are logs from a whole month, but the results are only from day 21.
>> >>
>> >>
>> >> Maybe I'm missing something.
>> >> Any ideas?
>> >>
>> >>
>> >>
>> >
>> >
>>
>



-- 
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.:  55 31 34741485
Lab.: 55 31 34095840
