AutoInputFormat + (larger) bzip2 files cause multiple runs over same file
-------------------------------------------------------------------------

                 Key: HADOOP-6290
                 URL: https://issues.apache.org/jira/browse/HADOOP-6290
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 0.18.3
            Reporter: Erik Forsberg


When running a streaming job whose input directory contains a few .bzip2 files, 
each roughly 110MiB in size (compressed), with -inputformat
org.apache.hadoop.streaming.AutoInputFormat on the streaming command line, each 
file is processed twice. For example, if there are two bzip2 files in the 
directory, four mappers are run. 

When running a wordcount M/R job, the resulting counts are doubled, which 
indicates that each input file is analysed twice.

This was discovered while trying out dumbo, which uses AutoInputFormat by 
default. See 
http://groups.google.com/group/dumbo-user/browse_frm/thread/84b04b2320d4bbb0?hl=en

This does not seem to be reproducible with small files. Possibly the file has 
to be larger than the DFS block size, which in my case is set to 64MiB.
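
The block-size hypothesis can be checked with a little arithmetic. The sketch
below is not Hadoop code. It assumes that the split size equals the DFS block
size and that splits are cut with a 1.1 slop factor, in the style of
FileInputFormat. The function name count_splits and the constant names are
made up for illustration. If every split of a non-splittable bzip2 file ends
up decompressing the whole file, the number of splits is also the factor by
which the counts are multiplied.

```python
MIB = 1024 * 1024

# Assumption: slop factor modelled on FileInputFormat-style splitting;
# the last split may be up to 10% larger than the nominal split size.
SPLIT_SLOP = 1.1


def count_splits(file_size, split_size):
    """Return how many splits a file of file_size bytes would be cut into.

    Hypothetical helper: assumes split_size equals the DFS block size.
    """
    splits = 0
    remaining = file_size
    # Cut full-sized splits while the remainder is clearly larger than one split.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left over (if anything) becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


# Figures from this report: 110MiB files, 64MiB block size.
per_file = count_splits(110 * MIB, 64 * MIB)   # 2 splits per file
total_mappers = 2 * per_file                   # 4 mappers for two files
small_file = count_splits(10 * MIB, 64 * MIB)  # 1 split, so no duplication
```

Under these assumptions a 110MiB file yields two splits, which matches the
four mappers seen for two files, while a file smaller than one block yields a
single split, which would explain why small files do not show the problem.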

I'm using Cloudera's hadoop distribution, version 
0.18.3-6cloudera0.3.0~intrepid.

Please let me know if I need to provide further details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.