AutoInputFormat + (larger) bzip2 files cause multiple runs over same file
-------------------------------------------------------------------------
Key: HADOOP-6290
URL: https://issues.apache.org/jira/browse/HADOOP-6290
Project: Hadoop Common
Issue Type: Bug
Affects Versions: 0.18.3
Reporter: Erik Forsberg

When running a streaming job whose input directory contains a few .bzip2 files, each roughly 110MiB in size (compressed), with -inputformat org.apache.hadoop.streaming.AutoInputFormat on the streaming command line, each file is processed twice; i.e., if there are two bzip2 files in the directory, four mappers are run. In a wordcount M/R job, the resulting counts are doubled, which indicates that each input file is analysed twice.

This was discovered while trying out dumbo, which uses AutoInputFormat by default. See http://groups.google.com/group/dumbo-user/browse_frm/thread/84b04b2320d4bbb0?hl=en

It seems this can't be reproduced on small files. It is possible that the file has to be larger than the DFS blocksize, which in my case is set to 64MiB.

I'm using Cloudera's hadoop distribution, version 0.18.3-6cloudera0.3.0~intrepid.

Please let me know if I need to provide further details.
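
To illustrate the blocksize hypothesis, here is a minimal Python sketch of the arithmetic. It assumes the input format cuts each file into block-sized splits with a 10% slack on the last one, and that every split is handed to its own mapper. The function name `count_splits` and the `SPLIT_SLOP` value are illustrative assumptions, not confirmed behaviour of AutoInputFormat. Under those assumptions a 110MiB file with a 64MiB blocksize yields two splits, which matches the four mappers seen for two files, while a small file yields only one.

```python
MIB = 1024 * 1024

# Assumed slack factor: the last split may be up to 10% larger than the
# nominal split size before another split is created.
SPLIT_SLOP = 1.1


def count_splits(file_size: int, split_size: int) -> int:
    """Count block-sized splits for one file (hypothetical model)."""
    if file_size <= 0:
        return 0
    splits = 0
    remaining = file_size
    # Carve off full-size splits while clearly more than one split remains.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    block_size = 64 * MIB             # DFS blocksize from the report
    files = [110 * MIB, 110 * MIB]    # two bzip2 files, sizes approximate
    mappers = sum(count_splits(size, block_size) for size in files)
    print("splits per file:", count_splits(files[0], block_size))
    print("total mappers:", mappers)
```

If each of those splits is decompressed from the start of the file rather than from its own offset, every mapper would see the whole file, which would be consistent with the doubled wordcount. This is only a plausible reading of the symptoms, not a verified root cause.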