I have a Hadoop Streaming program that crawls the web for data items, processes each retrieved item, and then stores the results on S3. For each processed item, a directory is created on S3 to hold the results of the processing. At the end of a run, every one of these directories turns out to be duplicated: e.g., if I process items A1 and A2, I get two directories for the results of A1 and two for the results of A2, and each pair of directories has identical contents. I've checked my code and don't see anything obvious that could cause this. Furthermore, it appears that only one map task handles any given data item. Any suggestions as to what might be going on?
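
For reference, the mapper follows roughly the pattern sketched below (simplified; the bucket name and the fetch_item/process helpers are placeholders, and the boto3 upload stands in for however the real job writes to S3):

    #!/usr/bin/env python
    # Simplified sketch of the streaming mapper: one input line per item ID.
    import sys
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-results-bucket"  # placeholder bucket name

    def fetch_item(item_id):
        # placeholder: retrieve the raw data for this item from the web
        ...

    def process(raw):
        # placeholder: return a dict of {filename: bytes} for this item
        ...

    def main():
        for line in sys.stdin:
            item_id = line.strip()
            if not item_id:
                continue
            results = process(fetch_item(item_id))
            # One "directory" per item: all keys share the results/<item_id>/ prefix.
            for name, body in results.items():
                s3.put_object(Bucket=BUCKET,
                              Key="results/%s/%s" % (item_id, name),
                              Body=body)
            # Report status on stderr so Hadoop doesn't consider the task hung.
            sys.stderr.write("reporter:status:processed %s\n" % item_id)

    if __name__ == "__main__":
        main()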
Thanks, John