I have a Hadoop Streaming program that crawls the web for data items, processes each retrieved item, and then stores the results on S3. For each processed item, a directory is created on S3 to hold the results of the processing. At the end of a run, every one of these directories turns out to be duplicated: e.g., if I process items A1 and A2, I get two directories for the results of A1 and two for the results of A2, and each pair of directories has identical contents. I've checked my code and don't see anything obvious that could cause this. Furthermore, it appears that only one map task handles any given data item. Any suggestions as to what might be going on?
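
For reference, the mapper follows roughly the pattern sketched below (simplified; the bucket name and the fetch_item/process helpers are placeholders, and the boto3 upload stands in for however the real job writes to S3):

    #!/usr/bin/env python
    # Simplified sketch of the streaming mapper: one input line per item ID.
    import sys
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-results-bucket"  # placeholder bucket name

    def fetch_item(item_id):
        # placeholder: retrieve the raw data for this item from the web
        ...

    def process(raw):
        # placeholder: return a dict of {filename: bytes} for this item
        ...

    def main():
        for line in sys.stdin:
            item_id = line.strip()
            if not item_id:
                continue
            results = process(fetch_item(item_id))
            # One "directory" per item: all keys share the results/<item_id>/ prefix.
            for name, body in results.items():
                s3.put_object(Bucket=BUCKET,
                              Key="results/%s/%s" % (item_id, name),
                              Body=body)
            # Report status on stderr so Hadoop doesn't consider the task hung.
            sys.stderr.write("reporter:status:processed %s\n" % item_id)

    if __name__ == "__main__":
        main()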
Thanks, John