I want to confirm with the list something that I'm seeing.
I needed to confirm that my Reader was reading our file format
correctly, so I created an MR job that simply outputs each K/V pair to the
reducer, which then just writes each one to the output file. This
allows me to check by hand that all K/V data points from our file
format are getting pulled out of the file correctly. I have set up our
InputFormat, RecordReader, and Reader subclasses for our specific file
format.
While running some basic tests on a small (1 MB) single file I noticed
something odd: I was getting 2 copies of each data point in the
output file. Initially I thought my Reader was just somehow reading the
data point and not moving the read head, but I verified that was not the
case through a series of tests.
I then went on to reason that since I had 2 mappers by default on my
job, and only 1 input file, that each mapper must be reading the file
independently. I then set the -m flag to 1, and I got the proper output.
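
To make the guess concrete, here is a minimal plain-Java sketch of what I think may be happening. It uses no Hadoop classes; Split, makeSplits, readIgnoringSplit, readRespectingSplit, and runJob are all hypothetical names made up for illustration, and it assumes one split per mapper, which may not be how the framework actually assigns work. It only models the hypothesis that a reader which ignores its assigned range re-reads the whole file once per mapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitDuplicationSketch {

    /** Hypothetical stand-in for an input split: a [start, end) range of record indices. */
    static final class Split {
        final int start;
        final int end;

        Split(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Cut a file of recordCount records into numSplits contiguous ranges (assumed: one per mapper). */
    static List<Split> makeSplits(int recordCount, int numSplits) {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            int start = (int) ((long) recordCount * i / numSplits);
            int end = (int) ((long) recordCount * (i + 1) / numSplits);
            splits.add(new Split(start, end));
        }
        return splits;
    }

    /** A reader that ignores the split it was handed and returns every record in the file. */
    static List<String> readIgnoringSplit(List<String> file, Split split) {
        return new ArrayList<>(file);
    }

    /** A reader that returns only the records inside its own split. */
    static List<String> readRespectingSplit(List<String> file, Split split) {
        return new ArrayList<>(file.subList(split.start, split.end));
    }

    /** Run one simulated mapper per split and collect everything the mappers emit. */
    static List<String> runJob(List<String> file, int numMappers, boolean respectSplits) {
        List<String> output = new ArrayList<>();
        for (Split split : makeSplits(file.size(), numMappers)) {
            output.addAll(respectSplits
                    ? readRespectingSplit(file, split)
                    : readIgnoringSplit(file, split));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> file = Arrays.asList("k1=v1", "k2=v2", "k3=v3", "k4=v4");
        System.out.println(runJob(file, 2, false).size()); // 8: every record appears twice
        System.out.println(runJob(file, 1, false).size()); // 4: a single mapper hides the problem
        System.out.println(runJob(file, 2, true).size());  // 4: no duplicates when ranges are honored
    }
}
```

If this model is right, -m 1 would only mask the duplication, and the real question would be whether my RecordReader honors the range it is given.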
Is it safe to assume that, when testing on a file smaller than the block
size, I should always use -m 1 in order to get a proper block->mapper
mapping? Also, should I assume that if you have more mappers than disk
blocks involved, you will get duplicate values? I may have set
something wrong; I just wanted to check. Thanks
Josh Patterson
TVA