I have a mapred job that simply performs data transformations in its Mapper. I don't need sorting or reduction, so I don't use a Reducer.
Without getting too detailed, the nature of my processing is such that it is much more efficient if I can process blocks of records at a time. So what I'd like to do is this: in my Mapper's map() method, simply add each incoming record to a list, and once that list reaches a certain size, process the batched-up records and then call output.collect() multiple times to emit the output records, each corresponding to one of the input records.

At the end of the job, each Mapper will be left holding a partially full block of records. I'd like to process these final blocks as well, regardless of their size, and emit the corresponding output records. How can I accomplish this? Inside Mapper#map(), I have no way of knowing whether the current record is the final one. The only end-of-job hook I'm aware of is overriding MapReduceBase#close(), but in that method there is no OutputCollector available.

Is it possible to batch up records and, at end of job, process and emit any final partial blocks? A rough sketch of the pattern I'm describing is below. Thanks!
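For concreteness, here's a sketch of what I'm attempting (BatchingMapper, BATCH_SIZE, and processBatch() are made-up names, and the identity transform just stands in for my real block processing; the empty close() is exactly where I'm stuck):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class BatchingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, Text> {

        private static final int BATCH_SIZE = 1000; // placeholder block size

        private final List<Text> batch = new ArrayList<Text>();

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<NullWritable, Text> output,
                        Reporter reporter) throws IOException {
            // Copy the value: Hadoop reuses the same Writable instance per call.
            batch.add(new Text(value));
            if (batch.size() >= BATCH_SIZE) {
                // One output record per input record in the block.
                for (Text result : processBatch(batch)) {
                    output.collect(NullWritable.get(), result);
                }
                batch.clear();
            }
        }

        @Override
        public void close() throws IOException {
            // The final partial block is still sitting in `batch` here, but
            // close() has no OutputCollector, so I can't emit its results.
        }

        // Placeholder for the real transformation, which is far more efficient
        // when it sees a whole block of records at once.
        private List<Text> processBatch(List<Text> records) {
            List<Text> results = new ArrayList<Text>(records.size());
            for (Text record : records) {
                results.add(record); // identity transform for illustration
            }
            return results;
        }
    }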