You should be able to keep a reference to the OutputCollector provided
to the #map() method, and then use it in the #close() method.

I believe there's a new API that will provide the output collector to
the close() method directly, via a context object, but in the meantime
I think the approach above should work.
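
As a rough illustration of the approach above, here is a minimal sketch, not a tested Hadoop job. To keep it runnable without Hadoop on the classpath, it uses a stand-in Collector interface in place of org.apache.hadoop.mapred.OutputCollector, and map()/close() only mirror the shape of Mapper#map() and MapReduceBase#close() (the Reporter argument is omitted). The names BatchingMapper, BatchingDemo, BATCH_SIZE, and the upper-casing "processing" step are all hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.hadoop.mapred.OutputCollector, so the sketch
// compiles without Hadoop (an assumption made purely for illustration).
interface Collector<K, V> {
    void collect(K key, V value) throws IOException;
}

// Hypothetical mapper: buffers records, processes them in blocks, and
// releases the final partial block from close().
class BatchingMapper {
    private static final int BATCH_SIZE = 3;  // hypothetical block size
    private final List<String[]> batch = new ArrayList<>();
    private Collector<String, String> savedOutput;  // kept for use in close()

    // Mirrors the shape of Mapper#map(), minus the Reporter argument.
    public void map(String key, String value, Collector<String, String> output)
            throws IOException {
        savedOutput = output;  // remember the collector for later
        batch.add(new String[] {key, value});
        if (batch.size() >= BATCH_SIZE) {
            flush();
        }
    }

    // Mirrors MapReduceBase#close(): release whatever is still buffered.
    public void close() throws IOException {
        if (savedOutput != null && !batch.isEmpty()) {
            flush();
        }
    }

    // Placeholder block processing: upper-case each value, emit one
    // output record per input record, then empty the buffer.
    private void flush() throws IOException {
        for (String[] rec : batch) {
            savedOutput.collect(rec[0], rec[1].toUpperCase());
        }
        batch.clear();
    }
}

class BatchingDemo {
    // Feeds n records through the mapper and returns everything collected.
    static List<String> run(int n) throws IOException {
        List<String> out = new ArrayList<>();
        Collector<String, String> collector = (k, v) -> out.add(k + "=" + v);
        BatchingMapper mapper = new BatchingMapper();
        for (int i = 0; i < n; i++) {
            mapper.map("k" + i, "v" + i, collector);
        }
        mapper.close();  // flushes the final partial block
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(run(5));
    }
}
```

The null check in close() covers a mapper that never received a record, in which case map() was never called and there is no collector to write to.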

-----Original Message-----
From: Stuart White [mailto:stuart.whi...@gmail.com] 
Sent: 17 March 2009 12:13
To: core-user@hadoop.apache.org
Subject: Release batched-up output records at end-of-job?

I have a mapred job that simply performs data transformations in its
Mapper.  I don't need sorting or reduction, so I don't use a Reducer.

Without getting too detailed, the nature of my processing is such that
it is much more efficient if I can process blocks of records
at-a-time.  So, what I'd like to do is, in my Mapper, in the map()
function, simply add the incoming record to a list, and once that list
reaches a certain size, process the batched-up records, and then call
output.collect() multiple times to release the output records, each
corresponding to one of the input records.

At the end of the job, my Mappers will have partially full blocks of
records.  I'd like to go ahead and process these blocks at end-of-job,
regardless of their sizes, and release the corresponding output
records.

How can I accomplish this?  In my Mapper#map(), I have no way of
knowing whether a record is the final record.  The only end-of-job
hook that I'm aware of is for my Mapper to override
MapReduceBase#close(), but when in that method, there is no
OutputCollector available.

Is it possible to batch-up records, and at end-of-job, process and
release any final partial blocks?

Thanks!


