Re: MultithreadedMapper

2012-07-26 Thread kenyh
A multithreaded mapper gets more chances to combine the mapper output, and the locality of some global data is also better. But the implementation in Hadoop 1.0.2 uses heavy synchronization, which adds a lot of overhead. Are there any optimizations for the multithreaded mapper?
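To make the synchronization concern concrete, here is a minimal sketch of the pattern being described: several worker threads share one record source and one output sink, and every read and every write goes through the same lock. This is not Hadoop's code. The class name `SharedReaderSketch`, the method `runWorkers`, and the upper-casing "map" step are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not taken from the Hadoop source tree.
class SharedReaderSketch {

    // Starts `threads` workers that pull records from one shared iterator and
    // append mapped results to one shared list, both guarded by a single lock.
    public static List<String> runWorkers(List<String> input, int threads) {
        final Iterator<String> reader = input.iterator(); // stands in for the record reader
        final List<String> output = new ArrayList<>();    // stands in for the output collector
        final Object lock = new Object();

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                while (true) {
                    String record;
                    synchronized (lock) {                  // every read is serialized
                        if (!reader.hasNext()) {
                            return;
                        }
                        record = reader.next();
                    }
                    String mapped = record.toUpperCase();  // the only part that runs in parallel
                    synchronized (lock) {                  // every write is serialized
                        output.add(mapped);
                    }
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> out = runWorkers(Arrays.asList("a", "b", "c", "d"), 3);
        Collections.sort(out); // completion order is nondeterministic, so sort before printing
        System.out.println(out);
    }
}
```

When the per-record work is as cheap as it is here, the threads spend most of their time waiting on the lock, which is the overhead the question is about.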

Re: MultithreadedMapper

2012-07-26 Thread syscokid
Why multithread the mapper? Just create more mappers. That way you spread the data load as well as the mapping load potentially across multiple nodes.
kenyh wrote:
> I wonder if there are any optimization about the multithread mapper to
> decrease the contention of input reading and output?
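As a rough illustration of the "create more mappers" suggestion: the number of map tasks for a file-based job tracks the number of input splits, so a smaller split size yields more mappers. The sketch below is only back-of-the-envelope arithmetic, assuming one map task per split and ignoring the finer details of how splits are actually computed. The class `SplitMath`, the method `estimateMapTasks`, and the sizes used are hypothetical.

```java
// Hypothetical helper for illustration; not part of any Hadoop API.
class SplitMath {

    // Estimates the number of map tasks as ceil(inputBytes / splitBytes),
    // assuming one task per split and no special handling of the final split.
    public static long estimateMapTasks(long inputBytes, long splitBytes) {
        if (inputBytes < 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and split size positive");
        }
        return (inputBytes + splitBytes - 1) / splitBytes; // ceiling division
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        long mib = 1024L * 1024;
        // Illustrative sizes only: a 1 GiB input with 64 MiB and 16 MiB splits.
        System.out.println(estimateMapTasks(oneGiB, 64 * mib)); // prints 16
        System.out.println(estimateMapTasks(oneGiB, 16 * mib)); // prints 64
    }
}
```

Under these assumptions, quartering the split size quadruples the number of mappers, and unlike extra threads inside a single task, those mappers can be scheduled on different nodes.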

Re: MultithreadedMapper

2012-07-26 Thread Radim Kolar
> But I found that synchronization is needed for record reading (reading the input key and value) and result output.
I use Spring Batch for that. It has I/O buffering built in, and it is very easy to use and well documented.

Re: MultithreadedMapper

2012-07-26 Thread Doug Cutting
On Thu, Jul 26, 2012 at 7:42 AM, Robert Evans wrote:
> About the only time that MultiThreaded mapper makes a lot of sense is if
> there is a lot of computation associated with each key/value pair.
Or if the mapper does a lot of I/O to some external resource, e.g., a web crawler.
Doug

Re: MultithreadedMapper

2012-07-26 Thread Robert Evans
In general, multithreading does not get you much in traditional Map/Reduce. If you want the mappers to run faster, you can drop the split size and get a similar result, because you get more parallelism. This is the use case that we have typically concentrated on. About the only time that MultiThread