A multithreaded mapper gets more chances to combine the mapper output, and
the locality of some global data is also better. But the implementation in
Hadoop 1.0.2 uses heavy synchronization, which adds a lot of overhead. Are
there any optimizations for the multithreaded mapper?
Why multithread the mapper? Just create more mappers. That way you spread the
data load as well as the mapping load potentially across multiple nodes.
kenyh wrote:
>
> I wonder if there are any optimizations for the multithreaded mapper to
> decrease the contention on input reading and output?
But I found that synchronization is needed for record reading (reading
the input key and value) and for result output.
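
To make the contention concrete, here is a minimal plain-Java sketch of the
pattern described above. It is not Hadoop's code: SharedReader, SharedWriter
and run are hypothetical names made up for illustration, and a word count
stands in for the map function. Several threads pull records from one reader
and push results to one writer, and each of those two objects is guarded by a
single lock, so only the map work in between runs in parallel.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedReaderSketch {
    // Hypothetical stand-in for the task's single record reader.
    static final class SharedReader {
        private final Iterator<String> it;
        SharedReader(List<String> records) { this.it = records.iterator(); }
        // One lock guards the reader, so every thread serializes here.
        // Returns null when the input is exhausted (records must be non-null).
        synchronized String next() { return it.hasNext() ? it.next() : null; }
    }

    // Hypothetical stand-in for the task's single output collector.
    static final class SharedWriter {
        private final Map<String, Integer> counts = new HashMap<>();
        // A second lock: every emitted pair passes through it.
        synchronized void write(String key, int value) {
            counts.merge(key, value, Integer::sum);
        }
        synchronized Map<String, Integer> snapshot() {
            return new TreeMap<>(counts);
        }
    }

    static Map<String, Integer> run(List<String> records, int threads)
            throws InterruptedException {
        SharedReader reader = new SharedReader(records);
        SharedWriter writer = new SharedWriter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                String line;
                // Read under the reader lock, map outside any lock,
                // then write under the writer lock.
                while ((line = reader.next()) != null) {
                    for (String word : line.split("\\s+")) {
                        if (!word.isEmpty()) {
                            writer.write(word, 1);
                        }
                    }
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return writer.snapshot();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> input = Arrays.asList("a b a", "b c", "a");
        System.out.println(run(input, 4)); // prints {a=3, b=2, c=1}
    }
}
```

When the per-record work is as cheap as it is here, the threads spend most of
their time waiting on the two locks, which is the overhead being asked about.
Handing out several records per lock acquisition, or buffering output per
thread and flushing it in batches, would reduce the number of trips through
each lock; whether that pays off depends on how expensive the map function is.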
I use Spring Batch for that. It has I/O buffering built in, and it is very
easy to use and well documented.
On Thu, Jul 26, 2012 at 7:42 AM, Robert Evans wrote:
> About the only time that
> MultiThreaded mapper makes a lot of sense is if there is a lot of
> computation associated with each key/value pair.
Or if the mapper does a lot of i/o to some external resource, e.g., a
web crawler.
Doug
In general, multithreading does not get you much in traditional Map/Reduce.
If you want the mappers to run faster you can drop the split size and get
a similar result, because you get more parallelism. This is the use case
that we have typically concentrated on. About the only time that
MultiThread