Hi Josh,
Well, I don't really see how you will get more mappers, just simpler
logic in the mapper. The number of mappers is driven by how many input
files you have and their sizes, not by any chunking you do in the
record reader. Each record reader gets an entire split and feeds it to
its mapper as a stream, one record at a time. You could duplicate some
of that logic in the mapper if you wanted, but you will already have it
in the reader, so why bother?
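As a toy illustration of that division of labor (plain Java rather than the real Hadoop RecordReader/Mapper APIs; the class names and the 10-byte record size are hypothetical, chosen to match the numbers discussed later in this thread):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model: the reader owns the chunking logic over the whole split;
 *  the mapper only ever sees one record at a time, so it stays trivial. */
public class ReaderMapperSketch {
    static final int RECORD_LEN = 10; // hypothetical fixed record size

    /** "Record reader": walks the whole split, one record per step. */
    static List<byte[]> readRecords(byte[] split) {
        List<byte[]> records = new ArrayList<>();
        for (int off = 0; off + RECORD_LEN <= split.length; off += RECORD_LEN) {
            records.add(Arrays.copyOfRange(split, off, off + RECORD_LEN));
        }
        return records;
    }

    /** "Mapper": no chunking here; just per-record emit/don't-emit logic. */
    static int map(byte[] record) {
        return record.length; // placeholder for real per-record work
    }

    public static void main(String[] args) {
        byte[] split = new byte[30]; // three 10-byte records
        System.out.println(readRecords(split).size()); // 3
    }
}
```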
Jeff
Patterson, Josh wrote:
Jeff,
So if I'm hearing you right, it's "good" to send one point of data (10
bytes here) to a single mapper? This mindset increases the number of
mappers, but keeps their logic scaled down to simply "look at this
record and emit/don't emit", which is considered more favorable? I'm
still getting the hang of the MR design tradeoffs; thanks for your
feedback.
Josh Patterson
TVA
-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Tuesday, March 17, 2009 5:11 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic
If you send a single point to the mapper, your mapper logic will be
clean and simple. Otherwise you will need to loop over your block of
points in the mapper. In Mahout clustering, I send the mapper individual
points because the input file is point-per-line. In either case, the
record reader will be iterating over a block of data to provide mapper
inputs. IIRC, splits will generally be an HDFS block or less, so if you
have files smaller than that you will get one mapper per file. For
larger files you can get up to one mapper per split.
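A rough back-of-the-envelope for the split-to-mapper relationship (assuming split size equals the HDFS block size; 64 MB was a common default at the time, but check your cluster's configuration):

```java
/** Estimate map task count when split size == HDFS block size.
 *  Purely illustrative arithmetic, not a Hadoop API. */
public class SplitCount {
    static long numSplits(long fileSizeBytes, long blockSizeBytes) {
        // ceiling division: partial trailing blocks still get a mapper
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;                           // 64 MB
        System.out.println(numSplits(10L * 1024 * 1024, blockSize));  // 1: small file, one mapper
        System.out.println(numSplits(200L * 1024 * 1024, blockSize)); // 4: 200 MB file, four mappers
    }
}
```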
Jeff
Patterson, Josh wrote:
I am currently working on a RecordReader to read a custom time series
data binary file format and was wondering about ways to be most
efficient in designing the InputFormat/RecordReader process. Reading
through:
http://wiki.apache.org/hadoop/HadoopMapReduce
gave me a lot of hints about how the various classes work together in
order to read any type of file. I was looking at how the
TextInputFormat
uses the LineRecordReader in order to send individual lines to each
mapper. My question is: what is a good heuristic for choosing how much
data to send to each mapper? With the stock LineRecordReader, each
mapper only gets to work with a single line, which leads me to believe
that we want to give each mapper very little work. Currently I'm
looking at either sending each mapper a single point of data (10
bytes), which seems small, or sending each mapper a block of data
(around 819 points at 10 bytes each, i.e. 8,190 bytes). I'm leaning
towards sending the block to the mapper.
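For concreteness, the block option might look like the following sketch (plain Java, not the Hadoop API; the sizes come from the numbers above, and the mapper now needs its own inner loop over points):

```java
/** Sketch of the "block per map call" option: the reader hands the
 *  mapper 8,190-byte blocks (819 points x 10 bytes) and the mapper
 *  iterates the points itself. Sizes are the legacy-format figures
 *  from the message above; the decode logic is a placeholder. */
public class BlockOption {
    static final int POINT_LEN = 10;
    static final int POINTS_PER_BLOCK = 819;
    static final int BLOCK_LEN = POINT_LEN * POINTS_PER_BLOCK; // 8190 bytes

    /** Mapper logic with the per-point loop pushed into the mapper. */
    static int mapBlock(byte[] block) {
        int emitted = 0;
        for (int off = 0; off + POINT_LEN <= block.length; off += POINT_LEN) {
            emitted++; // real code would decode the point and emit/skip it
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(mapBlock(new byte[BLOCK_LEN])); // 819
    }
}
```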
These factors are driven by a legacy file format I have to deal with
(for now), so I'm just trying to make the best tradeoff possible for
the short term until I get some basic stuff rolling, at which point I
can suggest a better storage format, or just start converting the
groups of stored points into a format more fitting for the platform. I
understand that the InputFormat is not really trying to make much
meaning out of the data, other than helping to get the correct data out
of the file based on the file split variables. Another question I have
is: with a pretty much stock install, generally how big is each
FileSplit?
Josh Patterson
TVA