Hadoop streaming performance: elements vs. vectors

Peter Skomoroch Sat, 28 Mar 2009 01:51:48 -0700

Hadoop streaming question: If I am forming a matrix M by summing a number of
elements generated on different mappers, is it better to emit many lines from
the mappers, one small key/value pair per element, or should I group the
elements into row vectors before sending them to the reducers?


For example, say I'm summing frequency count matrices M for each user on a
different map task, and the reducer combines the resulting sparse user count
matrices for use in another calculation.

Should I emit the individual elements:

i (j, Mij) \n
3 (1, 3.4) \n
3 (2, 3.4) \n
3 (3, 3.4) \n
4 (1, 2.3) \n
4 (2, 5.2) \n

Or posting-list-style row vectors?

3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
4 ((1, 2.3), (2, 5.2)) \n
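
To make the two options concrete, here is a minimal Python sketch of a mapper's output step under each scheme. The function names and the `counts` dictionary are hypothetical, not from any existing job. It also assumes a tab between key and value, since Hadoop streaming by default treats the text up to the first tab as the key; the examples above use a space for readability.

```python
def emit_elements(matrix):
    """Yield one 'i<TAB>(j, Mij)' line per stored element."""
    for i in sorted(matrix):
        for j in sorted(matrix[i]):
            yield f"{i}\t({j}, {matrix[i][j]})"


def emit_vectors(matrix):
    """Yield one 'i<TAB>((j, Mij), ...)' line per row."""
    for i in sorted(matrix):
        pairs = ", ".join(f"({j}, {matrix[i][j]})" for j in sorted(matrix[i]))
        yield f"{i}\t({pairs})"


# Hypothetical sparse counts accumulated inside one map task: {row: {col: value}}.
counts = {3: {1: 3.4, 2: 3.4, 3: 3.4}, 4: {1: 2.3, 2: 5.2}}

element_lines = list(emit_elements(counts))  # one line per element
vector_lines = list(emit_vectors(counts))    # one line per row
```

In a real streaming mapper each line would be written to standard output. The element form writes the row key once per element, the vector form once per row.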

Using vectors will at least save some message space, but are there any other
benefits to this approach in terms of Hadoop streaming overhead (sorting,
etc.)?  I don't think buffering will be a big concern, since the length of
the vectors has a reasonable upper bound and they will be in a sparse
format.
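
On the reducer side, either format can be summed with the same code, so the choice mostly affects what travels between mappers and reducers. The sketch below is one possible way to do it, not a tuned implementation. It assumes tab-separated keys and relies on the reducer receiving its input sorted by key, so lines for the same row arrive together. The names `PAIR` and `sum_rows` are hypothetical.

```python
import re
from itertools import groupby

# Matches "(j, value)" whether it stands alone or sits inside a vector.
PAIR = re.compile(r"\(\s*(\d+)\s*,\s*([^()\s,]+)\s*\)")


def split_line(line):
    """Split 'key<TAB>value' into (key, value)."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def sum_rows(lines):
    """Yield (i, {j: total}) for each run of consecutive lines with key i."""
    parsed = (split_line(line) for line in lines if line.strip())
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        row = {}
        for _, value in group:
            for j, v in PAIR.findall(value):
                row[int(j)] = row.get(int(j), 0.0) + float(v)
        yield int(key), row


# Mixed input: an element line and vector lines for the same row.
sample = ["3\t(1, 3.4)", "3\t((1, 1.0), (2, 2.0))", "4\t((1, 2.3))"]
totals = dict(sum_rows(sample))
```

In a streaming job, `sum_rows(sys.stdin)` would replace the sample list, and each summed row would be printed as it is produced. Only one row is held in memory at a time, which fits the bounded, sparse vectors described above.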


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
