Hadoop streaming question: If I am forming a matrix M by summing elements generated on different mappers, is it better to emit tons of lines from the mappers, one small key/value pair per element, or should I group the elements into row vectors before sending them to the reducers?
For example, say I'm summing frequency count matrices M for each user on a different map task, and the reducer combines the resulting sparse user count matrices for use in another calculation.

Should I emit the individual elements:

    i (j, Mij)
    3 (1, 3.4)
    3 (2, 3.4)
    3 (3, 3.4)
    4 (1, 2.3)
    4 (2, 5.2)

Or posting-list-style vectors?

    3 ((1, 3.4), (2, 3.4), (3, 3.4))
    4 ((1, 2.3), (2, 5.2))

Using vectors will at least save some message space, but are there any other benefits to this approach in terms of Hadoop streaming overhead (sorts etc.)? I think buffering issues will not be a huge concern, since the length of the vectors has a reasonable upper bound and they will be in a sparse format...

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
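P.S. To make the two options concrete, here is a rough Python sketch of how a streaming mapper might serialize each format, and how a reducer might sum the row-vector form. It relies on Hadoop streaming's default tab-separated key/value lines. The function names and the "j:value" encoding are my own assumptions for illustration, not anything Hadoop requires, and the sketch says nothing about which format is faster.

```python
from collections import defaultdict


def format_elements(counts):
    """Per-element style: one line per nonzero entry, 'i<TAB>j,value'."""
    for (i, j), v in sorted(counts.items()):
        yield f"{i}\t{j},{v}"


def format_rows(counts):
    """Posting-list style: one line per row, 'i<TAB>j1:v1 j2:v2 ...'.

    The 'j:v' encoding is an assumed serialization, chosen only for the sketch.
    """
    rows = defaultdict(list)
    for (i, j), v in sorted(counts.items()):
        rows[i].append(f"{j}:{v}")
    for i in sorted(rows):
        yield f"{i}\t{' '.join(rows[i])}"


def sum_rows(lines):
    """Reducer side: sum row-vector lines into a {(i, j): value} dict."""
    total = defaultdict(float)
    for line in lines:
        key, _, payload = line.rstrip("\n").partition("\t")
        for item in payload.split():
            j, v = item.split(":")
            total[(int(key), int(j))] += float(v)
    return dict(total)


if __name__ == "__main__":
    # Counts taken from the example in the post, for a single user.
    counts = {(3, 1): 3.4, (3, 2): 3.4, (3, 3): 3.4, (4, 1): 2.3, (4, 2): 5.2}
    for line in format_rows(counts):
        print(line)
```

In a real job the mapper would write these lines to stdout and the reducer would call something like sum_rows on sys.stdin; because the reducer input arrives sorted by key, it could also flush each row as soon as the key changes instead of holding the whole matrix in memory.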