On Dec 15, 2008, at 9:06 AM, Ricky Ho wrote:

Choice 1:  Emit one entry in the reduce(), using doc_name as key
==================================================================
In this case, the output will still be a single entry per invocation of reduce() ...

{
 "key":"file1",
 "value":[["apple", 25],
          ["orange", 16]]
}

In general, you don't want to gather the inputs in memory, because that limits the scalability of your application: it only works for files whose word counts fit in memory. For word counts that isn't a problem, but I've seen applications that take this approach end up with individual records as large as 500 MB.
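A minimal Python sketch of the buffering Choice 1 implies (the function name is mine, not any framework's API; real frameworks hand reduce() an iterator over the grouped values):

```python
def reduce_buffered(doc_name, word_counts):
    """Choice 1: collect every (word, count) pair for a document into
    one list before emitting a single record.

    The record grows with the document's vocabulary, so the whole
    value list must fit in the reducer's memory at once.
    """
    value = [[word, count] for word, count in word_counts]
    return {"key": doc_name, "value": value}
```

For example, `reduce_buffered("file1", iter([("apple", 25), ("orange", 16)]))` materializes the entire iterator into one value list before anything is written out.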

Choice 2: Emit multiple entries in the reduce(), using [doc_name, word] as key
==============================================================================
In this case, the output will have multiple entries per invocation of reduce() ...

{
 "key":["file1", "apple"],
 "value":25
}

{
 "key":["file1", "orange"],
 "value":16
}

This looks pretty reasonable to me, especially if the partitioning is on both the filename and the word, so that the load across the reduces is relatively even.
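The even loading depends on the partitioner hashing the whole composite key; a minimal Python sketch of that idea (function name and key encoding are my own, not any framework's API):

```python
import zlib

def partition(key, num_reducers):
    """Map a [doc_name, word] composite key to a reducer index.

    Hashing both components spreads one large document's words across
    all reducers instead of sending them to a single reduce.
    """
    doc_name, word = key
    # zlib.crc32 gives a hash that is stable across runs, unlike
    # Python's built-in str hash, which is salted per process.
    return zlib.crc32(f"{doc_name}\x00{word}".encode()) % num_reducers
```

Partitioning on doc_name alone would send every record for "file1" to the same reduce, recreating the skew problem for large documents.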

Choice 3:  Emit one entry in the reduce(), using null key
============================================================
In this case, the output will be a single entry per invocation of reduce() ...

{
 "key":null,
 "value":[["file1", "apple", 25],
          ["file1", "orange", 16]]
}

I don't see any advantage to this one. It has all of the memory-limiting problems of option 1 and will of course do bad things if the downstream user isn't expecting null keys.

-- Owen
