On Dec 15, 2008, at 9:06 AM, Ricky Ho wrote:
Choice 1: Emit one entry in the reduce(), using doc_name as key
==================================================================
In this case, the output will still be a single entry per invocation
of reduce() ...
{
"key":"file1",
"value":[["apple", 25],
["orange", 16]]
}
In general, you don't want to gather the inputs in memory, because
that limits the scalability of your application to documents whose
word counts fit in memory. For word counts it isn't a problem, but
I've seen applications that take this approach end up with single
records as large as 500 MB.
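To make the memory concern concrete, here is a hypothetical, simplified reducer for Choice 1 (plain Python, not the actual Hadoop API): all of the (word, count) pairs for one document are buffered into a list before the single output record is emitted, so the whole group must fit in memory at once.

```python
def reduce_choice1(doc_name, values):
    # Buffers every (word, count) pair for this document; the list
    # grows with the number of distinct words, which is what limits
    # scalability when a single record gets very large.
    counts = []
    for word, count in values:
        counts.append([word, count])
    # One output record per reduce() invocation, keyed by doc_name.
    return {"key": doc_name, "value": counts}
```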
Choice 2: Emit multiple entries in the reduce(), using [doc_name,
word] as key
======================================================================
In this case, the output will have multiple entries per invocation
of reduce() ...
{
"key":["file1", "apple"],
"value":25
}
{
"key":["file1", "orange"],
"value":16
}
This looks pretty reasonable to me. Especially if the partitioning is
on both the filename and word, so that the load across the reduces
is relatively even.
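A rough sketch of Choice 2 (again plain Python, not the Hadoop API): each record is emitted as soon as it is produced, so nothing accumulates in memory, and a partition function over the full [doc_name, word] key spreads one large document's words across many reduces.

```python
def reduce_choice2(doc_name, values):
    # Emit one record per (doc_name, word) pair; a generator keeps
    # memory use constant regardless of how many words a document has.
    for word, count in values:
        yield {"key": [doc_name, word], "value": count}

def partition(doc_name, word, num_reduces):
    # Hashing the composite key (hypothetical partitioner) means a
    # single huge document does not land on a single reduce.
    return hash((doc_name, word)) % num_reduces
```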
Choice 3: Emit one entry in the reduce(), using null key
============================================================
In this case, the output will be a single entry per invocation of
reduce() ...
{
"key":null,
"value":[["file1", "apple", 25],
["file1", "orange", 16]]
}
I don't see any advantage to this one. It has all of the memory-
limiting problems of option 1 and will, of course, do bad things if
the downstream user isn't expecting null keys.
-- Owen