On Dec 15, 2008, at 9:06 AM, Ricky Ho wrote:
Choice 1: Emit one entry in the reduce(), using doc_name as key
==================================================================
In this case, the output will still be a single entry per invocation
of reduce() ...
{
"key":"file1",
"value":[["apple", 25],
["orange", 16]]
}
In general, you don't want to gather the inputs in memory, because
that limits the scalability of your application to documents whose
word counts fit in memory. For word counts it isn't a problem, but
I've seen applications that take this approach end up with single
records as large as 500 MB.
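To make the memory concern concrete, here is a hypothetical, simplified reducer for Choice 1 (plain Python, not the actual Hadoop API): all of the (word, count) pairs for one document are buffered into a list before the single output record is emitted, so the whole group must fit in memory at once.

```python
def reduce_choice1(doc_name, values):
    # Buffers every (word, count) pair for this document; the list
    # grows with the number of distinct words, which is what limits
    # scalability when a single record gets very large.
    counts = []
    for word, count in values:
        counts.append([word, count])
    # One output record per reduce() invocation, keyed by doc_name.
    return {"key": doc_name, "value": counts}
```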
Choice 2: Emit multiple entries in the reduce(), using [doc_name,
word] as key
======================================================================
In this case, the output will have multiple entries per invocation
of reduce() ...
{
"key":["file1", "apple"],
"value":25
}
{
"key":["file1", "orange"],
"value":16
}
This looks pretty reasonable to me. Especially if the partitioning is
on both the filename and word, so that the load across the reduces
is relatively even.
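A rough sketch of Choice 2 (again plain Python, not the Hadoop API): each record is emitted as soon as it is produced, so nothing accumulates in memory, and a partition function over the full [doc_name, word] key spreads one large document's words across many reduces.

```python
def reduce_choice2(doc_name, values):
    # Emit one record per (doc_name, word) pair; a generator keeps
    # memory use constant regardless of how many words a document has.
    for word, count in values:
        yield {"key": [doc_name, word], "value": count}

def partition(doc_name, word, num_reduces):
    # Hashing the composite key (hypothetical partitioner) means a
    # single huge document does not land on a single reduce.
    return hash((doc_name, word)) % num_reduces
```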
Choice 3: Emit one entry in the reduce(), using null key
============================================================
In this case, the output will be a single entry per invocation of
reduce() ...
{
"key":null,
"value":[["file1", "apple", 25],
["file1", "orange", 16]]
}
I don't see any advantage to this one. It has all of the memory-
limiting problems of option 1 and will, of course, do bad things if
the downstream user isn't expecting null keys.
-- Owen