Re: Combiner run specification and questions

Chris Douglas Tue, 06 Jan 2009 15:34:08 -0800

I agree with the requirement that the key does not change. Ofcourse, the values can change.

Yes. Key attributes not relevant to ordering are also mutable. If yourkey is a (t1, t2) tuple ordered by t1, then t2 can change withoutaffecting the expected semantics.

I am primarily worried that the combiner might not be run at all - Ihave 'successfully' integrated Hadoop and R i.e the user can providemap/reduce functions written in R. However, R is not great withmemory management, and if I have N (N is huge) values for a givenkey K, then R will baulk when it comes to processing this.

That sounds really cool. The cases where the combiner is not run arecurrently obscure and future exceptions are unlikely to affect thisparticular case. For example, it doesn't make sense to run thecombiner for a single record (i.e. there is nothing to combine), for apartition with no common keys, etc. So as long as the correctness ofthe computation doesn't rely on a transformation performed in thecombiner, it should be OK. In general, it shouldn't be- and in thefuture may not be- run where there is little or no compression for itto effect, which doesn't exacerbate this case.

However, this restriction limits the scalability of your solution. Itmight be necessary to work around R's limitations by breaking up largecomputations into intermediate steps, possibly by explicitlyinstantiating and running the combiner in the reduce.

1) I am guaranteed a reducer.
So,
The combiner, if defined, will run zero or more times on recordsemitted from the map, before being fed to the reduce.
This zero case possibility worries me. However you mention, that itoccurs
collector spills in the map
I have noticed this happening - what 'spilling' mean?

Records emitted from the map are serialized into a buffer, which isperiodically written to disk when it is (sufficiently) full. Each ofthese batch writes is a "spill". In casual usage, it refers to anytime when records need to be written to disk. The merge ofintermediate files into the final map output and merging in-memorysegments to disk in the reduce are two examples. -C

Thank you
Saptarshi


On Jan 5, 2009, at 10:22 PM, Chris Douglas wrote:
The combiner, if defined, will run zero or more times on recordsemitted from the map, before being fed to the reduce. It is runwhen the collector spills in the map and in some merge cases. Ifthe combiner transforms the key, it is illegal to change its type,the partition to which it is assigned, or its ordering.
For example, if you emit a record (k,v) from your map and (k',v)from the combiner, your comparator is C(K,K) and your partitionerfunction is P(K), it must be the case that P(k) == P(k') andC(k,k') == 0. If either of these does not hold, the semantics tothe reduce are broken. Clearly, if k is not transformed (as in truefor most combiners), this holds trivially.
As was mentioned earlier, the purpose of the combiner is tocompress data pulled across the network and spilled to disk. Itshould not affect the correctness or, in most cases, the output ofthe job. -C
On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote:
Hello,
I would just like to confirm, when does the Combiner run(since it
might not be run at all,see below). I read somewhere that it is run,
if there is at least one reduce (which in my case i can be sure of).
I also read, that the combiner is an optimization. However, it isalsoa chance for a function to transform the key/value (keeping theclass
the same i.e the combiner semantics are not changed) and deal with a
smaller set ( this could be done in the reducer but the number of
values for a key might be relatively large).
However, I guess it would be a mistake for reducer to expect itsinput
coming from a combiner? E.g if there are only 10 value corresponding
to a key(as outputted by the mapper), will these 10 values gostraight
to the reducer or to the reducer via the combiner?

Here I am assuming my reduce operations does not need all the values
for a key to work(so that a combiner can be used) i.e additive
operations.

Thank you
Saptarshi
On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley<[email protected]> wrote:
The Combiner may be called 0, 1, or many times on each keybetween themapper and reducer. Combiners are just an application specificoptimizationthat compress the intermediate output. They should not have sideeffects ortransform the types. Unfortunately, since there isn't a separateinterfacefor Combiners, there is isn't a great place to document thisrequirement.
I've just filed HADOOP-4668 to improve the documentation.
--
Saptarshi Guha - [email protected]
Saptarshi Guha | [email protected] | http://www.stat.purdue.edu/~sguha
The way of the world is to praise dead saints and prosecute live ones.
                -- Nathaniel Howe

Re: Combiner run specification and questions

Reply via email to