If a user modifies the keyed fields of a grouped reduce during a combine,
then the reduce will receive incorrect groupings: the combiner's output is
re-partitioned by key, so records whose keys were altered land in the wrong
groups. For example, a useless but illustrative modification to word count:

  @Override
  public WC reduce(WC in1, WC in2) {
    // Concatenating the words mutates the grouping key mid-combine.
    return new WC(in1.word + " " + in2.word, in1.count + in2.count);
  }

I don't see an efficient means to prevent this. Is this limitation worth
documenting, or can we safely assume that no one will ever attempt this?
MapReduce has the same limitation; Spark avoids it by separating keys from
values and presenting only the values to the reduce function.
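
For comparison, a minimal sketch of the Spark-style API in Java, assuming
a hypothetical pairs variable of type JavaPairRDD<String, Integer> holding
(word, count) tuples; the user function sees only the values, so the key
cannot be corrupted:

  // reduceByKey passes only the values to the user function;
  // the grouping key is managed by the framework and is untouchable.
  JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);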

"Reduce on Grouped DataSet: A Reduce transformation that is applied on a
grouped DataSet reduces each group to a single element using a user-defined
reduce function. For each group of input elements, a reduce function
successively combines pairs of elements into one element until only a
single element for each group remains."
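
In other words, a conforming reduce function must leave the key field
untouched. A minimal sketch against the DataSet API, assuming WC is a
POJO with public word and count fields and words is a DataSet<WC>:

  DataSet<WC> counts = words
      .groupBy("word")
      .reduce(new ReduceFunction<WC>() {
        @Override
        public WC reduce(WC in1, WC in2) {
          // Preserve the grouping key; combine only the counts.
          return new WC(in1.word, in1.count + in2.count);
        }
      });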

Greg
