If a user's reduce function modifies the keyed fields of a grouped DataSet during the combine phase, then the subsequent reduce will receive incorrect groupings. For example, consider this contrived modification to word count:
  public WC reduce(WC in1, WC in2) {
      return new WC(in1.word + " " + in2.word, in1.count + in2.count);
  }

The combiner here emits elements whose word field no longer matches the key of the group they came from, so the downstream grouping is wrong. I don't see an efficient means to prevent this. Is this limitation worth documenting, or can we safely assume that no one will ever attempt this? MapReduce has the same limitation; Spark avoids it by separating keys from values and presenting only the values to the reduce function.

For reference, the documentation describes the transformation as follows:

"Reduce on Grouped DataSet: A Reduce transformation that is applied on a grouped DataSet reduces each group to a single element using a user-defined reduce function. For each group of input elements, a reduce function successively combines pairs of elements into one element until only a single element for each group remains."
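For contrast, a key-preserving reduce avoids the problem. This is only a minimal sketch: it assumes WC is a plain POJO whose word field is the grouping key (field names taken from the example above), and the class name SumCounts is made up for illustration:

  import org.apache.flink.api.common.functions.ReduceFunction;

  // Assumed POJO shape, matching the fields used in the example above.
  public class WC {
      public String word;  // grouping key
      public int count;

      public WC() {}

      public WC(String word, int count) {
          this.word = word;
          this.count = count;
      }
  }

  // Safe variant: the grouping key (word) passes through unchanged, so
  // elements emitted by the combine phase still belong to their group.
  public class SumCounts implements ReduceFunction<WC> {
      @Override
      public WC reduce(WC in1, WC in2) {
          return new WC(in1.word, in1.count + in2.count);
      }
  }

In effect this hand-enforces the key/value separation that Spark provides: only the count is combined, so the key cannot drift between the combine and reduce phases.

Greg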