It might not be exactly the same as a real Hadoop reducer, but I think it would accomplish the same thing. Take a look at:
import org.apache.spark.SparkContext._

// val rdd: RDD[(K, V)]
// def zero(value: V): S
// def reduce(agg: S, value: V): S
// def merge(agg1: S, agg2: S): S

val reducedUnsorted: RDD[(K, S)] = rdd.combineByKey[S](zero, reduce, merge)
reducedUnsorted.sortByKey()

On Fri, Sep 19, 2014 at 1:37 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> I am struggling to reproduce the functionality of a Hadoop reducer on
> Spark (in Java).
>
> In Hadoop I have a function
>   public void doReduce(K key, Iterator<V> values)
> In Hadoop there is also a consumer (context.write) which can be seen as
>   consume(key, value)
>
> In my code
> 1) knowing the key is important to the function
> 2) there is neither one output tuple2 per key nor one output tuple2 per
>    value
> 3) the number of values per key might be large enough that storing them
>    in memory is impractical
> 4) keys must appear in sorted order
>
> One good example would be to run through a large document using a
> similarity function to compare each line against the last 200 lines and
> output any of those with a similarity of more than 0.3 (do not suggest
> output all and filter - the real problem is more complex). The critical
> concern is an uncertain number of tuples per key.
>
> My questions:
> 1) How can this be done? Ideally a consumer would be a JavaPairRDD, but
>    I don't see how to create one and add items later.
>
> 2) How do I handle the entire partition, sort, and process (involving
>    calls to doReduce) sequence?
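To make the sketch above concrete, here is a rough, untested example in Scala. The types, the 0.3 threshold, and the ReducerSketch object name are just illustrative stand-ins for your real problem, not anything from Spark itself. The accumulator keeps only the values that will eventually be emitted, so memory per key is bounded by the size of the output rather than by the number of input values; sortByKey gives you keys in sorted order; and the flatMap has the key in scope and can emit zero or more pairs per key, which is roughly the shape of doReduce plus context.write:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object ReducerSketch {
  // Illustrative types only: keys are strings, values are doubles,
  // and the per-key state is just the list of values worth keeping.
  type K = String
  type V = Double
  type S = List[V]
  val threshold = 0.3

  // Start a per-key aggregate from the first value seen for that key.
  def zero(value: V): S = if (value > threshold) List(value) else Nil
  // Fold one more value into an existing per-key aggregate.
  def reduce(agg: S, value: V): S = if (value > threshold) value :: agg else agg
  // Combine two partial aggregates for the same key (from different partitions).
  def merge(a: S, b: S): S = a ::: b

  def run(rdd: RDD[(K, V)]): RDD[(K, V)] = {
    // Aggregate per key without materializing every value for a key,
    // then sort by key and emit zero or more output pairs per key.
    val reduced: RDD[(K, S)] = rdd.combineByKey(zero, reduce, merge)
    reduced.sortByKey().flatMap { case (key, kept) =>
      kept.map(v => (key, v))
    }
  }
}

If you really need the "last 200 lines" behavior, the accumulator can carry a bounded window of recent lines instead of a flat list, but be aware that combineByKey does not guarantee the order in which values reach reduce and merge, so you would have to carry a line number (or other position) in the value and order the window yourself.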