I am struggling to reproduce the functionality of a Hadoop reducer in Spark (using Java).
In Hadoop I have a reducer method public void doReduce(K key, Iterator<V> values). Hadoop also provides a consumer (context.write), which can be thought of as consume(key, value).

In my code:

1) knowing the key is important to the function;
2) there is neither one output Tuple2 per key nor one output Tuple2 per value;
3) the number of values per key may be large enough that storing them all in memory is impractical;
4) keys must reach the function in sorted order.

A good example would run through a large document, use a similarity function to compare each line against the last 200 lines, and output any of those with a similarity greater than 0.3 (please don't suggest "output everything and filter" - the real problem is more complex; a minimal sketch of this per-key logic is at the end of the question). The critical concern is the uncertain number of output tuples per key.

My questions:

1) How can this be done? Ideally the consumer would be a JavaPairRDD, but I don't see how to create one and add items to it later.
2) How do I handle the whole partition / sort / process pipeline, where the "process" step involves calls to doReduce?
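To make the example concrete, here is a minimal sketch of the per-key logic I am trying to port. The class name, the String types, and the bodies of similarity and consume are stand-ins; only the shape of doReduce, the 200-line window, and the 0.3 threshold reflect the real (more complex) code:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

public class SimilarityReducer {

    // Placeholder for the real similarity function.
    private double similarity(String a, String b) {
        return 0.0;
    }

    // Placeholder for Hadoop's context.write(key, value), i.e. consume(key, value).
    private void consume(String key, String value) {
        // no-op in this sketch
    }

    // Roughly what the current Hadoop reducer does, simplified to Strings.
    public void doReduce(String key, Iterator<String> values) {
        Deque<String> window = new ArrayDeque<>(); // the last 200 values seen for this key
        while (values.hasNext()) {
            String current = values.next();
            for (String previous : window) {
                if (similarity(previous, current) > 0.3) {
                    consume(key, current); // 0..n outputs per incoming value
                }
            }
            window.addLast(current);
            if (window.size() > 200) {
                window.removeFirst(); // bounded window keeps per-key memory small
            }
        }
    }
}
```

The window holds at most 200 previous values, so memory per key is bounded; what I can't see is how to express the consume(key, value) side in Spark, where the number of tuples emitted per key is not known up front.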