I am struggling to reproduce the functionality of a Hadoop reducer on
Spark (in Java).

In Hadoop I have a function
public void doReduce(K key, Iterator<V> values)
and Hadoop also supplies a consumer (context.write), which can be seen as
consume(key, value)
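
For concreteness, this is roughly the shape I have on the Hadoop side
(simplified; Text stands in for my real key/value types, and doReduce is
my own wrapper):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Simplified picture of the Hadoop side; Text stands in for my real types.
public class MyHadoopReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        doReduce(key, values.iterator(), context);
    }

    // The key matters, and each call may emit zero, one, or many pairs.
    private void doReduce(Text key, Iterator<Text> values, Context context)
            throws IOException, InterruptedException {
        while (values.hasNext()) {
            Text value = values.next();
            // ... arbitrary logic deciding what, if anything, to emit ...
            context.write(key, value); // the "consumer": consume(key, value)
        }
    }
}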

In my code:
1) the function needs to know the key;
2) the output is neither one Tuple2 per key nor one Tuple2 per value;
3) the number of values per key may be large enough that holding them in
memory is impractical;
4) keys must be processed in sorted order.
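
Putting those constraints together, the contract I am trying to reproduce
looks roughly like this (the names are mine; this is not an existing Spark
interface):

import java.util.Iterator;
import java.util.function.BiConsumer;

// Hypothetical contract: keys arrive in sorted order, values is a one-pass
// iterator that may be too large to materialise, and consume may be invoked
// any number of times per key.
public interface StreamingReducer<K, V, K2, V2> {
    void doReduce(K key, Iterator<V> values, BiConsumer<K2, V2> consume);
}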

One good example would run through a large document, using a similarity
function to compare each line against the previous 200 lines and output
any of those with a similarity above 0.3 (please do not suggest emitting
everything and filtering; the real problem is more complex). The critical
concern is the unpredictable number of output tuples per key.
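
A rough sketch of that example, assuming the key is the line number, the
value is the line text, and similarity() stands in for my real function:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.function.BiConsumer;

// Compares each new line against the previous 200 lines and emits every
// previous line whose similarity exceeds 0.3, so the number of outputs per
// key is unknown. This only works if keys (line numbers) arrive in sorted order.
public class SimilarityExample {

    private static final int WINDOW = 200;
    private static final double THRESHOLD = 0.3;

    private final Deque<String> lastLines = new ArrayDeque<>();

    public void doReduce(Long lineNumber, Iterator<String> values,
                         BiConsumer<Long, String> consume) {
        while (values.hasNext()) {
            String line = values.next();
            for (String previous : lastLines) {
                if (similarity(previous, line) > THRESHOLD) {
                    consume.accept(lineNumber, previous);
                }
            }
            lastLines.addLast(line);
            if (lastLines.size() > WINDOW) {
                lastLines.removeFirst();
            }
        }
    }

    private double similarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0; // stand-in for my real similarity function
    }
}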

My questions:
1) How can this be done? Ideally the consumer would feed a JavaPairRDD, but
I don't see how to create one and then add items to it later.
2) How do I handle the whole partition-sort-process pipeline (where the
processing involves calls to doReduce)?
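
For context, the closest shape I can come up with is sketched below
(assuming Spark 2.x, where the function passed to mapPartitionsToPair
returns an Iterator; Long/String are placeholder types), but I am not sure
it is right, and the partition-walking part is exactly what I cannot see
how to write:

import java.util.Iterator;

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

public class ReduceLikeHadoop {

    // Partition by key, sort within each partition, then stream through each
    // partition once, calling doReduce and emitting results as a lazy iterator
    // instead of "adding to" a JavaPairRDD.
    public static JavaPairRDD<Long, String> reduceLike(JavaPairRDD<Long, String> input,
                                                       int numPartitions) {
        return input
                .repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
                .mapPartitionsToPair(ReduceLikeHadoop::processPartition);
    }

    // Should group consecutive pairs with the same key into (key, values) runs,
    // call doReduce on each run, and expose the emitted pairs lazily so that
    // neither the values of a key nor the outputs ever have to fit in memory.
    private static Iterator<Tuple2<Long, String>> processPartition(
            Iterator<Tuple2<Long, String>> sortedPairs) {
        throw new UnsupportedOperationException("this is the part I do not know how to write");
    }
}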
