Hey all,

I'm working on an iterative problem, and I'm trying to find something similar to Hadoop's MultipleOutputs [1] in Spark 1.0. I need to build up a couple of large dense vectors (potentially billions of elements; 2 billion doubles => at least 16 GB) by adding partial vector chunks to them. This is easy in Hadoop: the reducer writes the two vectors through two MultipleOutputs while also writing its other, regular outputs, and I have multiple reducers running in parallel.
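To make the shape of the computation concrete, here's a rough, untested sketch of the vector-building step as I'd like to express it in Spark. The names and chunk layout are made up; the idea is just that partial chunks keyed by chunk index get summed element-wise, and the assembled vector stays distributed as an RDD:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // brings in reduceByKey for pair RDDs in 1.0
import org.apache.spark.rdd.RDD

// Hypothetical sketch -- names and chunk layout are made up. Each task emits
// (chunkIndex, partialChunk) pairs; reduceByKey sums the partial chunks
// element-wise, so the assembled dense vector stays distributed as an RDD
// that the next job/iteration can read directly.
object DenseVectorByChunks {

  def sumChunks(partials: RDD[(Long, Array[Double])]): RDD[(Long, Array[Double])] =
    partials.reduceByKey { (a, b) =>
      val out = new Array[Double](a.length)
      var i = 0
      while (i < out.length) { out(i) = a(i) + b(i); i += 1 }
      out
    }

  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("dense-vector-chunks"))
    // toy input: two tasks both contribute a partial version of chunk 0
    val partials = sc.parallelize(Seq(
      (0L, Array(1.0, 2.0, 3.0)),
      (0L, Array(0.5, 0.5, 0.5))
    ))
    val vector = sumChunks(partials)   // still an RDD, never collected to the driver
    vector.collect().foreach { case (i, arr) => println(i + " -> " + arr.mkString(",")) }
    sc.stop()
  }
}

That part looks fine on its own; what I can't see is how to also write the reducer's other outputs in the same pass over the data, the way MultipleOutputs lets me.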
Without MultipleOutputs I'd have to break my job into 2-3 jobs and therefore pay a performance penalty, which seems to be the only option I'm left with in Spark.

Or could I use Accumulable [2] for this purpose? I think not, because even if I define a custom Accumulable that does what I want (a rough sketch of what I mean is in the P.S. below), (a) I wouldn't be able to use the result as an RDD in the next job/iteration, the way I can reuse the output partitions in another Hadoop job, and (b) I wouldn't even be able to retrieve the dense vector iteratively, so my vector would be bound by the driver node's memory.

Any ideas on how I can make this work for me?

Cheers,
Nilesh

[1]: http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
[2]: http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.Accumulable
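P.S. For concreteness, this is roughly the kind of custom Accumulable I mean -- an untested sketch where the accumulated value is the whole dense vector as an Array[Double] and tasks add (index, value) contributions. All class and variable names here are made up; the point is to illustrate (b): the merged array only ever materializes on the driver, via acc.value.

import org.apache.spark.{AccumulableParam, SparkConf, SparkContext}

// Hypothetical sketch -- all names are made up. The accumulated value is the
// whole dense vector (Array[Double]); tasks add (index, value) contributions.
// The catch: the merged array only ever materializes on the driver, through
// acc.value, which is exactly the memory bound I mentioned in (b).
object DenseVectorAccumulable {

  class DenseVectorParam extends AccumulableParam[Array[Double], (Int, Double)] {
    // add one (index, value) contribution to a task-local copy of the vector
    def addAccumulator(vec: Array[Double], c: (Int, Double)): Array[Double] = {
      vec(c._1) += c._2
      vec
    }
    // merge two partial vectors (the final merge happens on the driver)
    def addInPlace(v1: Array[Double], v2: Array[Double]): Array[Double] = {
      var i = 0
      while (i < v1.length) { v1(i) += v2(i); i += 1 }
      v1
    }
    // fresh zero vector of the same length for each task
    def zero(initial: Array[Double]): Array[Double] = new Array[Double](initial.length)
  }

  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("dense-vector-acc"))
    val acc = sc.accumulable(new Array[Double](10))(new DenseVectorParam)
    sc.parallelize(0 until 100).foreach { i => acc += ((i % 10, 1.0)) }
    println(acc.value.mkString(","))   // only readable here, on the driver
    sc.stop()
  }
}

So even if the accumulation itself works, the final 16+ GB array has to fit in driver memory, and I don't get an RDD out of it for the next iteration.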