Hey all!

I have an iterative problem. I'm trying to find something similar to
Hadoop's MultipleOutputs [1] in Spark 1.0. I need to build up a couple of
large dense vectors (they may contain billions of elements; 2 billion doubles
is at least 16 GB) by adding partial vector chunks to them. This is easy to
do in Hadoop by writing to two MultipleOutputs in the reducer; the reducer
also writes some other outputs, and I have multiple reducers running in
parallel.
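
For reference, here's roughly what the Hadoop reducer looks like (a
simplified sketch in Scala; the class name, the named outputs
"vectorA"/"vectorB" and the comma-separated chunk encoding are made up for
illustration):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Reducer
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs
import scala.collection.JavaConverters._

class ChunkSumReducer extends Reducer[IntWritable, Text, IntWritable, Text] {
  private type Ctx = Reducer[IntWritable, Text, IntWritable, Text]#Context

  private var mos: MultipleOutputs[IntWritable, Text] = _

  override def setup(ctx: Ctx): Unit = {
    mos = new MultipleOutputs[IntWritable, Text](ctx)
  }

  override def reduce(key: IntWritable, values: java.lang.Iterable[Text],
                      ctx: Ctx): Unit = {
    // element-wise sum of all partial chunks received for this chunk index
    val summed = values.asScala
      .map(_.toString.split(',').map(_.toDouble))
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    val out = new Text(summed.mkString(","))
    mos.write("vectorA", key, out)  // named outputs registered in the driver
    mos.write("vectorB", key, out)  // via MultipleOutputs.addNamedOutput
    ctx.write(key, out)             // the reducer's regular output (simplified)
  }

  override def cleanup(ctx: Ctx): Unit = mos.close()
}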

Without MultipleOutputs, the only option I seem to be left with in Spark is
to break my job into 2-3 jobs and pay a performance penalty. Or could I use
Accumulable [2] for this purpose? I think not, because even if I define a
custom Accumulable that does what I want, (a) I wouldn't be able to use its
result as an RDD in the next job/iteration directly (the way I can reuse the
output partitions with Hadoop in another job), and (b) I wouldn't be able to
retrieve the dense vector incrementally, so it would be bound by the driver
node's memory.
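
For concreteness, the Accumulable route I was picturing would be something
like this (hypothetical and untested; DenseVectorParam and the names in the
usage comment are placeholders I made up):

import org.apache.spark.AccumulableParam

// Accumulates (index, value) updates into a dense Array[Double]. My worry
// (b) shows up here: the whole vector exists as one array per executor and,
// once acc.value is read, on the driver, and it never becomes an RDD.
object DenseVectorParam extends AccumulableParam[Array[Double], (Int, Double)] {

  // fold a single (index, value) update into an executor-local partial vector
  def addAccumulator(vec: Array[Double], update: (Int, Double)): Array[Double] = {
    vec(update._1) += update._2
    vec
  }

  // merge two partial vectors element-wise when partial results are combined
  def addInPlace(v1: Array[Double], v2: Array[Double]): Array[Double] = {
    var i = 0
    while (i < v1.length) { v1(i) += v2(i); i += 1 }
    v1
  }

  // fresh zero vector of the same length as the initial value
  def zero(initial: Array[Double]): Array[Double] = new Array[Double](initial.length)
}

// usage I'd have in mind (sc, size and updates are placeholders):
//   val acc = sc.accumulable(new Array[Double](size))(DenseVectorParam)
//   updates.foreach { case (i, x) => acc += ((i, x)) }
//   val vector = acc.value  // materialises entirely in driver memory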

Any ideas on how I can make this work for me?

Cheers,
Nilesh


[1]: http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
[2]: http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.Accumulable


