My team has been using DISK_ONLY. The challenge with this approach is
knowing when to unpersist if your job creates a lot of intermediate data.
The "right solution" would be to mark a transient RDD as being capable of
spilling to disk, rather than having to persist it to force this behavior.
Hopefully that will be added at some point, now that Iterable is available
in the PairRDDFunctions API.
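
For reference, a minimal sketch of the pattern we use today (the RDD names
and paths are illustrative, not our actual job):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // PairRDDFunctions implicits (needed pre-1.3)
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("disk-only-sketch"))

    // Persist the intermediate pair RDD to disk only, so the large
    // grouped data never competes for executor heap.
    val pairs = sc.textFile("hdfs:///input")
      .map(line => (line.split("\t")(0), line))
      .persist(StorageLevel.DISK_ONLY)

    pairs.countByKey()                                  // first action reusing it
    pairs.groupByKey().saveAsTextFile("hdfs:///output") // second action reusing it

    // The hard part: knowing this is the point where it is finally
    // safe to release the on-disk blocks.
    pairs.unpersist()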

The other thing that mattered for us was setting executor memory to the
right level, since some intermediate buffers can apparently be quite large.

We are currently using 16 GB for spark.executor.memory and 18 GB for
SPARK_WORKER_MEMORY. Parallelism (spark.default.parallelism) also seems to
have an impact, though we are still working on tuning it.

We are using 16 executors/workers.

Our test input size is about 10 GB, but we generate up to 500 GB of
intermediate and final data in total.

We have gotten past our memory issues, but we are now facing a
communication timeout in some long-tail tasks, so that's something to
watch out for.
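
If you hit the same thing, the knob we are looking at first is the node
communication timeout from the standard configuration page (the value below
is just what we are experimenting with, not a recommendation):

    spark.akka.timeout   300   # node-to-node communication timeout, in seconds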

If you come up with anything else, please let us know. :-)

-Suren


SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io



On Tue, Jun 10, 2014 at 9:42 PM, Allen Chang <allenc...@yahoo.com> wrote:

> Thanks for the clarification.
>
> What is the proper way to configure RDDs when your aggregate data size
> exceeds your available working memory size? In particular, in addition to
> typical operations, I'm performing cogroups, joins, and coalesces/shuffles.
>
> I see that the default storage level for RDDs is MEMORY_ONLY. Do I just
> need to set the storage level for all of my RDDs to something like
> MEMORY_AND_DISK? Do I need to do anything else to get graceful behavior in
> the presence of coalesces/shuffles, cogroups, and joins?
>
> Thanks,
> Allen
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-on-Data-size-larger-than-Memory-size-tp6589p7364.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


