My team has been using DISK_ONLY. The challenge with this approach is knowing when to unpersist, especially if your job creates a lot of intermediate data. The "right" solution would be a way to mark a transient RDD as allowed to spill to disk, rather than having to persist it just to force that behavior. Hopefully that will be added at some point, now that Iterable is available in the PairRDDFunctions API.
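For what it's worth, the pattern we end up with looks roughly like this. This is a minimal sketch, not our actual job; the paths and the keying logic are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair RDD functions
    import org.apache.spark.storage.StorageLevel

    def run(sc: SparkContext): Unit = {
      // Persist the intermediate RDD to disk only, so it never competes
      // with anything else for cache memory; partitions are read back
      // from local disk each time the RDD is reused.
      val intermediate = sc.textFile("hdfs://namenode/in")      // placeholder path
        .map(line => (line.split("\t")(0), line))               // illustrative keying
        .persist(StorageLevel.DISK_ONLY)

      // Reuse it in more than one downstream computation without recomputing.
      val countsPerKey = intermediate.countByKey()
      intermediate.values.saveAsTextFile("hdfs://namenode/out") // placeholder path

      // The hard part: remembering to do this once nothing downstream
      // needs the RDD any more, so the disk space is released.
      intermediate.unpersist()
    }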
The other thing that was important for us was setting executor memory to the right level, because some intermediate buffers can be large. We are currently using 16 GB for spark.executor.memory and 18 GB for SPARK_WORKER_MEMORY, with 16 executors/workers. Parallelism (spark.default.parallelism) also seems to have an impact, though we are still tuning it. Our test input is about 10 GB, but we generate up to 500 GB of intermediate and final data in total. A rough sketch of this configuration appears at the end of this message, below the quoted thread.

We have now gotten past our memory issues and are instead facing a communication timeout in some long-tail tasks, so that's something to watch out for.

If you come up with anything else, please let us know. :-)

-Suren

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


On Tue, Jun 10, 2014 at 9:42 PM, Allen Chang <allenc...@yahoo.com> wrote:

> Thanks for the clarification.
>
> What is the proper way to configure RDDs when your aggregate data size
> exceeds your available working memory size? In particular, in addition to
> typical operations, I'm performing cogroups, joins, and coalesces/shuffles.
>
> I see that the default storage level for RDDs is MEMORY_ONLY. Do I just
> need to set the storage level for all of my RDDs to something like
> MEMORY_AND_DISK? Do I need to do anything else to get graceful behavior
> in the presence of coalesces/shuffles, cogroups, and joins?
>
> Thanks,
> Allen
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-on-Data-size-larger-than-Memory-size-tp6589p7364.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
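P.S. Since Allen asked about configuration: here is a minimal sketch of the kind of setup described above, assuming a standalone cluster. The app name, paths, and parallelism value are illustrative, and SPARK_WORKER_MEMORY lives in conf/spark-env.sh rather than in code:

    // In conf/spark-env.sh on each worker: SPARK_WORKER_MEMORY=18g,
    // leaving headroom above the 16 GB executor heap.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("large-intermediate-job")      // illustrative name
      .set("spark.executor.memory", "16g")       // per-executor JVM heap
      .set("spark.default.parallelism", "256")   // illustrative; we are still tuning this
    val sc = new SparkContext(conf)

    // For RDDs feeding cogroups/joins on data larger than memory,
    // MEMORY_AND_DISK caches the partitions that fit and spills the rest
    // to local disk instead of recomputing them or failing with OOM.
    val big = sc.textFile("hdfs://namenode/path/to/input")   // placeholder path
      .persist(StorageLevel.MEMORY_AND_DISK)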