Re: Spark Memory Bounds

2014-05-28 Thread Keith Simmons
Thanks! Sounds like my rough understanding was roughly right :) I definitely understand that cached RDDs can add to the memory requirements. Luckily, as you mentioned, you can configure Spark to spill cached data to disk and bound its total size in memory via spark.storage.memoryFraction, so I have a pretty ...
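
For reference, a minimal sketch of what that combination looks like with the Spark 1.0-era API; the fraction value, input path, and persistence level below are illustrative, not taken from this thread:

    // Illustrative only: cap the cache at 40% of the executor heap and let
    // partitions that don't fit spill to disk instead of being recomputed.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("bounded-cache-sketch")              // hypothetical app name
      .set("spark.storage.memoryFraction", "0.4")      // default is 0.6
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("hdfs:///some/large/input")  // hypothetical path
    rdd.persist(StorageLevel.MEMORY_AND_DISK)          // spill to disk when the cache is full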

Re: Spark Memory Bounds

2014-05-28 Thread Christopher Nguyen
Keith, please see inline.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao
linkedin.com/in/ctnguyen

On Tue, May 27, 2014 at 7:22 PM, Keith Simmons wrote:
> A dash of both. I want to know enough that I can "reason about", rather
> than "strictly control", the amount of memory ...

Re: Spark Memory Bounds

2014-05-27 Thread Keith Simmons
A dash of both. I want to know enough that I can "reason about", rather than "strictly control", the amount of memory Spark will use. If I have a big data set, I want to understand how I can design it so that Spark's memory consumption falls below my available resources. Or alternatively, if it's ...
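
One concrete way to do that kind of reasoning up front is the SizeEstimator utility the tuning guide points to; this is just a sketch, and the record type and counts below are made-up assumptions:

    // Estimate the in-memory size of one deserialized record, then scale it up
    // by how many records a task holds at once. All numbers here are assumptions.
    import org.apache.spark.util.SizeEstimator

    case class Record(id: Long, name: String, score: Double)

    val bytesPerRecord      = SizeEstimator.estimate(Record(1L, "example", 0.5))
    val recordsPerPartition = 1000000L   // assumed partition size
    val coresPerExecutor    = 4L         // assumed: one active task per core

    val approxBytes = bytesPerRecord * recordsPerPartition * coresPerExecutor
    println(s"~${approxBytes / (1024 * 1024)} MB of live record data per executor")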

Re: Spark Memory Bounds

2014-05-27 Thread Christopher Nguyen
Keith, do you mean "bound" as in (a) strictly control to some quantifiable limit, or (b) try to minimize the amount used by each task? If "a", then that is outside the scope of Spark's memory management, which you should think of as an application-level (that is, above-JVM) mechanism. In this scope ...
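
To make the application-level vs. JVM-level distinction concrete: the hard, quantifiable limit is the executor JVM heap itself, and Spark's fraction settings only partition that heap. A hedged sketch, with placeholder values rather than recommendations:

    // The executor JVM heap is the hard bound; the two fractions carve it up.
    // All values below are illustrative.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")           // hard per-executor JVM heap limit
      .set("spark.storage.memoryFraction", "0.5")   // ceiling for cached RDD blocks
      .set("spark.shuffle.memoryFraction", "0.2")   // ceiling for in-memory shuffle buffers
    // Whatever the fractions don't claim is what task code itself gets to allocate from.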

Spark Memory Bounds

2014-05-27 Thread Keith Simmons
I'm trying to determine how to bound my memory use in a job working with more data than can simultaneously fit in RAM. From reading the tuning guide, my impression is that Spark's memory usage is roughly the following: (A) in-memory RDD use + (B) in-memory shuffle use + (C) transient memory used ...
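
A back-of-the-envelope version of that breakdown, using made-up numbers (a 4 GB executor heap and typical fraction settings):

    // (A) + (B) are capped by the two fractions; (C) is roughly what's left over.
    // Numbers are illustrative, not measurements.
    val executorHeapGb  = 4.0
    val storageFraction = 0.6   // (A) in-memory RDD cache ceiling
    val shuffleFraction = 0.2   // (B) in-memory shuffle ceiling
    val transientGb     = executorHeapGb * (1.0 - storageFraction - shuffleFraction)
    println(f"(C) roughly $transientGb%.1f GB left for transient task objects")   // ~0.8 GB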