The storage fraction only limits the amount of memory used for caching/storage; it doesn't limit anything else. In other words, collect() can use all of the driver's heap if it needs to.
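So if collect() is blowing up the driver, the fix is simply to give the driver more heap. A minimal sketch of the usual options on Spark 1.1.x (the 8g value and the train.py script name here are just illustrative):

    # raise the driver heap via the env var mentioned below
    SPARK_DRIVER_MEMORY=8g bin/pyspark

    # or, when submitting a script, via spark-submit
    bin/spark-submit --driver-memory 8g train.py

Either way the extra memory goes to the plain JVM heap on the driver, which is where collected results land; spark.storage.memoryFraction doesn't need to change.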
On Sunday, September 28, 2014, Brad Miller <bmill...@eecs.berkeley.edu> wrote:

> Hi All,
>
> I am interested in collect()ing a large RDD so that I can run a learning
> algorithm on it. I've noticed that when I don't increase
> SPARK_DRIVER_MEMORY I can run out of memory. I've also noticed that it
> looks like the same fraction of memory is reserved for storage on the
> driver as on the worker nodes, and that the web UI doesn't show any storage
> usage on the driver. Since that memory is reserved for storage, it seems
> possible that it is not being used toward the collection of my RDD.
>
> Is there a way to configure the memory management
> (spark.storage.memoryFraction, spark.shuffle.memoryFraction) for the
> driver separately from the workers?
>
> Is there any reason to leave space for shuffle or storage on the driver?
> It seems like I never see either of these used on the web UI, although I
> may not be interpreting the UI correctly or my jobs may not trigger the use
> case.
>
> For context, I am using PySpark (so much of my processing happens outside
> of the allocated memory in Java) and running the Spark 1.1.0 release
> binaries.
>
> best,
> -Brad