Short version: Why does coalesce use huge amounts of memory, and how does it work internally?

Long version:
I asked a similar question a few weeks ago, but I now have a simpler test with better numbers. I have an RDD created from some HDFS files. I want to sample it and then coalesce it into fewer partitions. For some reason coalesce uses huge amounts of memory. From what I've read, coalesce does not require full partitions to be held in memory at once, so I don't understand what's causing this. Can anyone explain why coalesce needs so much memory? Are there any rules of thumb for choosing the number of partitions to coalesce into?
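To make the question concrete, here is my mental model of what a no-shuffle coalesce does internally, written as a simplified Scala sketch (my guess at the behavior, not Spark's actual CoalescedRDD code, which also tries to preserve locality): each output partition just reads a contiguous group of parent partitions within a single task.

    // Simplified sketch of how I imagine coalesce(n) assigns parent
    // partitions to output partitions (not Spark's real code):
    def parentGroups(numParents: Int, numOutput: Int): Seq[Range] =
      (0 until numOutput).map { i =>
        val start = (i.toLong * numParents / numOutput).toInt
        val end   = ((i + 1).toLong * numParents / numOutput).toInt
        start until end
      }

    // Merging my 14,844 parents into 668 outputs would put ~22 parents
    // in each task:
    parentGroups(14844, 668).map(_.size).max  // 23

If each task simply chains the iterators of its parent partitions, I don't see where the extra memory would come from.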

Spark version:
1.5.0

Test data:
241 GB of compressed Parquet files

Executors:
27 executors
16 GB memory each
3 cores each

In my tests I'm reading the data from HDFS, sampling it, coalescing into fewer partitions, and then doing a count just to have an action.
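Concretely, each test is a variation of this sketch (run from spark-shell, so sc is in scope; the path is a placeholder, and I've written the read with sqlContext for brevity, but the real job loads the Parquet through hadoopFile and ends up with 14,844 input partitions):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Placeholder path; stands in for my hadoopFile-based Parquet read.
    val rdd = sqlContext.read.parquet("hdfs:///path/to/parquet").rdd

    val sampled = rdd.sample(withReplacement = false, fraction = 0.00075)

    // The step under test; I vary the partition count (668, 201, ...).
    val coalesced = sampled.coalesce(668)

    coalesced.count()  // just to force an action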

Without coalesce there is no memory issue; the size of the data makes no difference:

    hadoopFile (creates 14,844 partitions) -> sample (fraction 0.00075) -> count()

Per-executor memory usage: 0.4 GB

Adding coalesce increases memory usage substantially, and 668 is still more partitions than I'd like:

    hadoopFile (creates 14,844 partitions) -> sample (fraction 0.00075) -> coalesce (to 668 partitions) -> count()

Per-executor memory usage: 3.1 GB

Going down to 201 partitions uses most of the available memory just for the coalesce:

    hadoopFile (creates 14,844 partitions) -> sample (fraction 0.00075) -> coalesce (to 201 partitions) -> count()

Per-executor memory usage: 9.8 GB
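For what it's worth, the number of parent partitions merged into each coalesced partition grows roughly in line with the memory usage above (just a quick calculation on my numbers):

    // Parent partitions merged per output partition in the two tests:
    val parents = 14844
    Seq(668, 201).foreach { n =>
      println(s"coalesce($n): ~${parents / n} parent partitions per task")
    }
    // coalesce(668): ~22 parents per task  (3.1 GB per executor)
    // coalesce(201): ~73 parents per task  (9.8 GB per executor)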

Coalescing to any number of partitions smaller than this crashes all the executors with out-of-memory errors. I don't really understand what is happening in Spark: that sample fraction should result in coalesced partitions that are smaller than the original input partitions.
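Here is the back-of-the-envelope arithmetic behind that expectation (it ignores decompression and JVM object overhead, which I know inflate things, but presumably not by this much):

    // Expected data volume per coalesced partition in the 201-partition test:
    val totalBytes    = 241L * 1024 * 1024 * 1024  // 241 GB of compressed input
    val fraction      = 0.00075
    val numPartitions = 201

    val sampledBytes      = totalBytes * fraction         // ~194 MB sampled overall
    val bytesPerPartition = sampledBytes / numPartitions  // ~1 MB per coalesced partition

    println(f"~${sampledBytes / 1e6}%.0f MB sampled, ~${bytesPerPartition / 1e6}%.1f MB per partition")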

I've gone through the Spark documentation, YouTube videos, and the Learning Spark book, but I haven't seen anything about this. Thanks.
