Short version: Why does coalesce use huge amounts of memory, and how does it work internally?

Long version:
I asked a similar question a few weeks ago, but I now have a simpler test with better numbers. I have an RDD created from some HDFS files. I want to sample it and then coalesce it into fewer partitions. For some reason coalesce uses huge amounts of memory. From what I've read, coalesce does not require full partitions to be held in memory at once, so I don't understand what's causing this. Can anyone explain why coalesce needs so much memory? Are there any rules of thumb for choosing the number of partitions to coalesce into?
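To make the question concrete, here is my mental model of what a no-shuffle coalesce does internally, written as a simplified Scala sketch (my guess at the behavior, not Spark's actual CoalescedRDD code, which also tries to preserve locality): each output partition just reads a contiguous group of parent partitions within a single task.

    // Simplified sketch of how I imagine coalesce(n) assigns parent
    // partitions to output partitions (not Spark's real code):
    def parentGroups(numParents: Int, numOutput: Int): Seq[Range] =
      (0 until numOutput).map { i =>
        val start = (i.toLong * numParents / numOutput).toInt
        val end   = ((i + 1).toLong * numParents / numOutput).toInt
        start until end
      }

    // Merging my 14,844 parents into 668 outputs would put ~22 parents
    // in each task:
    parentGroups(14844, 668).map(_.size).max  // 23

If each task simply chains the iterators of its parent partitions, I don't see where the extra memory would come from.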

Spark version:
1.5.0

Test data:
241 GB of compressed Parquet files

Executors:
27 executors
16 GB memory each
3 cores each

In my tests I'm reading the data from HDFS, sampling it, coalescing into fewer partitions, and then doing a count just to have an action.
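Concretely, each test is a variation of this sketch (run from spark-shell, so sc is in scope; the path is a placeholder, and I've written the read with sqlContext for brevity, but the real job loads the Parquet through hadoopFile and ends up with 14,844 input partitions):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Placeholder path; stands in for my hadoopFile-based Parquet read.
    val rdd = sqlContext.read.parquet("hdfs:///path/to/parquet").rdd

    val sampled = rdd.sample(withReplacement = false, fraction = 0.00075)

    // The step under test; I vary the partition count (668, 201, ...).
    val coalesced = sampled.coalesce(668)

    coalesced.count()  // just to force an action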

Without coalesce there is no memory issue; the size of the data makes no difference:

    hadoopFile (creates 14,844 partitions) -> sample (fraction 0.00075) -> count()

Per-executor memory usage: 0.4 GB

Adding coalesce increases memory usage substantially, and 668 is still more partitions than I'd like:

    hadoopFile (creates 14,844 partitions) -> sample (fraction 0.00075) -> coalesce (to 668 partitions) -> count()

Per-executor memory usage: 3.1 GB

Going down to 201 partitions uses most of the available memory just for the coalesce:

    hadoopFile (creates 14,844 partitions) -> sample (fraction 0.00075) -> coalesce (to 201 partitions) -> count()

Per-executor memory usage: 9.8 GB
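For what it's worth, the number of parent partitions merged into each coalesced partition grows roughly in line with the memory usage above (just a quick calculation on my numbers):

    // Parent partitions merged per output partition in the two tests:
    val parents = 14844
    Seq(668, 201).foreach { n =>
      println(s"coalesce($n): ~${parents / n} parent partitions per task")
    }
    // coalesce(668): ~22 parents per task  (3.1 GB per executor)
    // coalesce(201): ~73 parents per task  (9.8 GB per executor)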

Coalescing to any number of partitions smaller than this crashes all the executors with out-of-memory errors. I don't really understand what is happening in Spark: that sample fraction should result in coalesced partitions that are smaller than the original input partitions.
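Here is the back-of-the-envelope arithmetic behind that expectation (it ignores decompression and JVM object overhead, which I know inflate things, but presumably not by this much):

    // Expected data volume per coalesced partition in the 201-partition test:
    val totalBytes    = 241L * 1024 * 1024 * 1024  // 241 GB of compressed input
    val fraction      = 0.00075
    val numPartitions = 201

    val sampledBytes      = totalBytes * fraction         // ~194 MB sampled overall
    val bytesPerPartition = sampledBytes / numPartitions  // ~1 MB per coalesced partition

    println(f"~${sampledBytes / 1e6}%.0f MB sampled, ~${bytesPerPartition / 1e6}%.1f MB per partition")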

I've gone through the Spark documentation, YouTube videos, and the Learning Spark book, but I haven't seen anything about this. Thanks.
