[Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

Jianneng Li Mon, 24 Feb 2020 18:16:00 -0800

Hello everyone,

WholeStageCodegen generates code that appends 
results<https://github.com/apache/spark/blob/v3.0.0-preview2/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L771>
 into a BufferedRowIterator, which keeps the results in an in-memory linked 
list<https://github.com/apache/spark/blob/v3.0.0-preview2/sql/core/src/main/java/org/apache/spark/sql/execution/BufferedRowIterator.java#L34>.
 Long story short, this is a problem when multiple joins (i.e. 
BroadcastHashJoin) that can blow up get planned into the same WholeStageCodegen 
- results keep on accumulating in the linked list, and do not get consumed fast 
enough, eventually causing the JVM to run out of memory.


Does anyone else have experience with this problem? Some obvious solutions 
include making BufferedRowIterator spill the linked list, or make it bounded, 
but I'd imagine that this would have been done a long time ago if it were 
necessary.

Thanks,

Jianneng

[Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

Reply via email to