2010YOUY01 opened a new issue, #15760:
URL: https://github.com/apache/datafusion/issues/15760
### Is your feature request related to a problem or challenge?
The common NLJ implementation consumes constant memory. However,
DataFusion's implementation is optimized for execution time, which requires it
to buffer all input data on one side, making it possible to fail under
memory-constrained cases.
The following pseudocode explains the current implementation:
## Normal NLJ
```
for l_batch in nlj.left.get_next_batch():
for r_batch in nlj.right.get_next_batch():
let matched_entries = match(l_batch, r_batch);
return matched_entries
nlj.right.reset(); // Let right side of the join restart and scan again
```
## DataFusion's NLJ implementation
```
for l_batch in nlj.left.get_next_batch():
buffered_left_batches.push(l_batch);
for r_batch in nlj.right.get_next_batch():
let matched_entries = match(buffered_left_batches, r_batch);
return matched_batches;
```
Related code:
https://github.com/apache/datafusion/blob/128217081ca922c404fe423c3c1b945662d53c8a/datafusion/physical-plan/src/joins/nested_loop_join.rs#L499
### Describe the solution you'd like
When the memory limit is reached when buffering the left side input, start
probing the right side. After it's done, collect the remaining left side
entries, and let the right side scan from the beginning and probe again, until
finished.
### Describe alternatives you've considered
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]