rluvaton opened a new pull request, #15700:
URL: https://github.com/apache/datafusion/pull/15700

   ## Which issue does this PR close?
   
   - Closes #14692.
   
   ## Rationale for this change
   We need a merge sort that does not fail with an out-of-memory error.
   
   ## What changes are included in this PR?
   
   Implemented a multi-level merge sort on top of `SortPreservingMergeStream` 
that spills intermediate results when there is not enough memory.
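
The multi-level idea can be illustrated with a minimal, self-contained sketch. This is not the code in this PR: `merge_runs` and `multi_level_merge` are hypothetical names, the sorted runs are plain `Vec<i64>` values rather than record batch streams, and the intermediate results stay in memory here, whereas the PR spills them to disk.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, VecDeque};

/// K-way merge of already-sorted runs using a min-heap.
/// Hypothetical helper for illustration only.
fn merge_runs(runs: Vec<Vec<i64>>) -> Vec<i64> {
    // Each heap entry is (value, run index, position inside that run).
    let mut heap = BinaryHeap::new();
    for (idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, idx, 0usize)));
        }
    }

    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((value, idx, pos))) = heap.pop() {
        out.push(value);
        // Pull the next value from the run we just consumed from.
        if let Some(&next) = runs[idx].get(pos + 1) {
            heap.push(Reverse((next, idx, pos + 1)));
        }
    }
    out
}

/// Merges at most `max_fan_in` runs at a time. Each intermediate result is
/// pushed back onto the queue (standing in for "spill it and merge it again
/// in a later pass") until a single sorted run remains.
pub fn multi_level_merge(runs: Vec<Vec<i64>>, max_fan_in: usize) -> Vec<i64> {
    assert!(max_fan_in >= 2, "a merge needs at least two inputs");

    let mut pending: VecDeque<Vec<i64>> = runs.into();
    while pending.len() > 1 {
        let take = max_fan_in.min(pending.len());
        let group: Vec<Vec<i64>> = pending.drain(..take).collect();
        pending.push_back(merge_runs(group));
    }
    pending.pop_front().unwrap_or_default()
}
```

With a fan-in of 2 and four input runs, the sketch performs three merges in total: two on the original runs and one on the intermediate results.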
   
   **How does it work:**
   
   When using `MultiLevelMerge`, you provide in-memory streams and spill 
files. Each spill file records the memory size of its record batch with the 
largest memory consumption.
   
   **Why is this important?**
   
   `SortPreservingMergeStream` uses 
[`BatchBuilder`](https://github.com/apache/datafusion/blob/172cf8d8700dfcb62015f567e56e0bff27926812/datafusion/physical-plan/src/sorts/builder.rs),
 which grows and shrinks its memory based on the record batches it receives. 
However, if there is not enough memory, it simply fails.
   
   This solution reserves, ahead of time, the worst-case record batch size 
for each spill file, so the merge cannot run out of memory mid-sort.
   
   It also tries to reduce the buffer size and the number of streams to the 
minimum when there is not enough memory, and only fails if there is not enough 
memory to hold 2 record batches with no buffering in the stream.
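
The reservation strategy described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `SpillFileInfo`, `MergePlan`, and `plan_merge_pass` are hypothetical names, memory is modeled as plain byte counts instead of DataFusion memory reservations, and the order of the fallbacks (shrink buffering first, then reduce the number of streams) is one possible reading of the description.

```rust
/// Hypothetical description of a spill file; not a DataFusion type.
#[derive(Debug, Clone, PartialEq)]
pub struct SpillFileInfo {
    /// Memory size in bytes of the largest record batch in the file.
    pub max_batch_bytes: usize,
}

/// Hypothetical result of planning a single merge pass.
#[derive(Debug, Clone, PartialEq)]
pub struct MergePlan {
    /// How many spill files (from the front of the slice) to merge now.
    pub fan_in: usize,
    /// Extra batches buffered per stream; 0 means no buffering.
    pub buffered_batches: usize,
    /// Bytes reserved up front for the worst case.
    pub reserved_bytes: usize,
}

/// Reserves the worst case (largest batch) for every stream before merging.
/// Returns `None` when not even two batches fit without buffering.
pub fn plan_merge_pass(
    files: &[SpillFileInfo],
    memory_limit: usize,
    max_buffered: usize,
) -> Option<MergePlan> {
    if files.len() < 2 {
        return None; // nothing to merge
    }

    // First try to merge every file, shrinking the per-stream buffer
    // until the worst-case reservation fits.
    for buffered in (0..=max_buffered).rev() {
        let per_stream = buffered.saturating_add(1);
        let total = files.iter().fold(0usize, |acc, f| {
            acc.saturating_add(f.max_batch_bytes.saturating_mul(per_stream))
        });
        if total <= memory_limit {
            return Some(MergePlan {
                fan_in: files.len(),
                buffered_batches: buffered,
                reserved_bytes: total,
            });
        }
    }

    // Otherwise drop buffering entirely and merge as many files as fit.
    let mut reserved = 0usize;
    let mut fan_in = 0usize;
    for f in files {
        let next = reserved.saturating_add(f.max_batch_bytes);
        if next > memory_limit {
            break;
        }
        reserved = next;
        fan_in += 1;
    }

    // A merge needs at least two streams, each holding one batch.
    if fan_in >= 2 {
        Some(MergePlan { fan_in, buffered_batches: 0, reserved_bytes: reserved })
    } else {
        None
    }
}
```

Because the reservation is based on the largest batch in each file, a plan that is accepted cannot exceed its budget later, regardless of which batches are read during the merge.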
   
   
   It can also be easily adjusted to allow a predefined maximum amount of 
memory for the merge stream.
   
   ## Are these changes tested?
   
   Existing tests
   
   ## Are there any user-facing changes?
   
   Not really.
   
   ------
   
   Related to #15610


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

