2010YOUY01 commented on PR #14644:
URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2660783992

   > Here is the detailed explanation:
   > 
   > 1. We reserve 2X memory for each batch on insertion, so when `in_mem_batches` holds 5 batches consuming 50 MB of memory, we have already reserved 100 MB. The next insertion will trigger a merge, and possibly a spill.
   > 2. When the merge happens, each batch in `in_mem_batches` is sorted individually using `sort_batch` and fed into `StreamingMergeBuilder` to build a `SortPreservingMergeStream`. The batches are taken out of `in_mem_batches`, and each original batch is dropped immediately after its sorted counterpart is retrieved. We assume the sorted batches have the same size as the original batches.
   > 3. `SortPreservingMergeStream` polls one batch from each sorted stream and creates a [row representation](https://github.com/apache/datafusion/blob/45.0.0/datafusion/physical-plan/src/sorts/stream.rs#L118-L132) or [sorted array representation](https://github.com/apache/datafusion/blob/45.0.0/datafusion/physical-plan/src/sorts/stream.rs#L183-L188) for each batch. The sorted batches are saved into `in_progress` and the row/sorted-array representations are saved into `cursors`. We assume the row representation or sorted array has the same size as the sorted batch. At this point we have consumed 100 MB.
   > 4. `SortPreservingMergeStream` produces merged batches. We can assume the overall memory consumption stays unchanged during this process, but we still need to reserve memory for the merged batches. Each time we poll a merged batch from `SortPreservingMergeStream`, we try to reserve memory for it. If the reservation fails, all future merged batches polled from the merged stream are written directly to the spill file.
   
   Thanks for the nice diagram, this explanation is super clear
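   As I read it, the reserve-or-spill fallback in step 4 boils down to something like the sketch below (hypothetical names and simplified accounting, not DataFusion's actual `MemoryReservation` API):

```rust
// Hypothetical sketch of the reserve-or-spill fallback described in step 4.
// `Reservation` stands in for a memory-pool reservation; once one try_grow
// fails, every later merged batch goes straight to the spill file.

struct Reservation {
    limit: usize,
    used: usize,
}

impl Reservation {
    /// Grow the reservation by `bytes`, failing if it would exceed the limit.
    fn try_grow(&mut self, bytes: usize) -> Result<(), ()> {
        if self.used + bytes > self.limit {
            Err(())
        } else {
            self.used += bytes;
            Ok(())
        }
    }
}

/// Route each merged batch: keep it in memory while reservations succeed,
/// then send it and every later batch to the spill file.
fn route_batches(reservation: &mut Reservation, batch_sizes: &[usize]) -> (usize, usize) {
    let mut in_mem = 0;
    let mut spilled = 0;
    let mut spilling = false;
    for &size in batch_sizes {
        if !spilling && reservation.try_grow(size).is_ok() {
            in_mem += 1;
        } else {
            // The first failed reservation flips the stream into spill mode for good.
            spilling = true;
            spilled += 1;
        }
    }
    (in_mem, spilled)
}

fn main() {
    // Four 10-byte batches against a 25-byte budget: the third reservation
    // fails, so the third and fourth batches are spilled.
    let mut r = Reservation { limit: 25, used: 0 };
    assert_eq!(route_batches(&mut r, &[10, 10, 10, 10]), (2, 2));
}
```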
   
   Regarding point 3, I thought of an edge case that can make `Row`s much larger than 1X.
   During merging, only the order-key columns are converted to row format:
   - For `select c1, c2 from t order by c1, c2`, all columns will be converted to `Row`, so the estimate is close to 1X.
   - For `select c1, ... c10 from t order by c1`, only one column will be converted, so the extra `Row` overhead is about 0.1X. (I think it is okay to over-estimate here to be conservative.)
   
   Edge case: let's say the input is a deduplicated `StringViewArray` (e.g. a 10k-row batch with only 100 distinct values, where the payload contents are stored without duplication and the array elements just reference ranges in the payload). After converting to `Row` format, every row is fully materialized, so the `Row` representation can see up to a 100X expansion.
   I think we need some mechanism to deal with this kind of edge case; perhaps it also applies to the dictionary representation.
   
   For point 4, does the memory budget for holding the merged batches come from `sort_spill_reservation_bytes`? The small sorted runs and the converted rows should have taken up all of the memory budget at this stage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

