2010YOUY01 commented on PR #14644: URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2660783992
> Here is the detailed explanation:
>
> 1. We reserve 2X memory for each batch on insertion, so when `in_mem_batches` holds 5 batches consuming 50 MB of memory, we have already reserved 100 MB. The next insertion will trigger a merge, and possibly a spill.
> 2. When a merge happens, each batch in `in_mem_batches` is sorted individually using `sort_batch` and fed into `StreamingMergeBuilder` to build a `SortPreservingMergeStream`. The batches are taken out of `in_mem_batches`, and each original batch is dropped immediately after its sorted batch is retrieved. We assume that the sorted batches have the same size as the original batches.
> 3. `SortPreservingMergeStream` polls one batch from each sorted stream and creates a [row representation](https://github.com/apache/datafusion/blob/45.0.0/datafusion/physical-plan/src/sorts/stream.rs#L118-L132) or [sorted array representation](https://github.com/apache/datafusion/blob/45.0.0/datafusion/physical-plan/src/sorts/stream.rs#L183-L188) for each batch. The sorted batches are saved into `in_progress` and the row/sorted array representations are saved into `cursors`. We assume that the row representation or sorted array has the same size as the sorted batch. Now we have consumed 100 MB.
> 4. `SortPreservingMergeStream` produces merged batches. We can assume that the overall memory consumption remains unchanged during this process, but we still need to reserve memory for the merged batches. Each time we poll a merged batch from `SortPreservingMergeStream`, we try to reserve memory for it. If the reservation fails, all future merged batches polled from the merged stream are written directly to the spill file.

Thanks for the nice diagram, this explanation is super clear.

Regarding point 3, I thought of an edge case that can cause `Row`s to be much larger than 1X. During merging, only the order keys are converted to row format:
- If `select c1, c2 from t order by c1, c2`, all columns will be converted to `Row`, so the estimate is close to 1X.
- If `select c1, ..., c10 from t order by c1`, only one column will be converted, so the extra `Row` overhead is about 0.1X. (I think it is okay to over-estimate here to be conservative.)

Edge case: suppose the input is a deduplicated `StringViewArray` (say a 10k-row batch with only 100 distinct values, where the payload content is stored without duplication and the array elements merely reference ranges of the payload). After converting to `Row` format, every row will be materialized, so the `Row` representation can see a 100X expansion. I think we need some mechanism to deal with this kind of edge case; it probably also applies to dictionary representations.

For point 4, does the memory budget for holding merged batches come from `sort_spill_reservation_bytes`? The small sorted runs and converted rows should have taken up all the memory space at this stage.
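To make the reservation arithmetic in point 1 concrete, here is a minimal sketch (hypothetical types and names, not DataFusion's actual `MemoryReservation` API) of a 2X-per-batch policy: with a 100 MB budget, the sixth 10 MB batch fails to reserve, which is the point where a merge/spill would be triggered.

```rust
// Hypothetical sketch of the 2X reservation policy from point 1.
// `Reservation` and `try_insert` are illustrative names, not DataFusion code.
struct Reservation {
    limit: usize,    // total memory budget in bytes
    reserved: usize, // bytes currently reserved
}

impl Reservation {
    /// Try to reserve 2X the batch size; `false` means the caller
    /// should merge (and possibly spill) before inserting.
    fn try_insert(&mut self, batch_bytes: usize) -> bool {
        let needed = 2 * batch_bytes;
        if self.reserved + needed > self.limit {
            return false;
        }
        self.reserved += needed;
        true
    }
}

fn main() {
    let mb = 1024 * 1024;
    let mut res = Reservation { limit: 100 * mb, reserved: 0 };
    // 5 batches of 10 MB each: 50 MB of data, 100 MB reserved.
    for _ in 0..5 {
        assert!(res.try_insert(10 * mb));
    }
    assert_eq!(res.reserved, 100 * mb);
    // The next insertion fails, which would trigger a merge and possible spill.
    assert!(!res.try_insert(10 * mb));
}
```

The 2X factor is what leaves headroom for the merge's transient copies (sorted batches plus their row/cursor representations) described in points 2 and 3.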
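The 100X figure for the `StringViewArray` edge case can be sketched with back-of-envelope arithmetic (the 1 KB value size is an assumption for illustration; this is not DataFusion code): the array stores each distinct value's payload once, while the `Row` format materializes a copy per row, so the payload expansion is simply `num_rows / distinct_values`.

```rust
// Back-of-envelope estimate of Row-format payload expansion for a
// deduplicated StringViewArray. All sizes are assumed for illustration.
fn payload_expansion(num_rows: usize, distinct_values: usize, value_len: usize) -> usize {
    // Deduplicated payload held once by the array (views just reference it).
    let array_payload = distinct_values * value_len;
    // Row format copies the value inline for every row.
    let rows_payload = num_rows * value_len;
    rows_payload / array_payload
}

fn main() {
    // 10k rows, 100 distinct values, 1 KB each (assumed):
    // array payload ~100 KB, Row payload ~10 MB => 100X.
    assert_eq!(payload_expansion(10_000, 100, 1024), 100);
}
```

Note the per-row view structs (16 bytes each) add some fixed overhead on the array side, so the observed ratio depends on value length, but for long, highly duplicated strings the expansion approaches `num_rows / distinct_values` regardless of `value_len`.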