EeshanBembi commented on issue #19414:
URL: https://github.com/apache/datafusion/issues/19414#issuecomment-3681930247
Hey @bharath-techie
I've opened PR #19444 to address this issue. The fix adds garbage collection
for StringView/BinaryView arrays before spilling to disk, which reduces spill
file sizes by ~96% (820MB → 33MB) as reported.
The implementation:
- Performs GC on StringView/BinaryView columns in
InProgressSpillFile::append_batch() before writing
- Skips GC for small arrays (<10 rows) and when no buffers need
compaction (the 10-row threshold is arbitrary and can be changed)
- Includes comprehensive tests, among them a specific reproduction of this
ClickBench issue (which could be removed or modified)
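
To illustrate why compacting before the write shrinks the spill file, here is a minimal std-only sketch. It is not the code from the PR and does not use Arrow's API: `ToyViewColumn`, `prepare_for_spill`, and `MIN_ROWS_FOR_GC` are hypothetical names, and the toy layout (every view is an offset/length pair into one shared buffer) ignores details of the real view format such as inlined short strings.

```rust
/// Toy stand-in for a StringView-style column: each view is an
/// (offset, len) slice into one shared data buffer.
/// Assumption: this is a simplified model, not Arrow's real layout.
#[derive(Debug, Clone)]
struct ToyViewColumn {
    buffer: Vec<u8>,
    views: Vec<(usize, usize)>,
}

impl ToyViewColumn {
    /// Bytes of the i-th value.
    fn value(&self, i: usize) -> &[u8] {
        let (offset, len) = self.views[i];
        &self.buffer[offset..offset + len]
    }

    /// Copy only the bytes that views reference into a fresh buffer
    /// and re-point every view at its new location.
    fn gc(&self) -> ToyViewColumn {
        let live: usize = self.views.iter().map(|&(_, len)| len).sum();
        let mut buffer = Vec::with_capacity(live);
        let mut views = Vec::with_capacity(self.views.len());
        for &(offset, len) in &self.views {
            views.push((buffer.len(), len));
            buffer.extend_from_slice(&self.buffer[offset..offset + len]);
        }
        ToyViewColumn { buffer, views }
    }
}

/// Hypothetical constant mirroring the "<10 rows" rule described above.
const MIN_ROWS_FOR_GC: usize = 10;

/// Compact the column before it is written, unless it is small
/// or has no data buffer to compact.
fn prepare_for_spill(col: &ToyViewColumn) -> ToyViewColumn {
    if col.views.len() < MIN_ROWS_FOR_GC || col.buffer.is_empty() {
        return col.clone();
    }
    col.gc()
}
```

In this model, a column whose 10 views each reference 2 bytes of a 1000-byte buffer is written with a 20-byte buffer after `prepare_for_spill`, while a 3-row column is passed through unchanged.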
The approach aligns with @alamb's suggestion to GC during spill when the
waste ratio is high. It currently uses a simple heuristic (any buffers present
and 10 or more rows), but this could be refined in follow-up PRs to use more
sophisticated waste ratio calculations similar to Arrow's BatchCoalescer.
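
As a rough idea of what a waste-ratio check could look like, here is a std-only sketch. The function names, the 0.5 threshold, and the assumption that views do not overlap are all illustrative; this is not the heuristic in the PR and not necessarily how BatchCoalescer computes it.

```rust
/// Fraction of buffer bytes that no view references (0.0 = fully used).
/// Assumption: views do not overlap, so live bytes = sum of view lengths.
fn waste_ratio(buffer_len: usize, view_lens: &[usize]) -> f64 {
    if buffer_len == 0 {
        return 0.0;
    }
    let live: usize = view_lens.iter().sum();
    let wasted = buffer_len.saturating_sub(live);
    wasted as f64 / buffer_len as f64
}

/// Hypothetical tuning knobs; both values are placeholders.
const MIN_ROWS: usize = 10;
const WASTE_THRESHOLD: f64 = 0.5;

/// GC only when the array is large enough and mostly dead bytes.
fn should_gc(buffer_len: usize, view_lens: &[usize]) -> bool {
    view_lens.len() >= MIN_ROWS && waste_ratio(buffer_len, view_lens) > WASTE_THRESHOLD
}
```

Compared with the "any buffers present" check, a rule like this would skip the copy when the buffers are already densely referenced, at the cost of one pass over the view lengths.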
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]