Re: [I] Investigate performance tradeoff in compressing spill files [datafusion]

via GitHub Wed, 11 Jun 2025 22:44:54 -0700


2010YOUY01 commented on issue #16367:
URL: https://github.com/apache/datafusion/issues/16367#issuecomment-2964856125


   > * CPU / I/O tradeoff when `zstd` or `lz4_frame` compression is enabled, i.e. compression ratio, extra latency spent for compression
   
   It's worth running some micro-benchmarks to see how these codecs behave on Arrow arrays.
   We can set up several plausible shapes of to-be-spilled intermediate data (a single column of different types, very wide batches, etc.; perhaps just use the TPC-H tables), and measure speed and compression ratio for each compression type.
   It would also be interesting to test how `vortex` performs.
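
   A rough sketch of what such a micro-bench could measure, written in Python with only the standard library. Everything here is an assumption for illustration: `zlib` at two levels stands in for `zstd` / `lz4_frame`, raw little-endian buffers stand in for Arrow arrays, and the shape names and helper functions are hypothetical. The real benchmark would go through the Arrow IPC writer instead.

```python
import random
import struct
import time
import zlib


def make_shapes(rows=50_000, seed=0):
    """Build a few hypothetical to-spill payloads as raw little-endian buffers."""
    rng = random.Random(seed)
    return {
        # Monotonic int64 column, e.g. an already sorted key.
        "sorted_i64": struct.pack(f"<{rows}q", *range(rows)),
        # Low-cardinality int64 column (only 8 distinct values).
        "low_card_i64": struct.pack(
            f"<{rows}q", *(rng.randrange(8) for _ in range(rows))
        ),
        # Uniformly random float64 column, expected to compress poorly.
        "random_f64": struct.pack(
            f"<{rows}d", *(rng.random() for _ in range(rows))
        ),
    }


def bench(payload, level):
    """Return (compression_ratio, seconds_spent_compressing) for one payload."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    # Sanity check: the round trip must reproduce the input exactly.
    assert zlib.decompress(compressed) == payload
    return len(payload) / len(compressed), elapsed


def run(levels=(1, 6)):
    """Benchmark every (shape, level) pair and collect the results."""
    results = {}
    for name, payload in make_shapes().items():
        for level in levels:
            results[(name, level)] = bench(payload, level)
    return results


if __name__ == "__main__":
    for (name, level), (ratio, secs) in run().items():
        print(f"{name:>14} level={level} ratio={ratio:6.2f} time={secs * 1e3:8.2f} ms")
```

   The point of the shape table is that ratio and latency should be reported per shape and per codec setting, not as one aggregate number.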
   
   > * The current Arrow IPC stream writer always writes one `batch` at a time in `append_batch`. In terms of compression, it is not yet clear how much a single batch can benefit from compression.
   
   This does not look ideal, especially when compressing 'thin' batches of primitive types. If we can confirm that the per-batch overhead is significant, we should consider compressing multiple batches at once.
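
   To illustrate the concern, here is a minimal Python sketch comparing per-batch compression with compressing several batches together. It is an illustration under stated assumptions, not a measurement of the Arrow IPC writer: `zlib` stands in for the real codecs, each "batch" is just a raw int64 buffer, and the function names and sizes are hypothetical.

```python
import struct
import zlib


def make_batches(num_batches=64, rows_per_batch=128):
    """Hypothetical 'thin' batches: one int64 column, values continuing across batches."""
    return [
        struct.pack(
            f"<{rows_per_batch}q",
            *range(i * rows_per_batch, (i + 1) * rows_per_batch),
        )
        for i in range(num_batches)
    ]


def compressed_size_per_batch(batches):
    # Each batch is compressed independently, so every batch pays the codec's
    # framing cost and starts with an empty history window.
    return sum(len(zlib.compress(b)) for b in batches)


def compressed_size_grouped(batches, group=16):
    # Concatenate `group` batches before compressing, so the framing cost is
    # amortized and later batches can reference earlier ones.
    total = 0
    for i in range(0, len(batches), group):
        total += len(zlib.compress(b"".join(batches[i:i + group])))
    return total


if __name__ == "__main__":
    batches = make_batches()
    raw = sum(len(b) for b in batches)
    print("raw bytes:      ", raw)
    print("per-batch bytes:", compressed_size_per_batch(batches))
    print("grouped bytes:  ", compressed_size_grouped(batches))
```

   How large the gap is for real Arrow batches with `zstd` or `lz4_frame` is exactly what the micro-benchmarks would need to show.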
   
   > * whether we need a separate `Writer` or `Reader` implementation instead of the IPC Stream Writer.
   > * how to introduce some sort of `adaptiveness`.
   > 
   > ### Describe the solution you'd like
   > First, we need to track (or update) how many bytes are written to spill files. DataFusion currently tracks `spilled_bytes` as part of `SpillMetrics`, but it is calculated from the in-memory array size, which can differ from the actual spill file size, especially when spill files are compressed.
   > 
   > Second, update the benchmarks or write separate benchmarks to see the performance characteristics. One possible way is to write spill-related metrics to output.json when running benches like tpch with the `debug` option. Another idea is to generate some spill files for a microbenchmark that tests only the spill write/read path.
   
   🤔 Yes, `spilled_bytes` is currently not computed correctly when compression is enabled; it needs a follow-up fix. Adding more micro-benchmarks sounds great.
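
   One possible shape for that fix, sketched in Python and not written against DataFusion's actual `SpillMetrics` code: count the bytes that really reach the file, separately from the in-memory size. `CountingWriter` and `spill` are hypothetical names, `zlib` is a stand-in codec, and batches are plain byte strings.

```python
import io
import zlib


class CountingWriter:
    """Wrap a binary file-like object and count the bytes that actually reach it."""

    def __init__(self, inner):
        self.inner = inner
        self.bytes_written = 0

    def write(self, data):
        written = self.inner.write(data)
        self.bytes_written += written
        return written


def spill(batches, sink):
    """Hypothetical spill loop; returns (in_memory_bytes, on_disk_bytes)."""
    out = CountingWriter(sink)
    compressor = zlib.compressobj()
    in_memory = 0
    for batch in batches:
        # What a size-of-arrays metric would report.
        in_memory += len(batch)
        # What actually goes to the file after compression.
        out.write(compressor.compress(batch))
    out.write(compressor.flush())
    return in_memory, out.bytes_written


if __name__ == "__main__":
    sink = io.BytesIO()
    mem, disk = spill([bytes(1024)] * 8, sink)
    print(f"in-memory bytes: {mem}, bytes written: {disk}")
```

   Reporting both numbers side by side would also make the compression ratio visible in the metrics.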
   

