Re: [PR] perf: improve json read [datafusion]

via GitHub Thu, 19 Mar 2026 09:32:04 -0700


ariel-miculas commented on PR #20823:
URL: https://github.com/apache/datafusion/pull/20823#issuecomment-4091483534


   No, I'm having trouble coming up with a realistic benchmark.
   
   The previous benchmark 
https://github.com/apache/datafusion/pull/19687/changes#diff-5358b38b6265d769b66b614f7ba88ed9320f7a9fce5197330b7c01c2a8a3ed3b
 incorrectly assumes that all the requested bytes (via get_opts) will be read, 
while you can actually request a 10GiB stream of bytes and read only 16KiB from 
it.
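
   The gap between requested and read bytes can be sketched with a toy, std-only Rust example. It is an illustration under stated assumptions, not the object_store API: `LazyRange` and `requested_vs_read` are hypothetical names, and the stream just fabricates filler bytes on demand.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a lazily streamed range response: it "promises"
/// `requested` bytes but only produces them when the caller actually reads.
struct LazyRange {
    requested: u64,
    served: u64,
}

impl LazyRange {
    fn new(requested: u64) -> Self {
        Self { requested, served: 0 }
    }
}

impl Read for LazyRange {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Serve at most what is left of the requested range.
        let remaining = self.requested - self.served;
        let n = (buf.len() as u64).min(remaining) as usize;
        buf[..n].fill(b'x'); // filler payload, stands in for real data
        self.served += n as u64;
        Ok(n)
    }
}

/// Returns (bytes requested, bytes actually pulled from the stream).
fn requested_vs_read(requested: u64, consume: usize) -> (u64, u64) {
    let mut stream = LazyRange::new(requested);
    let mut buf = vec![0u8; consume];
    stream
        .read_exact(&mut buf)
        .expect("stream shorter than `consume`");
    (stream.requested, stream.served)
}

fn main() {
    // Request 10 GiB, consume only 16 KiB.
    let (requested, read) = requested_vs_read(10 * 1024 * 1024 * 1024, 16 * 1024);
    println!("requested = {requested} bytes, actually read = {read} bytes");
}
```

   A benchmark that sums the requested range sizes would count the first number, while the cost actually paid is closer to the second.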
   
   As a result, the benchmark of the previous PR for reducing the read 
amplification shows impressive improvements, but it hides the fact that it 
breaks the parallelization between data fetching and JSON decoding (by doing 
all the data fetching in the JsonOpener instead of allowing FileStream to do 
its magic).
   
   So I'm not sure how to write a benchmark that can prove at the same time 
that:
   * I'm increasing performance (because there are no more read requests in the 
JsonOpener)
   * This solution is better than the original proposal 
https://github.com/apache/datafusion/pull/19687 because it doesn't break 
parallelization between fetching and decoding
   * This optimization is relevant for real-world object store implementations 
(where network latency matters, network speed matters, data computation can 
happen while waiting for bytes to be read, read-ahead is a relevant 
optimization etc.)
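
   One way to reason about the second and third points before writing a real benchmark is a deliberately simple cost model. The Rust sketch below is an assumption-laden toy, not a measurement: it assumes a fixed per-chunk fetch latency and decode time, a two-stage pipeline, and no read-ahead beyond one chunk in flight. The function names are hypothetical.

```rust
/// Total time when every chunk is fetched and then decoded, with no overlap.
fn sequential_ms(chunks: u64, fetch_ms: u64, decode_ms: u64) -> u64 {
    chunks * (fetch_ms + decode_ms)
}

/// Total time when fetching chunk i+1 overlaps with decoding chunk i.
/// Two-stage pipeline with constant stage times: the first fetch and the
/// last decode cannot overlap with anything, and the slower stage dominates
/// the steps in between.
fn pipelined_ms(chunks: u64, fetch_ms: u64, decode_ms: u64) -> u64 {
    if chunks == 0 {
        return 0;
    }
    fetch_ms + (chunks - 1) * fetch_ms.max(decode_ms) + decode_ms
}

fn main() {
    // Assumed numbers, for illustration only: 8 chunks, 50 ms fetch latency
    // per chunk, 30 ms decode time per chunk.
    let (chunks, fetch, decode) = (8, 50, 30);
    println!("sequential: {} ms", sequential_ms(chunks, fetch, decode));
    println!("pipelined:  {} ms", pipelined_ms(chunks, fetch, decode));
}
```

   In this model the benefit of overlapping disappears as fetch latency approaches zero, which is one reason a benchmark against a local, low-latency store may not show the effect.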


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

