ariel-miculas commented on PR #20823: URL: https://github.com/apache/datafusion/pull/20823#issuecomment-4091483534
No, I'm having trouble coming up with a realistic benchmark.

The previous benchmark https://github.com/apache/datafusion/pull/19687/changes#diff-5358b38b6265d769b66b614f7ba88ed9320f7a9fce5197330b7c01c2a8a3ed3b incorrectly assumes that all the bytes requested via get_opts will be read, whereas you can actually request a 10 GiB stream of bytes and read only 16 KiB from it. As a result, the previous PR's benchmark for reducing read amplification shows impressive improvements, but it hides the fact that the change breaks the parallelization between data fetching and JSON decoding (by doing all the data fetching in the JsonOpener instead of allowing FileStream to do its magic).

So I'm not sure how to write a benchmark that can prove, all at the same time, that:

* I'm increasing performance (because there are no more read requests in the JsonOpener)
* This solution is better than the original proposal https://github.com/apache/datafusion/pull/19687 because it doesn't break the parallelization between fetching and decoding
* This optimization is relevant for real-world object store implementations (where network latency and network speed matter, computation can happen while waiting for bytes to be read, read-ahead is a relevant optimization, etc.)
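To illustrate the gap between requested and consumed bytes described above, here is a minimal std-only sketch. It does not use the object_store API: `LazyRange` and `request_and_consume` are hypothetical names made up for this example, and the stream is a stand-in that produces bytes only when the consumer pulls them, which is an assumption about how a lazily read response body behaves.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a lazily produced byte stream: it hands out
/// bytes only when the consumer asks for them and counts how many it served.
struct LazyRange {
    requested: u64, // size of the range the caller asked for
    served: u64,    // bytes actually produced so far
}

impl LazyRange {
    fn new(requested: u64) -> Self {
        Self { requested, served: 0 }
    }
}

impl Read for LazyRange {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Serve at most what is left in the requested range.
        let remaining = self.requested - self.served;
        let n = remaining.min(buf.len() as u64) as usize;
        buf[..n].fill(b'x'); // dummy payload
        self.served += n as u64;
        Ok(n)
    }
}

/// Request `requested` bytes but consume only `wanted` of them.
/// Returns (bytes requested, bytes actually served).
/// Panics if `wanted` exceeds `requested`.
fn request_and_consume(requested: u64, wanted: usize) -> (u64, u64) {
    let mut stream = LazyRange::new(requested);
    let mut buf = vec![0u8; wanted];
    stream
        .read_exact(&mut buf)
        .expect("requested range is smaller than the amount consumed");
    (stream.requested, stream.served)
}

fn main() {
    const GIB: u64 = 1 << 30;
    const KIB: usize = 1 << 10;
    // Ask for 10 GiB, read only 16 KiB.
    let (requested, served) = request_and_consume(10 * GIB, 16 * KIB);
    println!("requested {requested} bytes, served {served} bytes");
}
```

A benchmark that sums the requested range sizes would count 10 GiB here, while only 16 KiB ever moved through the reader, so a metric based on request sizes alone can overstate read amplification.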
