adriangb commented on issue #16841: URL: https://github.com/apache/datafusion/issues/16841#issuecomment-3353411281
One place that IMO would really benefit from the new arrow APIs is IO pipeline width. We've found that being able to go wide on IO is *extremely* useful for query latency: it lets us saturate the bandwidth of our nodes on bandwidth-constrained queries (e.g. point queries that have to filter GBs of data in memory to produce only 1 row). Currently in DataFusion, IO width (1) is tied to the CPU work width and (2) has no way to account for the selectivity of the query in terms of rows or columns. I started some work on this in https://github.com/apache/datafusion/pull/17758, but I think the ideal scenario is a config along the lines of "buffer up to 1GB of data from DataSourceExec"; that budget could then be spent opening 1024 files that produce 1kB of data each or 1 file that produces a 2GB RecordBatch on each poll.
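
To make the "buffer up to N bytes" idea concrete, here is a minimal sketch of a byte budget that decides how many reads may be in flight, independent of CPU width. None of this is an existing DataFusion or arrow API: `ByteBudget`, `try_reserve`, `release`, and `admit_reads` are hypothetical names, and the per-read size estimates are assumed to come from somewhere like file metadata. One further assumption: a request larger than the whole budget (such as the 2GB batch above against a 1GB budget) is admitted only when nothing else is buffered, so the scan can always make progress.

```rust
/// Hypothetical byte budget for buffered scan output.
/// Illustrative only; this is not a DataFusion API.
#[derive(Debug)]
pub struct ByteBudget {
    capacity: usize,
    in_use: usize,
}

impl ByteBudget {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, in_use: 0 }
    }

    /// Try to admit `bytes` of buffered data.
    ///
    /// Assumption: an oversized request is admitted when the buffer is
    /// empty, so one huge batch cannot stall the scan forever.
    pub fn try_reserve(&mut self, bytes: usize) -> bool {
        let fits = self.in_use.saturating_add(bytes) <= self.capacity;
        if fits || self.in_use == 0 {
            self.in_use = self.in_use.saturating_add(bytes);
            true
        } else {
            false
        }
    }

    /// Return `bytes` to the budget once the consumer has drained them.
    pub fn release(&mut self, bytes: usize) {
        self.in_use = self.in_use.saturating_sub(bytes);
    }

    /// Bytes currently reserved.
    pub fn in_use(&self) -> usize {
        self.in_use
    }
}

/// Count how many pending reads can start right now, given an estimated
/// output size for each. Stops at the first read that does not fit.
pub fn admit_reads(budget: &mut ByteBudget, estimated_sizes: &[usize]) -> usize {
    estimated_sizes
        .iter()
        .take_while(|&&bytes| budget.try_reserve(bytes))
        .count()
}
```

With a 1GB budget, `admit_reads` would admit all 1024 of the 1kB reads at once, while a single 2GB read would be admitted alone and block further reads until it is released. A real implementation would presumably need async wakeups and actual batch sizes rather than estimates; this sketch only shows the accounting.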
