adriangb commented on issue #16841:
URL: https://github.com/apache/datafusion/issues/16841#issuecomment-3353411281

One place that IMO would really benefit from using the new arrow APIs is IO 
pipeline width. We've found that being able to go wide on IO is *extremely* 
useful for query latency: it lets us saturate the bandwidth of our nodes on 
bandwidth-constrained queries (e.g. point queries where it is necessary to 
filter GBs of data in memory to produce only 1 row). Currently in DataFusion 
IO width (1) is tied to the CPU work width and (2) has no way to account for 
the selectivity of the query in terms of rows or columns. I started some work 
on this in https://github.com/apache/datafusion/pull/17758 but really I think 
the ideal scenario is a config along the lines of "buffer up to 1GB of data 
from DataSourceExec", which could then mean opening 1024 files that produce 
1kB of data each or 1 file that produces a 2GB RecordBatch on each poll.
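
To make the byte-budget idea concrete, here is a rough sketch of how such a limit might translate into IO width. Everything in it is an assumption for illustration: `files_to_open`, its parameters, and the per-file size estimates are hypothetical and are not DataFusion or arrow APIs, and a real implementation would presumably work from observed batch sizes rather than up-front estimates.

```rust
/// Hypothetical helper (not a DataFusion API): given an estimate of how many
/// bytes each pending file yields per poll, return how many files, taken in
/// order, can be open at once without exceeding `budget_bytes`.
///
/// At least one file is always admitted when any are pending, so a single
/// batch larger than the budget cannot stall the scan.
fn files_to_open(estimated_bytes: &[u64], budget_bytes: u64) -> usize {
    let mut used: u64 = 0;
    let mut count: usize = 0;
    for &size in estimated_bytes {
        // saturating_add avoids overflow on very large estimates.
        let next = used.saturating_add(size);
        // Stop once the budget would be exceeded, unless nothing has been
        // admitted yet.
        if count > 0 && next > budget_bytes {
            break;
        }
        used = next;
        count += 1;
    }
    count
}
```

With a 1GB budget, 1024 pending files estimated at 1kB each would all be admitted, while a pending list of 2GB files would be read one at a time. The point of the sketch is only that the width falls out of the data volume rather than the CPU partition count.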


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

