alamb commented on issue #16841: URL: https://github.com/apache/datafusion/issues/16841#issuecomment-3355803139
> I started some work on this in https://github.com/apache/datafusion/pull/17758 but really I think the ideal scenario is a config along the lines of "buffer up to 1GB of data from DataSourceExec" and then that can be opening 1024 files that produce 1kB of data each or 1 file that produces a 2GB RecordBatch on each poll.

I have been thinking a lot about this too in relation to a push parquet decoder: https://github.com/apache/arrow-rs/issues/7983

Currently, the arrow-rs parquet decoder decodes in units of a row group (basically, it fetches all the data pages in a row group needed for reading a column, after accounting for predicates). This limits the number of IOs, which is important when reading from object storage, but it also means the minimum memory usage is a function of how large the row group is, so it cannot easily be capped at a fixed amount.

I am hoping that once we have a push decoder working (I hope to have it ready for review later this week) it will finally be feasible to decode a subset of the row groups at a time, trading off the number of requests against memory usage.
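To make the row-group granularity concrete, here is a minimal sketch using the existing arrow-rs sync reader (`ParquetRecordBatchReaderBuilder`), not the push decoder discussed above. Today you can already choose *which* row groups to decode, but each selected row group is fetched and decoded as a unit, so peak memory tracks row-group size; the file path is a placeholder.

```rust
// Sketch: decode a Parquet file one row group at a time with the current
// arrow-rs reader. Fewer row groups per pass means lower peak memory at the
// cost of more read passes, which is the trade-off described above.
// Assumes the `parquet` crate; "data.parquet" is a hypothetical path.
use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Inspect row-group sizes from the footer metadata; this is the floor on
    // memory for decoding each group with the row-group-granular reader.
    for (i, rg) in builder.metadata().row_groups().iter().enumerate() {
        println!("row group {i}: {} bytes", rg.total_byte_size());
    }

    // Decode only row group 0 in this pass.
    let reader = builder
        .with_row_groups(vec![0])
        .with_batch_size(8192)
        .build()?;

    for batch in reader {
        let batch = batch?;
        println!("decoded {} rows", batch.num_rows());
    }
    Ok(())
}
```

The point of the push decoder is to go finer than this: within a row group, decode from whatever bytes the caller has pushed so far, so memory can be bounded independently of how the writer sized its row groups.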
