AdamGS commented on issue #14444: URL: https://github.com/apache/datafusion/issues/14444#issuecomment-2729189144
I was out for a couple of weeks on vacation and had some time to think. What I came up with is to build this layer, and maybe parts of the IO, around some abstracted columnar format, so that more of these interfaces are built on top of it. Today the Parquet integration is far more complex than any of the row-oriented formats. While implementing [`Vortex`](https://github.com/spiraldb/vortex) support I find myself drawing inspiration from it, but much of that might as well be shared logic in datafusion.

I think @tustvold's [comment](https://github.com/apache/datafusion/pull/15018#issuecomment-2708822164) regarding a new storage interface also touches on this. We have an opportunity to move the level of abstraction in a way that improves extensibility AND makes the system more performant at the same time. I've also been looking into caching higher-level objects (like `ParquetMetaData`), and it seems likely that objects like it will exist for other columnar formats (schema + stats + some general file structure).

Maybe columnar and row-oriented formats can be split into different traits (maybe still sub-traits of `FileSource`), so we can abstract each one better (the way CSV and JSON already share the `Decoder` trait).
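To make the trait split a bit more concrete, here is a rough sketch of what I have in mind. Every name in it (`FileSourceSketch`, `ColumnarSource`, `RowSource`, `ColumnarMeta`, `MetaCache`, and the toy formats) is made up for illustration and is not an existing datafusion API; `FileSourceSketch` only stands in for the real `FileSource`. The point is that stats-based pruning and metadata caching could live once on the columnar side instead of being reimplemented per format.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for datafusion's `FileSource`; NOT the real trait.
trait FileSourceSketch {
    fn format_name(&self) -> &'static str;
}

/// Hypothetical file-level metadata a columnar format could expose:
/// schema + stats + some general file structure.
#[derive(Debug, Clone, PartialEq)]
struct ColumnarMeta {
    columns: Vec<String>,
    row_groups: usize,
    /// Per-column (min, max) statistics.
    min_max: HashMap<String, (i64, i64)>,
}

/// Hypothetical sub-trait for columnar formats (Parquet, Vortex, ...).
trait ColumnarSource: FileSourceSketch {
    fn read_meta(&self, path: &str) -> ColumnarMeta;

    /// Shared logic: a file can be skipped for the predicate `column > value`
    /// when the column's max is <= value. Without stats we never skip.
    fn can_skip_gt(&self, meta: &ColumnarMeta, column: &str, value: i64) -> bool {
        meta.min_max.get(column).map_or(false, |&(_, max)| max <= value)
    }
}

/// Hypothetical sub-trait for row-oriented formats (CSV, JSON, ...).
trait RowSource: FileSourceSketch {
    fn decode_record(&self, raw: &str) -> Vec<String>;
}

/// Format-agnostic metadata cache keyed by file path.
struct MetaCache {
    entries: HashMap<String, ColumnarMeta>,
    /// Number of times metadata was actually read from a source.
    loads: usize,
}

impl MetaCache {
    fn new() -> Self {
        MetaCache { entries: HashMap::new(), loads: 0 }
    }

    fn get_or_load<S: ColumnarSource>(&mut self, source: &S, path: &str) -> ColumnarMeta {
        if let Some(meta) = self.entries.get(path) {
            return meta.clone();
        }
        self.loads += 1;
        let meta = source.read_meta(path);
        self.entries.insert(path.to_string(), meta.clone());
        meta
    }
}

/// Toy columnar format with hard-coded metadata, for illustration only.
struct ToyColumnar;

impl FileSourceSketch for ToyColumnar {
    fn format_name(&self) -> &'static str {
        "toy-columnar"
    }
}

impl ColumnarSource for ToyColumnar {
    fn read_meta(&self, _path: &str) -> ColumnarMeta {
        let mut min_max = HashMap::new();
        min_max.insert("id".to_string(), (1, 100));
        ColumnarMeta { columns: vec!["id".to_string()], row_groups: 1, min_max }
    }
}

/// Toy row-oriented format that splits a record on commas.
struct ToyCsv;

impl FileSourceSketch for ToyCsv {
    fn format_name(&self) -> &'static str {
        "toy-csv"
    }
}

impl RowSource for ToyCsv {
    fn decode_record(&self, raw: &str) -> Vec<String> {
        raw.split(',').map(|field| field.trim().to_string()).collect()
    }
}
```

With something like this, a cache such as `MetaCache` would only need to know about `ColumnarSource`, so a Parquet or Vortex implementation would get caching and pruning without format-specific glue. Whether the real metadata types can be unified this cleanly is an open question.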