Re: [I] [Epic] Split datasources out from `datafusion` crate (`datafusion/core`) [datafusion]

via GitHub Mon, 17 Mar 2025 04:47:36 -0700


AdamGS commented on issue #14444:
URL: https://github.com/apache/datafusion/issues/14444#issuecomment-2729189144


   I was out for a couple of weeks on vacation and had some time to think. 
What I came up with is to build this layer, and maybe parts of IO, around an 
abstracted columnar format, so that more of these interfaces are built on top 
of that abstraction.
   Today the Parquet integration is far more complex than that of any of the 
row-oriented formats. When implementing 
[`Vortex`](https://github.com/spiraldb/vortex) support I find myself drawing 
inspiration from it, but much of that logic might as well be shared in 
DataFusion.
   
   I think @tustvold's 
[comment](https://github.com/apache/datafusion/pull/15018#issuecomment-2708822164)
 regarding a new storage interface also touches on this. We have opportunities 
to move the level of abstraction in a way that improves extensibility and makes 
the system more performant at the same time. I've also been looking into 
caching higher-level objects (like `ParquetMetaData`), and it seems likely 
that similar objects will exist for other columnar formats (schema + stats + 
some general file structure). 
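
   To make the caching idea concrete, here is a rough sketch of what a 
format-agnostic metadata cache might look like. Everything in it is 
hypothetical: `ColumnarFileMetadata` and `MetadataCache` are names made up for 
illustration and are not existing DataFusion types, and the fields are 
simplified stand-ins for schema, stats and file structure.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical format-agnostic summary of a columnar file
/// (not an existing DataFusion type).
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnarFileMetadata {
    /// Column names, standing in for a full schema.
    pub columns: Vec<String>,
    /// Total row count, standing in for richer statistics.
    pub num_rows: u64,
    /// Rows per chunk (row group, stripe, ...), standing in for file structure.
    pub chunk_rows: Vec<u64>,
}

/// Hypothetical cache keyed by file path.
#[derive(Default)]
pub struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<ColumnarFileMetadata>>>,
}

impl MetadataCache {
    /// Returns the cached metadata for `path`, calling `load` only on a miss.
    /// The lock is held while loading, which keeps the sketch simple but
    /// would serialize loads in a real system.
    pub fn get_or_load<F>(&self, path: &str, load: F) -> Arc<ColumnarFileMetadata>
    where
        F: FnOnce() -> ColumnarFileMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        entries
            .entry(path.to_string())
            .or_insert_with(|| Arc::new(load()))
            .clone()
    }
}
```

   The point is only that the cache never needs to know which format produced 
the metadata, as long as each format can fill in the shared struct.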
   Perhaps columnar and row-oriented formats could be split into separate 
traits (possibly still sub-traits of `FileSource`), so that each can be 
abstracted better (the way CSV and JSON already share the `Decoder` trait).
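
   A minimal sketch of that split, assuming a heavily simplified stand-in for 
`FileSource` (the real trait has a different and much larger surface). 
`ColumnarSource`, `RowSource`, `ParquetLike`, `CsvLike` and all method names 
are hypothetical and only illustrate the shape:

```rust
/// Simplified stand-in for DataFusion's `FileSource`; not the real trait.
pub trait FileSource {
    fn file_type(&self) -> &str;
}

/// Hypothetical sub-trait for columnar formats.
pub trait ColumnarSource: FileSource {
    /// Whether reads can be restricted to a subset of columns.
    fn supports_projection_pushdown(&self) -> bool {
        true
    }
    /// Number of independently readable chunks (row groups, stripes, ...).
    fn chunk_count(&self) -> usize;
}

/// Hypothetical sub-trait for row-oriented formats.
pub trait RowSource: FileSource {
    /// Byte that separates records in the raw stream.
    fn record_delimiter(&self) -> u8 {
        b'\n'
    }
}

pub struct ParquetLike {
    pub row_groups: usize,
}

pub struct CsvLike;

impl FileSource for ParquetLike {
    fn file_type(&self) -> &str {
        "parquet"
    }
}

impl ColumnarSource for ParquetLike {
    fn chunk_count(&self) -> usize {
        self.row_groups
    }
}

impl FileSource for CsvLike {
    fn file_type(&self) -> &str {
        "csv"
    }
}

impl RowSource for CsvLike {}

/// Shared logic written once against the columnar sub-trait.
pub fn describe_columnar(source: &dyn ColumnarSource) -> String {
    format!("{} with {} chunks", source.file_type(), source.chunk_count())
}
```

   Logic like `describe_columnar` is written once and works for any columnar 
format, which is roughly what I'd like for the pieces that Parquet and Vortex 
currently implement separately.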


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


