I think the idea of DataFusion + DeltaLake is quite compelling and likely useful.
However, I think DataFusion is ideally an "embeddable query engine" rather than a database system in itself, so in that mental model Delta Lake integration belongs somewhere other than the core DataFusion crate.

My ideal structure would be a new crate (maybe not even part of the Apache Arrow Project), perhaps called `datafusion-delta-rs`, that contained the TableProvider and whatever else was needed to integrate DataFusion with DeltaLake.

This structure could also start a pattern of publishing plugins for DataFusion separately from the core.

Andrew

p.s. now that Arrow is publishing more incrementally (e.g. 4.1.0, 4.2.0, etc), I think delta-rs[1] and datafusion both only specify `4.x` so they should work together nicely

[1] https://github.com/delta-io/delta-rs/blame/main/rust/Cargo.toml

On Wed, Jun 9, 2021 at 2:29 AM Daniël Heres <danielhe...@gmail.com> wrote:

> Hi all,
>
> I would like to receive some feedback about adding Delta Lake support to
> DataFusion (https://github.com/apache/arrow-datafusion/issues/525).
> As you might know, Delta Lake <https://delta.io/> is a format adding
> features like ACID transactions, statistics, and storage optimization to
> Parquet and is getting quite some traction for managing data lakes.
> It seems a great feature to have in DataFusion as well.
>
> The delta-rs <https://github.com/delta-io/delta-rs> project provides a
> native, Apache licensed, Rust implementation of Delta Lake, already
> supporting a large part of the format and operations.
>
> The first integration I would like to propose is adding read support via a
> new TableProvider. There might be some work to do around dependencies as
> both DataFusion and delta-rs rely on (certain versions of) Arrow and
> Parquet.
>
> Let me know if you have any further ideas or concerns.
>
> Best regards,
>
> Daniël Heres
>