This feels like it's trying to use Flight as a catalog service.

On Wed, Mar 26, 2025 at 6:04 PM Rusty Conover <ru...@conover.me.invalid> wrote:
> Hi Jacob and Matt,
>
> I appreciate the opportunity to review the document. It helped clarify some
> things for me, but I’d like to propose an approach that is slightly
> different while still aligning with the overall goals. That said, I know
> you all have far more experience designing and implementing ideas around
> Arrow Flight, so I appreciate your patience if this isn’t the best idea.
>
> Motivation
>
> My goal is to enable Arrow Flight to reference data from arbitrary
> locations and formats. The main reasons for this are:
>
> 1. *Efficiency and Scalability* – Running Arrow Flight servers can be
> resource-intensive and difficult to scale. A CDN allows for a pay-per-byte
> model, whereas an Arrow Flight server incurs costs for both CPU and data
> transfer.
> 2. *Flexible Data Access* – Some datasets are static, while others require
> streaming from hot storage. For example, an event store may write data to
> Parquet on S3 after five minutes while keeping recent events in memory.
> Ideally, clients could be pointed directly to the Parquet files when
> appropriate.
> 3. *Format Flexibility* – Arrow IPC is great, but it’s not always the most
> efficient format for every use case, especially when data is already stored
> in other structured formats. While I ❤️ Arrow IPC, I’d like to support
> additional formats where it makes sense.
>
> Proposed Approach
>
> As the proposal suggests, I’d like to extend FlightLocation URIs to
> provide richer context about data sources. Specifically, I propose two
> extensions using `data://` URIs within FlightEndpoints:
>
> *1. Referencing Arbitrary Data Formats and Lakehouse Tables*
>
> Flight URIs should be able to reference entire Delta Lake or Iceberg tables
> using their storage locations. This would also support any data format that
> DuckDB can read (e.g., Parquet, JSON, CSV) over various storage protocols
> (HTTPS, S3, GCS, Azure).
>
> This means a Flight’s contents could be a combination of a Delta
> Lake/Iceberg table (at a specific snapshot) and traditional Flight
> endpoints returning Arrow IPC.
>
> Example JSON for a Delta Lake table:
>
> ```json
> {
>   "format": "delta-lake",
>   "location": "s3://example-bucket/example-table",
>   "options": { "snapshot_id": "snapshot" }
> }
> ```
>
> Or for a CSV file:
>
> ```json
> {
>   "format": "csv",
>   "location": "s3://example-bucket/example-data.csv",
>   "options": { "separator": "\t" }
> }
> ```
>
> *2. Including Topological Metadata in Location URIs*
>
> Currently, FlightLocation URIs provide little network topological context
> beyond the resource location, leaving clients to determine the most
> efficient source. It would be useful to include metadata that helps clients
> make better decisions.
>
> One approach is to use a `data://` URI containing additional metadata:
>
> ```json
> {
>   "topology": { "datacenter": "dft" },
>   "format": "csv",
>   "location": "file:///mnt/nfs/example-data.csv"
> }
> ```
>
> These `data://` URIs would follow RFC 2397 for encoding.
>
> If an Arrow client doesn’t support or recognize a `data://` URI, it should
> fall back to another available location, such as a standard Arrow Flight
> server that converts data on demand. If no viable locations exist, an
> exception should be raised. A rough sketch of this fallback logic appears
> below.
>
> Next Steps
>
> I’d love to hear your thoughts on this approach and how it might integrate
> with the existing proposal.
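>
> To make the intended client behavior concrete, here is a rough, untested
> sketch in Python with pyarrow. It assumes the metadata is JSON encoded per
> RFC 2397 (base64 or percent-encoded), and `read_external` is a hypothetical
> stand-in for whatever engine (DuckDB, delta-rs, etc.) actually reads the
> referenced format:
>
> ```python
> import base64
> import json
> from urllib.parse import unquote
>
> import pyarrow.flight as flight
>
>
> def try_decode_data_uri(uri: str):
>     """Return the JSON metadata carried by an RFC 2397 data URI
>     (data:[<mediatype>][;base64],<data>), or None for other schemes."""
>     if not uri.startswith("data:"):
>         return None
>     header, _, payload = uri.partition(",")
>     if header.endswith(";base64"):
>         payload = base64.b64decode(payload).decode("utf-8")
>     else:
>         payload = unquote(payload)  # percent-encoded per the RFC
>     return json.loads(payload)
>
>
> def read_external(fmt, location, options):
>     """Hypothetical hand-off to a non-Flight reader (e.g. DuckDB)."""
>     raise NotImplementedError(f"no reader registered for {fmt}")
>
>
> def consume_endpoint(endpoint: flight.FlightEndpoint):
>     """Try each advertised location in turn, preferring direct reads."""
>     for location in endpoint.locations:
>         uri = location.uri
>         if isinstance(uri, bytes):  # pyarrow may expose the URI as bytes
>             uri = uri.decode("utf-8")
>         metadata = try_decode_data_uri(uri)
>         if metadata is not None:
>             return read_external(metadata["format"], metadata["location"],
>                                  metadata.get("options", {}))
>         try:
>             # An ordinary Flight server location: fetch Arrow IPC directly.
>             client = flight.connect(location)
>             return client.do_get(endpoint.ticket).read_all()
>         except flight.FlightError:
>             continue  # try the next advertised location
>     raise RuntimeError("no viable location for this endpoint")
> ```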
> I haven’t yet tested whether `data://` URIs can be parsed directly into
> Flight Locations, but I believe modifying the code to support this should
> be straightforward if necessary.
>
> Cheers,
> Rusty
>
> P.S. My focus is primarily on DuckDB/Flight integration
> (https://github.com/Query-farm/duckdb-airport-extension), which is why
> DuckDB support is a key part of this proposal.
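For what it's worth, the parsing question at the end is easy to probe. The
untested sketch below base64-encodes the CSV payload from the message above
into an RFC 2397 data URI (note the RFC's scheme is `data:` with no slashes)
and checks whether pyarrow's `flight.Location` will accept it:

```python
import base64
import json

import pyarrow.flight as flight

# The CSV payload from the proposal, encoded as an RFC 2397 data URI.
metadata = {
    "format": "csv",
    "location": "s3://example-bucket/example-data.csv",
    "options": {"separator": "\t"},
}
encoded = base64.b64encode(json.dumps(metadata).encode("utf-8")).decode("ascii")
uri = f"data:application/json;base64,{encoded}"

# Check whether the Flight URI parser tolerates a non-gRPC scheme.
try:
    location = flight.Location(uri)
    print("parsed:", location.uri)
except Exception as exc:
    print("rejected by pyarrow:", exc)
```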