Hi Jacob and Matt,

I appreciate the opportunity to review the document. It helped clarify some things for me, but I'd like to propose a slightly different approach that still aligns with the overall goals. That said, I know you all have far more experience designing and building around Arrow Flight, so I appreciate your patience if this isn't the best idea.
Motivation

My goal is to enable Arrow Flight to reference data from arbitrary locations and formats. The main reasons for this are:

1. *Efficiency and Scalability* – Running Arrow Flight servers can be resource-intensive and difficult to scale. A CDN allows for a pay-per-byte model, whereas an Arrow Flight server incurs costs for both CPU and data transfer.
2. *Flexible Data Access* – Some datasets are static, while others require streaming from hot storage. For example, an event store might write data to Parquet on S3 after five minutes while keeping recent events in memory. Ideally, clients could be pointed directly at the Parquet files when appropriate.
3. *Format Flexibility* – Arrow IPC is great, but it isn't always the most efficient format for every use case, especially when data is already stored in other structured formats. While I ❤️ Arrow IPC, I'd like to support additional formats where it makes sense.

Proposed Approach

Like the proposal suggests, I'd like to extend FlightLocation URIs to provide richer context about data sources. Specifically, I propose two extensions using `data://` URIs within FlightEndpoints:

*1. Referencing Arbitrary Data Formats and Lakehouse Tables*

Flight URIs should be able to reference entire Delta Lake or Iceberg tables using their storage locations. This would also support any data format that DuckDB can read (e.g., Parquet, JSON, CSV) over various storage protocols (HTTPS, S3, GCS, Azure). This means a Flight's contents could be a combination of a Delta Lake/Iceberg table (at a specific snapshot) and traditional Flight endpoints returning Arrow IPC.

Example JSON for a Delta Lake table:

```json
{
  "format": "delta-lake",
  "location": "s3://example-bucket/example-table",
  "options": { "snapshot_id": "snapshot" }
}
```

Or for a CSV file:

```json
{
  "format": "csv",
  "location": "s3://example-bucket/example-data.csv",
  "options": { "separator": "\t" }
}
```

*2. Including Topological Metadata in Location URIs*

Currently, FlightLocation URIs provide little network topological context beyond the resource location, leaving clients to determine the most efficient source on their own. It would be useful to include metadata that helps clients make better decisions. One approach is to use a `data://` URI containing additional metadata:

```json
{
  "topology": { "datacenter": "dft" },
  "format": "csv",
  "location": "file:///mnt/nfs/example-data.csv"
}
```

These `data://` URIs would follow RFC 2397 for encoding. If an Arrow client doesn't support or recognize a `data://` URI, it should fall back to another available location, such as a standard Arrow Flight server that converts data on demand. If no viable locations exist, an exception should be raised.

Next Steps

I'd love to hear your thoughts on this approach and how it might integrate with the existing proposal. I haven't yet tested whether `data://` URIs can be parsed directly into Flight Locations, but I believe modifying the code to support this should be straightforward if necessary.

Cheers,

Rusty

P.S. My focus is primarily on DuckDB/Flight integration (https://github.com/Query-farm/duckdb-airport-extension), which is why DuckDB support is a key part of this proposal.
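P.P.S. To make the RFC 2397 encoding concrete, here's a rough Python sketch of how a client might pack the endpoint metadata JSON into a `data://` URI (using the `data://` spelling from this proposal) and fall back to another location when it can't decode one. The function names and the fallback `grpc://` location are illustrative only, not part of any existing Arrow Flight API.

```python
import base64
import json


def encode_location(metadata: dict) -> str:
    """Pack endpoint metadata JSON into an RFC 2397-style base64 data URI."""
    payload = base64.b64encode(json.dumps(metadata).encode("utf-8")).decode("ascii")
    return f"data://application/json;base64,{payload}"


def decode_location(uri: str):
    """Return the decoded metadata dict, or None if this isn't a data:// URI."""
    prefix = "data://application/json;base64,"
    if not uri.startswith(prefix):
        return None  # not a data:// location; caller may use the URI as-is
    return json.loads(base64.b64decode(uri[len(prefix):]))


def pick_location(locations: list):
    """Prefer a decodable data:// location; otherwise fall back to the first
    non-data:// location (e.g., a standard Arrow Flight server)."""
    for uri in locations:
        meta = decode_location(uri)
        if meta is not None:
            return meta
    for uri in locations:
        if not uri.startswith("data://"):
            return uri
    raise ValueError("no viable locations for this endpoint")
```

For instance, the CSV example above round-trips through `encode_location` / `decode_location` unchanged, and a client holding `[data_uri, "grpc://host:1234"]` would use the decoded metadata, falling back to the plain Flight location only when no `data://` entry decodes.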