This feels like it's trying to use Flight as a catalog service.

On Wed, Mar 26, 2025 at 6:04 PM Rusty Conover <ru...@conover.me.invalid>
wrote:

> Hi Jacob and Matt,
>
> I appreciate the opportunity to review the document. It helped clarify some
> things for me, but I’d like to propose an approach that is slightly
> different while still aligning with the overall goals. That said, I know
> you all have far more experience designing and implementing ideas around
> Arrow Flight, so I appreciate your patience if this isn’t the best idea.
>
> Motivation
>
> My goal is to enable Arrow Flight to reference data from arbitrary
> locations and formats. The main reasons for this are:
>
> 1. *Efficiency and Scalability* – Running Arrow Flight servers can be
> resource-intensive and difficult to scale. A CDN allows for a pay-per-byte
> model, whereas an Arrow Flight server incurs costs for both CPU and data
> transfer.
> 2. *Flexible Data Access* – Some datasets are static, while others require
> streaming from hot storage. For example, an event store may write data to
> Parquet on S3 after five minutes while keeping recent events in memory.
> Ideally, clients could be pointed directly to the Parquet files when
> appropriate.
> 3. *Format Flexibility* – Arrow IPC is great, but it’s not always the most
> efficient format for every use case, especially when data is already stored
> in other structured formats. While I ❤️ Arrow IPC, I’d like to support
> additional formats where it makes sense.
>
> Proposed Approach
>
> As the proposal suggests, I’d like to extend FlightLocation URIs to
> provide richer context about data sources. Specifically, I propose two
> extensions using `data://` URIs within FlightEndpoints:
>
>
> *1. Referencing Arbitrary Data Formats and Lakehouse Tables*
> Flight URIs should be able to reference entire Delta Lake or Iceberg tables
> using their storage locations. This would also support any data format that
> DuckDB can read (e.g., Parquet, JSON, CSV) over various storage protocols
> (HTTPS, S3, GCS, Azure).
>
> This means a Flight’s contents could be a combination of a Delta
> Lake/Iceberg table (at a specific snapshot) and traditional Flight
> endpoints returning Arrow IPC.
>
> Example JSON for a Delta Lake table:
>
> ```json
> {
>   "format": "delta-lake",
>   "location": "s3://example-bucket/example-table",
>   "options": { "snapshot_id": "snapshot" }
> }
> ```
>
> Or for a CSV file:
>
> ```json
> {
>   "format": "csv",
>   "location": "s3://example-bucket/example-data.csv",
>   "options": { "separator": "\t" }
> }
> ```
>
>
> *2. Including Topological Metadata in Location URIs*
> Currently, FlightLocation URIs provide little network topological context
> beyond the resource location, leaving clients to determine the most
> efficient source. It would be useful to include metadata that helps clients
> make better decisions.
>
> One approach is to use a `data://` URI containing additional metadata:
>
> ```json
> {
>   "topology": { "datacenter": "dft" },
>   "format": "csv",
>   "location": "file:///mnt/nfs/example-data.csv"
> }
> ```
>
> These `data://` URIs would follow RFC 2397 for encoding.
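>
> As a rough sketch (untested against the Flight URI parser, and note that
> RFC 2397 strictly spells the scheme `data:` with no `//` authority
> component), the descriptor above could be packed into such a URI with
> only the Python standard library; the media type and helper names here
> are illustrative, not part of the proposal:
>
> ```python
> import base64
> import json
>
> def to_data_uri(descriptor: dict) -> str:
>     """Encode a location descriptor as an RFC 2397 data URI."""
>     payload = json.dumps(descriptor, separators=(",", ":")).encode()
>     return "data:application/json;base64," + base64.b64encode(payload).decode()
>
> def from_data_uri(uri: str) -> dict:
>     """Recover the descriptor from a data URI produced above."""
>     header, _, body = uri.partition(",")
>     if not header.endswith(";base64"):
>         raise ValueError("expected a base64-encoded data URI")
>     return json.loads(base64.b64decode(body))
>
> uri = to_data_uri({
>     "topology": {"datacenter": "dft"},
>     "format": "csv",
>     "location": "file:///mnt/nfs/example-data.csv",
> })
> ```
>
> A client that recognizes the scheme would decode the JSON and dispatch
> on the `format` field.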
>
> If an Arrow client doesn’t support or recognize a `data://` URI, it should
> fall back to another available location—such as a standard Arrow Flight
> server that converts data on demand. If no viable locations exist, an
> exception should be raised.
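>
> In pseudocode terms, that fallback could look like the sketch below (the
> location list and the `supports_data_uri` flag are placeholders for
> whatever the client actually knows about an endpoint):
>
> ```python
> def pick_location(locations: list[str], supports_data_uri: bool) -> str:
>     """Return the first usable location URI for an endpoint.
>
>     data: URIs are preferred when the client understands them;
>     otherwise fall through to conventional Flight locations.
>     """
>     for uri in locations:
>         if uri.startswith("data:"):
>             if supports_data_uri:
>                 return uri
>             continue  # unsupported scheme: try the next location
>         return uri  # e.g. a grpc:// server converting on demand
>     raise RuntimeError("no viable location for this endpoint")
> ```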
>
> Next Steps
>
> I’d love to hear your thoughts on this approach and how it might integrate
> with the existing proposal. I haven’t yet tested whether `data://` URIs can
> be parsed directly into Flight Locations, but I believe modifying the code
> to support this should be straightforward if necessary.
>
> Cheers,
> Rusty
>
> P.S. My focus is primarily on DuckDB/Flight integration (
> https://github.com/Query-farm/duckdb-airport-extension), which is why
> DuckDB support is a key part of this proposal.
>
