Hi Jacob and Matt,

I appreciate the opportunity to review the document. It helped clarify some
things for me, but I’d like to propose an approach that is slightly
different while still aligning with the overall goals. That said, I know
you all have far more experience designing and implementing ideas around
Arrow Flight, so I appreciate your patience if this isn’t the best idea.

Motivation

My goal is to enable Arrow Flight to reference data from arbitrary
locations and formats. The main reasons for this are:

1. *Efficiency and Scalability* – Running Arrow Flight servers can be
resource-intensive and difficult to scale. A CDN allows for a pay-per-byte
model, whereas an Arrow Flight server incurs costs for both CPU and data
transfer.
2. *Flexible Data Access* – Some datasets are static, while others require
streaming from hot storage. For example, an event store may write data to
Parquet on S3 after five minutes while keeping recent events in memory.
Ideally, clients could be pointed directly to the Parquet files when
appropriate.
3. *Format Flexibility* – Arrow IPC is great, but it’s not always the most
efficient format for every use case, especially when data is already stored
in other structured formats. While I ❤️ Arrow IPC, I’d like to support
additional formats where it makes sense.

Proposed Approach

As the proposal suggests, I’d like to extend FlightLocation URIs to
provide richer context about data sources. Specifically, I propose two
extensions using `data://` URIs within FlightEndpoints:

*1. Referencing Arbitrary Data Formats and Lakehouse Tables*
Flight URIs should be able to reference entire Delta Lake or Iceberg tables
using their storage locations. This would also support any data format that
DuckDB can read (e.g., Parquet, JSON, CSV) over various storage protocols
(HTTPS, S3, GCS, Azure).

This means a Flight’s contents could be a combination of a Delta
Lake/Iceberg table (at a specific snapshot) and traditional Flight
endpoints returning Arrow IPC.

Example JSON for a Delta Lake table:

```json
{
  "format": "delta-lake",
  "location": "s3://example-bucket/example-table",
  "options": { "snapshot_id": "snapshot" }
}
```

Or for a CSV file:

```json
{
  "format": "csv",
  "location": "s3://example-bucket/example-data.csv",
  "options": { "separator": "\t" }
}
```

*2. Including Topological Metadata in Location URIs*
Currently, FlightLocation URIs provide little network topological context
beyond the resource location, leaving clients to determine the most
efficient source. It would be useful to include metadata that helps clients
make better decisions.

One approach is to use a `data://` URI containing additional metadata:

```json
{
  "topology": { "datacenter": "dft" },
  "format": "csv",
  "location": "file:///mnt/nfs/example-data.csv"
}
```

These `data://` URIs would follow RFC 2397 for encoding.
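To make the encoding concrete, here is a minimal sketch of how the JSON metadata above could be packed into and unpacked from an RFC 2397 data URI. The `application/json` media type, base64 encoding, and the helper names are my assumptions, not part of the proposal:

```python
import base64
import json
from urllib.parse import unquote

def encode_location(metadata: dict) -> str:
    # Hypothetical helper: serialize endpoint metadata as JSON and wrap
    # it in an RFC 2397 data URI with base64 encoding.
    payload = base64.b64encode(json.dumps(metadata).encode()).decode()
    return f"data:application/json;base64,{payload}"

def decode_location(uri: str) -> dict:
    # Split the RFC 2397 header ("data:<mediatype>[;base64]") from the payload.
    header, _, payload = uri.partition(",")
    if not header.startswith("data:"):
        raise ValueError("not a data URI")
    if header.endswith(";base64"):
        raw = base64.b64decode(payload)
    else:
        raw = unquote(payload).encode()  # percent-encoded variant
    return json.loads(raw)

meta = {"format": "csv", "location": "s3://example-bucket/example-data.csv"}
uri = encode_location(meta)
assert decode_location(uri) == meta
```

A client that already parses data URIs could therefore recover the `format`, `location`, and `options` fields without any new wire protocol.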

If an Arrow client doesn’t support or recognize a `data://` URI, it should
fall back to another available location—such as a standard Arrow Flight
server that converts data on demand. If no viable locations exist, an
exception should be raised.
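The fallback behavior might look something like the following sketch. The function name, the supported-format set, and the assumption that `data://` payloads are base64-encoded JSON are all illustrative, not prescribed by the proposal:

```python
import base64
import json

# Formats this hypothetical client knows how to read directly.
SUPPORTED_FORMATS = {"csv", "parquet", "delta-lake", "arrow-ipc"}

def pick_location(locations: list[str]) -> str:
    # Scan an endpoint's locations in order; return the first usable one.
    for uri in locations:
        if uri.startswith("data:"):
            # Assumed encoding: base64 JSON metadata per RFC 2397.
            payload = uri.split(",", 1)[1]
            meta = json.loads(base64.b64decode(payload))
            if meta.get("format") in SUPPORTED_FORMATS:
                return uri
            # Unsupported format: skip and keep looking.
        else:
            # A conventional Flight server location is always usable.
            return uri
    raise RuntimeError("no viable Flight locations for this endpoint")
```

A client that doesn’t understand `data://` URIs at all would simply skip them and land on the first conventional Flight location, which is the fallback path described above.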

Next Steps

I’d love to hear your thoughts on this approach and how it might integrate
with the existing proposal. I haven’t yet tested whether `data://` URIs can
be parsed directly into Flight Locations, but I believe modifying the code
to support this should be straightforward if necessary.

Cheers,
Rusty

P.S. My focus is primarily on DuckDB/Flight integration (
https://github.com/Query-farm/duckdb-airport-extension), which is why
DuckDB support is a key part of this proposal.
