Re: [DISCUSS] Add arrow.range canonical extension type for bounded ranges

Curt Hagenlocher Sun, 24 May 2026 20:28:18 -0700

From what I can tell, this would not be sufficiently flexible to store
PostgreSQL range columns, for which the boundary flags are per-value and not
per-column. Is this intentional?
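
To make the concern concrete, here is a minimal sketch in plain Python. It
uses no Arrow or PostgreSQL API: the helper names are hypothetical, and the
literals only imitate PostgreSQL's text form for a continuous range type such
as `numrange`. It maps each value's brackets onto the proposal's `closed`
vocabulary and collects the distinct results for one column:

```python
# Bracket pairs mapped onto the proposal's `closed` vocabulary.
CLOSED_BY_BRACKETS = {
    ("[", ")"): "left",
    ("(", "]"): "right",
    ("[", "]"): "both",
    ("(", ")"): "neither",
}


def closed_flag(literal: str) -> str:
    """Return the `closed` value implied by one literal such as '[1.5,2.5)'."""
    text = literal.strip()
    return CLOSED_BY_BRACKETS[(text[0], text[-1])]


def column_closed_values(literals):
    """Collect the distinct `closed` values that occur within one column."""
    return {closed_flag(value) for value in literals}


# Hypothetical column whose values carry different boundary flags.
column = ["[1.5,2.5)", "(0.0,1.0]", "[3.0,3.0]"]
flags = column_closed_values(column)
```

If `flags` holds more than one entry, no single per-column `closed` value
describes every row, so such a column would have to be normalized or split
before it could be stored as `arrow.range` as currently proposed.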


On Sun, May 24, 2026 at 4:11 PM Florian R. Hölzlwimmer <
[email protected]> wrote:

> Hi all,
>
> Following a suggestion from @rok on GitHub, I'd like to open a discussion
> about adding a canonical extension type for bounded ranges to Arrow.
>
> Background
> ==========
>
> So far, Arrow has no canonical way to represent a bounded range (a
> mathematical interval with a lower and an upper endpoint), e.g. a numeric
> range `[0, 10)`, a date range, or a timestamp period. Today such data is
> modeled ad hoc with two separate columns or with system-specific extension
> types, which hurts interoperability. A canonical range type will be useful
> to libraries like pandas, Polars/Polars-bio, IRanges/PyRanges, database
> connectors, ...
>
> This is distinct from Arrow's existing calendar `Interval` type
> (`INTERVAL_MONTHS` / `INTERVAL_DAY_TIME` / `INTERVAL_MONTH_DAY_NANO`),
> which represents a duration (a signed amount of time), not a bounded set.
> Databases like PostgreSQL make the same distinction: SQL uses `INTERVAL`
> for durations and `RANGE` / `PERIOD` for bounded sets. This proposal
> follows that convention by naming the type `arrow.range`.
>
>
> Proposed design
> ===============
>
>    * Extension name: `arrow.range`.
>
>    * Storage type: `Struct<lower: T, upper: T>`. When subtype `T` is
> nullable, a null bound represents an unbounded (infinite) endpoint.
>
>        * Field names `lower` / `upper` follow PostgreSQL's convention for
> ordering clarity (pandas uses `left` / `right`).
>        * The subtype `T` may be any orderable Arrow type (the numeric,
> temporal and decimal families, etc.). Nested or non-comparable types are
> out of scope.
>
>    * Metadata: a JSON object `{"closed": "..."}` where `closed` is one of
> `left`, `right`, `both`, `neither` (pandas vocabulary; `left` = lower
> inclusive / upper exclusive, etc.). Required on the wire so a serialized
> `arrow.range` is always unambiguous. Unknown JSON keys are ignored for
> forward compatibility.
>
>    * A range is implicitly empty when `lower > upper`, or when `lower ==
> upper` with at least one bound exclusive. A range with `lower > upper` is
> therefore valid (it denotes the empty set), not an error.
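
The emptiness rule in the last bullet can be sketched in plain Python as
follows. This is only an illustration, not code from the draft
implementation: the function name `is_empty` is hypothetical, `None` stands
in for a null bound, and treating a range with a null bound as never empty is
an assumption, since the proposal only says that a null bound is unbounded.

```python
from typing import Optional

VALID_CLOSED = ("left", "right", "both", "neither")


def is_empty(lower: Optional[float], upper: Optional[float], closed: str) -> bool:
    """Apply the proposal's implicit-emptiness rule to a single range."""
    if closed not in VALID_CLOSED:
        raise ValueError(f"unknown closed value: {closed!r}")
    # Assumption: a null (None) bound is unbounded, so the range is not empty.
    if lower is None or upper is None:
        return False
    # lower > upper denotes the empty set rather than an error.
    if lower > upper:
        return True
    # Equal bounds contain a point only when both ends are inclusive.
    if lower == upper:
        return closed != "both"
    return False
```

For example, `is_empty(5, 5, "left")` is true because the upper bound is
exclusive, while `is_empty(5, 5, "both")` is false because the range contains
exactly the point 5.
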
>
> This mirrors pandas' interval support closely enough that `arrow.range`
> would give `pandas.IntervalArray` / `IntervalIndex` a natural, lossless
> Arrow representation for round-tripping.
>
>
> References
> ==========
>
>    - Full proposal and rationale:
>      https://github.com/apache/arrow/issues/50027
>    - Draft C++/Python implementation:
>      https://github.com/apache/arrow/pull/50028
>
> I'd appreciate any feedback on the overall direction and on the specific
> design choices: field naming (`lower`/`upper` vs. `left`/`right`), the
> `closed` parameter, the treatment of unbounded endpoints via nullability,
> and the set of supported subtypes.
>
> Many thanks,
> Hoeze
>
