From what I can tell, this would not be flexible enough to store PostgreSQL range columns, where the boundary flags are per-value rather than per-column. Is this intentional?
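
To make the concern concrete, here is a minimal sketch in plain Python (no Arrow involved). The names `PgRange` and `column_closed` are hypothetical and exist only for illustration. It models range values that each carry their own inclusivity flags, as a PostgreSQL `numrange` column can when it holds both `[0,10)` and `(0,10]`, and checks whether a whole column could be described by a single `closed` value as the proposal requires:

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for a PostgreSQL range value: the inclusivity
# flags travel with each value, not with the column.
@dataclass(frozen=True)
class PgRange:
    lower: Optional[float]
    upper: Optional[float]
    lower_inc: bool
    upper_inc: bool


# Maps (lower_inc, upper_inc) to the proposal's `closed` vocabulary.
CLOSED = {
    (True, False): "left",
    (False, True): "right",
    (True, True): "both",
    (False, False): "neither",
}


def column_closed(values):
    """Return the one `closed` value shared by all ranges, or None if they disagree."""
    kinds = {CLOSED[(v.lower_inc, v.upper_inc)] for v in values}
    return kinds.pop() if len(kinds) == 1 else None


uniform = [PgRange(0, 10, True, False), PgRange(5, 7, True, False)]
mixed = [PgRange(0, 10, True, False), PgRange(0, 10, False, True)]

print(column_closed(uniform))  # left
print(column_closed(mixed))    # None
```

The `mixed` column has no single `closed` value, so as far as I can see it could not be stored in one `arrow.range` column without rewriting bounds or splitting the data.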

On Sun, May 24, 2026 at 4:11 PM Florian R. Hölzlwimmer <[email protected]> wrote:
> Hi all,
>
> Following a suggestion from @rok on GitHub, I'd like to open a discussion
> about adding a canonical extension type for bounded ranges to Arrow.
>
> Background
> ==========
>
> So far, Arrow has no canonical way to represent a bounded range (a
> mathematical interval with a lower and an upper endpoint), e.g. a numeric
> range `[0, 10)`, a date range, or a timestamp period. Today such data is
> modeled ad hoc with two separate columns or with system-specific extension
> types, which hurts interoperability. A canonical range type will be useful
> to libraries like Pandas, Polars/Polars-bio, IRanges/PyRanges, database
> connectors, ...
>
> This is distinct from Arrow's existing calendar `Interval` type
> (`INTERVAL_MONTHS` / `INTERVAL_DAY_TIME` / `INTERVAL_MONTH_DAY_NANO`),
> which represents a duration (a signed amount of time), not a bounded set.
> Databases like PostgreSQL make the same distinction: SQL uses `INTERVAL`
> for durations and `RANGE` / `PERIOD` for bounded sets. This proposal
> follows that convention by naming the type `arrow.range`.
>
>
> Proposed design
> ===============
>
> * Extension name: `arrow.range`.
>
> * Storage type: `Struct<lower: T, upper: T>`. When subtype `T` is
> nullable, a null bound represents an unbounded (infinite) endpoint.
>
> * Field names `lower` / `upper` follow PostgreSQL's convention for
> ordering clarity (Pandas uses `left` / `right`).
> * The subtype `T` may be any orderable Arrow type (the numeric,
> temporal and decimal families, etc.). Nested or non-comparable types are
> out of scope.
>
> * Metadata: a JSON object `{"closed": "..."}` where `closed` is one of
> `left`, `right`, `both`, `neither` (pandas vocabulary; `left` = lower
> inclusive / upper exclusive, etc.). Required on the wire so a serialized
> `arrow.range` is always unambiguous. Unknown JSON keys are ignored for
> forward compatibility.
>
> * A range is empty implicitly when `lower > upper`, or when `lower ==
> upper` with at least one bound exclusive. A range with `lower > upper` is
> therefore valid (it denotes the empty set), not an error.
>
> This mirrors pandas' interval support closely enough that `arrow.range`
> would give `pandas.IntervalArray` / `IntervalIndex` a natural, lossless
> Arrow representation for round-tripping.
>
>
> References
> ==========
>
> - Full proposal and rationale:
> https://github.com/apache/arrow/issues/50027
> - Draft C++/Python implementation:
> https://github.com/apache/arrow/pull/50028
>
> I'd appreciate any feedback on the overall direction and on the specific
> design choices: field naming (`lower`/`upper` vs. `left`/`right`), the
> `closed` parameter, the treatment of unbounded endpoints via nullability,
> and the set of supported subtypes.
>
> Many thanks,
> Hoeze
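
For reference, this is how I read the emptiness rule in the quoted proposal, as a plain-Python sketch rather than anything taken from the draft implementation. The function name `is_empty` and the `INCLUSIVE` table are my own, and treating a range with a null bound as never empty is an assumption on my part (the proposal only says a null bound is unbounded):

```python
# `closed` values from the proposal, mapped to (lower_inclusive, upper_inclusive).
INCLUSIVE = {
    "left": (True, False),
    "right": (False, True),
    "both": (True, True),
    "neither": (False, False),
}


def is_empty(lower, upper, closed):
    """Emptiness check for one (lower, upper) pair under a column-level `closed`."""
    if closed not in INCLUSIVE:
        raise ValueError(f"unknown closed value: {closed!r}")
    # Assumption: None stands for a null (unbounded) endpoint, so the
    # range extends to infinity on that side and cannot be empty.
    if lower is None or upper is None:
        return False
    if lower > upper:
        return True
    lower_inc, upper_inc = INCLUSIVE[closed]
    # Equal bounds are non-empty only when both ends are inclusive.
    return lower == upper and not (lower_inc and upper_inc)


print(is_empty(0, 10, "left"))       # False
print(is_empty(5, 5, "left"))        # True
print(is_empty(5, 5, "both"))        # False
print(is_empty(7, 3, "both"))        # True
print(is_empty(None, 3, "neither"))  # False
```

Note that `closed` is an argument shared by every row here, which is exactly the per-column restriction I am asking about.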
