Re: [DISCUSS] Add arrow.range canonical extension type for bounded ranges

Hoeze Mon, 25 May 2026 07:54:58 -0700

Yes, you're right, the current proposal would probably not be
sufficient for continuous PostgreSQL ranges.


Column-level boundary flags were intentional, as they allow closedness
to be checked in the schema rather than at runtime. This is also how
pandas' `IntervalArray`/`IntervalIndex` work.

PostgreSQL's built-in discrete ranges (`int4range`, `int8range`,
`daterange`) canonicalize to left-closed intervals; here my proposal
would be sufficient. However, continuous ranges (`numrange`,
`tsrange`, `tstzrange`, ...) cannot be canonicalized. In this case
my proposal would indeed not be flexible enough.
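To illustrate why discrete ranges fit a single column-level flag, here is a
minimal sketch of the canonicalization step for integer ranges. The function
name `canonicalize_int_range` is hypothetical and is not part of PostgreSQL
or of the proposal. It relies on integers having a successor (`+ 1`), which
is exactly what continuous types lack:

```python
from typing import Optional, Tuple


def canonicalize_int_range(
    lower: Optional[int],
    upper: Optional[int],
    lower_inc: bool,
    upper_inc: bool,
) -> Tuple[Optional[int], Optional[int]]:
    """Rewrite an integer range into left-closed form [lower, upper).

    A bound of None stands for an unbounded endpoint and is left untouched.
    Hypothetical helper for illustration only.
    """
    if lower is not None and not lower_inc:
        lower += 1  # (a, ...  becomes  [a + 1, ...
    if upper is not None and upper_inc:
        upper += 1  # ..., b]  becomes  ..., b + 1)
    return lower, upper


# (1, 5] over the integers contains 2..5, which is the same set as [2, 6).
print(canonicalize_int_range(1, 5, lower_inc=False, upper_inc=True))
```

For `numrange` or `tsrange` there is no "next value" to step to, so `(1, 5]`
and `[1, 5]` stay distinct and the flags have to travel with each value.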

I could imagine a number of possible solutions to this shortcoming:

  * Union type of all four closedness versions:
    Possible but not very elegant. It would shift the implementation
    burden onto applications, which would have to support union
    types.

  * Create a separate canonical data type for per-value boundary
    flags:
    Storage type `Struct<lower: T, upper: T, lower_inc: bool,
    upper_inc: bool>`, mirroring PostgreSQL's internal
    representation. Both types would coexist: `arrow.range` for the
    uniform case (and for canonicalized discrete PostgreSQL ranges),
    and e.g. `arrow.range_inc` for continuous (PostgreSQL) ranges.

  * Extend `arrow.range` itself with a per-value mode:
    Keep a single extension type, but allow
    `{"closed": "per_value"}` in the metadata, in which case the
    storage struct gains two boolean fields `lower_inc` and
    `upper_inc`. One extension name, two storage layouts. Simpler
    from a type-registry standpoint, slightly more conditional logic
    in implementations.

  * Always store per-value flags:
    Drop the metadata key entirely and always use
    `Struct<lower: T, upper: T, lower_inc: bool, upper_inc: bool>`.
    Two extra bits per row uncompressed (Arrow booleans are
    bit-packed), but highly RLE/dictionary-friendly when uniform
    (which it usually is). Maximally simple to specify, at the cost
    of some overhead in the common pandas-style case.
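A small sketch of how the column-level `closed` value relates to the
per-value `lower_inc` / `upper_inc` flags discussed in the options above.
Rows are modeled as plain tuples rather than Arrow arrays, and the names
`CLOSED_TO_FLAGS`, `to_per_value` and `uniform_closed` are hypothetical,
chosen only for this illustration:

```python
# Column-level vocabulary mapped to (lower_inc, upper_inc).
CLOSED_TO_FLAGS = {
    "left": (True, False),
    "right": (False, True),
    "both": (True, True),
    "neither": (False, False),
}
FLAGS_TO_CLOSED = {flags: name for name, flags in CLOSED_TO_FLAGS.items()}


def to_per_value(rows, closed):
    """Expand (lower, upper) rows with one shared `closed` value into
    (lower, upper, lower_inc, upper_inc) rows. Always lossless."""
    lower_inc, upper_inc = CLOSED_TO_FLAGS[closed]
    return [(lo, hi, lower_inc, upper_inc) for lo, hi in rows]


def uniform_closed(rows):
    """Return the shared `closed` value if every row carries the same
    flags, else None (the column needs per-value flags)."""
    flags = {(lower_inc, upper_inc) for _, _, lower_inc, upper_inc in rows}
    if len(flags) == 1:
        return FLAGS_TO_CLOSED[flags.pop()]
    return None


expanded = to_per_value([(0, 10), (5, 7)], "left")
print(uniform_closed(expanded))
print(uniform_closed([(0, 1, True, False), (0, 1, True, True)]))
```

The conversion from the uniform layout to the per-value layout always
succeeds, while the reverse only works when `uniform_closed` finds a single
flag combination.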

I currently lean towards the second option, as it preserves the
schema-level check for the common case while still giving continuous
ranges with per-value closedness a lossless path. The fixed-shape and
variable-shape tensor extension types went the same route. The main
alternative would be option 3, but a single extension name covering
two storage layouts ties the layout to a JSON metadata field rather
than to the type name itself, which I believe is easier for downstream
tooling to get wrong. What do you think?

Best,
Hoeze


On 25.05.26 at 05:27, Curt Hagenlocher wrote:

From what I can tell, this would not be sufficiently flexible to store
PostgreSQL range columns for which the boundary flags are per-value and not
per-column. Is this intentional?

On Sun, May 24, 2026 at 4:11 PM Florian R. Hölzlwimmer <
[email protected]> wrote:

Hi all,

Following a suggestion from @rok on GitHub, I'd like to open a discussion
about adding a canonical extension type for bounded ranges to Arrow.

Background
==========

So far, Arrow has no canonical way to represent a bounded range (a
mathematical interval with a lower and an upper endpoint), e.g. a numeric
range `[0, 10)`, a date range, or a timestamp period. Today such data is
modeled ad hoc with two separate columns or with system-specific extension
types, which hurts interoperability. A canonical range type will be useful
to libraries like Pandas, Polars/Polars-bio, IRanges/PyRanges, database
connectors, ...

This is distinct from Arrow's existing calendar `Interval` type
(`INTERVAL_MONTHS` / `INTERVAL_DAY_TIME` / `INTERVAL_MONTH_DAY_NANO`),
which represents a duration (a signed amount of time), not a bounded set.
Databases like PostgreSQL make the same distinction: SQL uses `INTERVAL`
for durations and `RANGE` / `PERIOD` for bounded sets. This proposal
follows that convention by naming the type `arrow.range`.


Proposed design
===============

    * Extension name: `arrow.range`.

    * Storage type: `Struct<lower: T, upper: T>`. When subtype `T` is
      nullable, a null bound represents an unbounded (infinite) endpoint.

        * Field names `lower` / `upper` follow PostgreSQL's convention for
          ordering clarity (Pandas uses `left` / `right`).
        * The subtype `T` may be any orderable Arrow type (the numeric,
          temporal and decimal families, etc.). Nested or non-comparable
          types are out of scope.

    * Metadata: a JSON object `{"closed": "..."}` where `closed` is one of
      `left`, `right`, `both`, `neither` (pandas vocabulary; `left` = lower
      inclusive / upper exclusive, etc.). Required on the wire so a
      serialized `arrow.range` is always unambiguous. Unknown JSON keys are
      ignored for forward compatibility.

    * A range is implicitly empty when `lower > upper`, or when `lower ==
      upper` with at least one bound exclusive. A range with `lower > upper`
      is therefore valid (it denotes the empty set), not an error.
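The metadata and emptiness rules above can be sketched in a few lines of
plain Python. This is only an illustration of the rules as stated, not the
draft implementation. The helper names `parse_closed` and `is_empty` are
hypothetical, and treating a range with a null (unbounded) endpoint as
never empty is an assumption drawn from the wording above:

```python
import json

VALID_CLOSED = {"left", "right", "both", "neither"}


def parse_closed(metadata: str) -> str:
    """Read the required `closed` key from the serialized metadata.
    Unknown keys are ignored, as the proposal asks."""
    obj = json.loads(metadata)
    if not isinstance(obj, dict) or obj.get("closed") not in VALID_CLOSED:
        raise ValueError("metadata must be a JSON object with a valid 'closed'")
    return obj["closed"]


def is_empty(lower, upper, closed: str) -> bool:
    """Apply the implicit-emptiness rule to a single range value."""
    if lower is None or upper is None:
        return False  # assumption: an unbounded endpoint is never empty
    if lower > upper:
        return True  # valid value, denotes the empty set
    # Equal bounds contain one point only if both ends are inclusive.
    return lower == upper and closed != "both"


closed = parse_closed('{"closed": "left", "future_key": 1}')
print(closed, is_empty(3, 3, closed), is_empty(0, 10, closed))
```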

This mirrors pandas' interval support closely enough that `arrow.range`
would give `pandas.IntervalArray` / `IntervalIndex` a natural, lossless
Arrow representation for round-tripping.


References
==========

    - Full proposal and rationale:
      https://github.com/apache/arrow/issues/50027
    - Draft C++/Python implementation:
      https://github.com/apache/arrow/pull/50028

I'd appreciate any feedback on the overall direction and on the specific
design choices: field naming (`lower`/`upper` vs. `left`/`right`), the
`closed` parameter, the treatment of unbounded endpoints via nullability,
and the set of supported subtypes.

Many thanks,
Hoeze
