Re: [Question] Rationale for offsets instead of deltas

Wes McKinney Fri, 18 Jun 2021 07:11:00 -0700

On Fri, Jun 18, 2021 at 1:12 AM Micah Kornfield <[email protected]> wrote:
>
> >
> > Is it to ensure O(1) random access (instead of having to sum all
> > deltas up to the index)?
>
>
> This is my understanding of why it was chosen.


Yes, that's the reason. For example, certain columnar query processing
patterns (e.g. selection vectors) depend on random access. We made the
stipulation that all Arrow data types would support O(1) random access
to broaden use cases.
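
To make the difference concrete, here is a minimal sketch in plain Python
(not the Arrow API) using the example from Jorge's message quoted below.
The names `validity`, `get_with_offsets`, and `get_with_lengths` are
hypothetical, and the validity list is only a stand-in for Arrow's validity
bitmap. With offsets, a lookup reads two adjacent entries; with lengths, it
first has to sum every preceding entry.

```python
from itertools import accumulate

# Example from the thread: ["a", "", None, "ab"]
values = b"aab"
offsets = [0, 1, 1, 1, 3]               # n + 1 entries, non-decreasing
lengths = [1, 0, 0, 2]                  # n entries
validity = [True, True, False, True]    # stand-in for the validity bitmap


def get_with_offsets(i):
    """O(1): the slice bounds are two adjacent offsets."""
    if not validity[i]:
        return None
    return values[offsets[i]:offsets[i + 1]].decode("utf-8")


def get_with_lengths(i):
    """O(i): the start position is the sum of all earlier lengths."""
    if not validity[i]:
        return None
    start = sum(lengths[:i])
    return values[start:start + lengths[i]].decode("utf-8")


# Offsets are the running sum of lengths, with a leading zero.
assert list(accumulate(lengths, initial=0)) == offsets

# A selection vector picks arbitrary rows, so each lookup must be cheap.
selection = [3, 0]
print([get_with_offsets(i) for i in selection])
```

Both functions return the same values; the difference is only in how much
work each lookup does, which is what matters for selection vectors.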

>
> On Thu, Jun 17, 2021 at 10:32 PM Jorge Cardoso Leitão <
> [email protected]> wrote:
>
> > Hi,
> >
> > (this has no direction; I am just genuinely curious)
> >
> > I am wondering, what is the rationale for using "offsets" instead of
> > "lengths" to represent variable-sized arrays?
> >
> > I.e. ["a", "", None, "ab"] is represented as
> >
> > offsets: [0, 1, 1, 1, 3]
> > values: "aab"
> >
> > what is the reasoning to use this over
> >
> > lengths: [1, 0, 0, 2]
> > values: "aab"
> >
> > I am asking this because I have seen people using the LargeUtf8 type,
> > or breaking record batches into chunks, to avoid hitting the i32
> > ceiling on the offsets of large string arrays.
> >
> > Is it to ensure O(1) random access (instead of having to sum all
> > deltas up to the index)?
> >
> > Best,
> > Jorge
> >
