On Fri, Jun 18, 2021 at 1:12 AM Micah Kornfield <[email protected]> wrote:
>
> >
> > Is it to ensure O(1) random access (instead of having to sum all
> > deltas up to the index)?
>
>
> This is my understanding of why it was chosen.

Yes, that's the reason. For example, certain columnar query processing
patterns (e.g. selection vectors) depend on random access. We stipulated
that all Arrow data types must support O(1) random access in order to
broaden the range of use cases.

>
> On Thu, Jun 17, 2021 at 10:32 PM Jorge Cardoso Leitão <[email protected]> wrote:
>
> > Hi,
> >
> > (this has no direction; I am just genuinely curious)
> >
> > I am wondering, what is the rationale to use "offsets" instead of
> > "lengths" to represent variable sized arrays?
> >
> > I.e. ["a", "", None, "ab"] is represented as
> >
> > offsets: [0, 1, 1, 1, 3]
> > values: "aab"
> >
> > what is the reasoning to use this over
> >
> > lengths: [1, 0, 0, 2]
> > values: "aab"
> >
> > I am asking this because I have seen people using the LargeUtf8 type,
> > or breaking Record batches in chunks, to avoid hitting the ceiling of
> > i32 of large arrays with strings.
> >
> > Is it to ensure O(1) random access (instead of having to sum all
> > deltas up to the index)?
> >
> > Best,
> > Jorge
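
To make the trade-off discussed in this thread concrete, here is a minimal sketch in plain Python (not the Arrow implementation). It uses the example array from the original question. The function names and the `validity` list are hypothetical, and a Python `str` and a list of booleans stand in for Arrow's UTF-8 byte buffer and validity bitmap.

```python
from itertools import accumulate

# Example from the thread: ["a", "", None, "ab"]
values = "aab"
offsets = [0, 1, 1, 1, 3]             # len(array) + 1 entries
lengths = [1, 0, 0, 2]                # len(array) entries
validity = [True, True, False, True]  # assumption: a plain list, not a bitmap


def get_with_offsets(values, offsets, validity, i):
    """O(1): two adjacent offsets bound slot i directly."""
    if not validity[i]:
        return None
    return values[offsets[i]:offsets[i + 1]]


def get_with_lengths(values, lengths, validity, i):
    """O(i): the start position is the sum of all preceding lengths."""
    if not validity[i]:
        return None
    start = sum(lengths[:i])
    return values[start:start + lengths[i]]


def lengths_to_offsets(lengths):
    """Offsets are the running sum of lengths, with a leading 0."""
    return [0, *accumulate(lengths)]


# A selection vector picks arbitrary slots, so each lookup should be cheap.
selection = [3, 0]
selected = [get_with_offsets(values, offsets, validity, i) for i in selection]
```

Both layouts decode to the same strings, but only the offsets layout answers each lookup in the selection vector without scanning earlier slots. The cost is the one raised in the question: the last offset equals the total length of the values buffer, so it has to fit in the offset type, whereas a length only has to cover a single element.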
