hi folks, a few responses here in the general discussion, before certain topics have to be split off into their own dedicated discussions.
It seems that I didn't do a very good job of motivating the "sequence view" type. Let me take a step back and discuss one of the problems these new memory layouts are solving.

In Arrow currently, selection operations ("take", "filter", or indirect sort, i.e. the equivalent of arr.take(argsort(something_else)) if you're coming from NumPy) have time complexity proportional to the number of records for primitive types, but proportional to max(# records, memory size) for nested types. So, for example:

* Take(arr, indices) has O(# records) complexity for primitive types and does O(# records) memory allocation
* Take(arr, indices) has O(max(# records, size of memory buffers / child arrays)) complexity for strings and nested types and does O(size of memory buffers) memory allocation

This means that columnar query engines that leverage selections can incur heavy costs, both in time complexity and in memory use, when doing selections on non-primitive array data. Selections may arise from filtering, sorting, or other operations.

The "string view" and "sequence view" memory layouts in this document do not have this problem. When using them for strings and nested data, selections have the same time complexity and memory allocation behavior as for primitive types, and the "child" memory buffers do not have to be manipulated or rebuilt at all. This brings significant performance benefits and reduced memory use.

Additionally, the string view and sequence view layouts solve the problem of out-of-order construction. As has been pointed out, one way to work around this issue at present is to use "chunked arrays". However, this means that you cannot ever use thread parallelism in the construction of non-chunked outputs with nested data (for example, in expression evaluation): if a nested array forms part of a record batch, then either you must stick to single-threaded execution, or you must use thread parallelism to subdivide even the non-nested fields of the record batch so that all fields end up with equal-sized arrays. For example, if you had a record batch with 32K rows and wanted to parallelize execution of a projection using 4 threads, you would need to divide all fields into chunks of 8K rows each before beginning to produce outputs. This is fairly inflexible. As another motivating example, consider a parallel selection operation (e.g. "take" or "filter") on a nested array: currently it is not possible to parallelize it at all because of the in-order construction requirement.

I don't expect you to just take my word for it, so here is an example:

https://gist.github.com/wesm/25fc7b877f913c7e4449117178302646

In this example, I use Take to permute 1M doubles and 1M strings of 50 bytes each:

* Doubles: 2.45ms (new memory allocated: 8000000)
* Strings: 39.6ms (new memory allocated: 54000000)

The performance ratio is 16x even though the memory ratio is only ~7x. With the "StringView" data type, only 16000000 bytes of new memory would need to be allocated, and it should be only 2-4x slower than the doubles case (because we only need to relocate a bunch of 16-byte structs) rather than 16x slower.

I hope you can see now that this can be a rather serious resource utilization issue, both in processing time and in memory use. I will update the document to explain this better and work on responding to some of the other comments.
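To make this more concrete, here is a rough C++ sketch of the kind of 16-byte view struct that DuckDB and Velox use, and how a selection over it behaves. The field names and the exact inline-string threshold below are just my own illustration, not necessarily what we would standardize:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Illustrative sketch only: a 16-byte "view" into string data. Short
    // strings are stored fully inline; longer strings keep a 4-byte prefix
    // for cheap "not equal" checks plus a reference into a shared buffer.
    struct StringView {
      int32_t length;
      union Payload {
        char inlined[12];        // strings of <= 12 bytes stored inline
        struct Ref {
          char prefix[4];        // first 4 bytes, for early-out comparisons
          int32_t buffer_index;  // which shared character buffer holds the data
          int32_t offset;        // byte offset of the string within that buffer
        } ref;
      } data;
    };
    static_assert(sizeof(StringView) == 16, "views stay a fixed 16 bytes");

    // A selection over views only relocates the fixed-size structs; the
    // character buffers are shared with the input and never copied or
    // rebuilt, so the cost is O(# selected records) in both time and
    // newly allocated memory.
    std::vector<StringView> TakeViews(const std::vector<StringView>& values,
                                      const std::vector<int64_t>& indices) {
      std::vector<StringView> out;
      out.reserve(indices.size());
      for (int64_t i : indices) {
        out.push_back(values[static_cast<std::size_t>(i)]);
      }
      return out;
    }

The contrast with the current variable-size binary layout is that Take there has to write a brand new offsets buffer and copy every selected string's bytes into a brand new data buffer, which is where the extra allocation in the benchmark above comes from (roughly 50 MB of copied string bytes plus a new offsets buffer).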
Wes

On Tue, Dec 14, 2021 at 5:08 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Hello,
>
> I think my main concern is how we can prevent the community from fragmenting too much over supported encodings. The more complex the encodings, the less likely they are to be supported by all main implementations. We see this in Parquet, where the efficient "delta" encodings have only just received support in Parquet C++, and even then, only on the read side.
>
> There is an additional subtlety in that Arrow is not a storage mechanism but represents data in memory, so pieces doing computation have to be adapted to the new encodings, for example the entire library of computation kernels in Arrow C++ (of course, an easy but inefficient adaptation is to always unpack to an already supported layout).
>
> As an anecdote, the Arrow C++ kernels are supposed to accept a selection vector to filter their physical inputs, but none actually supports it. I think we should be wary of adding ambitious new features that might never get an actual implementation.
>
> On the detail of the proposed encodings:
>
> - I hope we can avoid storing raw pointers instead of offsets into a separate buffer; I understand the flexibility argument for pointers, but it will also make data transfer more complicated
>
> - Constant arrays are a special case of RLE arrays and I'm not sure doing both is really useful
>
> - I don't really understand the concrete use case for the weird "sequence view" layout; I'll note that non-monotonic offsets can make linear traversal less efficient, since the CPU won't automatically prefetch data for you
>
> - The proposed RLE encoding seems inefficient; usually, RLE encodings try hard to minimize the size overhead of RLE sequences, such that they become beneficial even for very short repeated runs
>
> Regards
>
> Antoine.
>
>
> On 10/12/2021 at 20:28, Wes McKinney wrote:
> >
> > This topic may provoke some debate, but, given that Arrow is approaching its 6-year anniversary, I think this is an important discussion about how we can thoughtfully expand the Arrow specifications to support next-generation columnar data processing. I have been motivated by recent interactions with CWI's DuckDB and Meta's Velox open source projects and the innovations they've made around data representation, which provide beneficial features above and beyond what we already have in Arrow. For example, they have a 16-byte "string view" data type that enables buffer memory reuse, faster "false" comparisons on strings unequal in the first 4 bytes, and inline small strings. Both the Rust and C++ query engine efforts could potentially benefit from this (I am not sure about the memory safety implications in Rust; comments around this would be helpful).
> >
> > I wrote a document to start a discussion about a few new ways to represent data that may help with building Arrow-native/Arrow-compatible query engines:
> >
> > https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
> >
> > Each of these potential additions would eventually need to be split off into independent efforts with associated additions to the columnar specification, IPC format, C ABI, integration tests, and so on.
> >
> > The document is open to anyone to comment, but if anyone would like edit access please feel free to request it, and I look forward to the discussion.
> >
> > Thanks,
> > Wes