> > Big +1 in replacing our current representation of variable-sized arrays by the "sequence view". atm I am -0.5 in adding it without removing the [Large]Utf8Array / Binary / List, as I see the advantages as sufficiently large to break compatibility and deprecate the previous representations (and do not enjoy maintaining multiple similar representations that solve very similar problems).
I don't think we should be deprecating anything in favor of these at a format level. There are many production systems built on the current encodings, and we have publicly documented that such changes will be a rare occurrence [1]. If we do adopt the new encodings, I think it is fine if most systems only want to deal with them at system boundaries and convert to one of the existing encodings.

[1] https://arrow.apache.org/docs/format/Versioning.html#long-term-stability

On Wed, Dec 15, 2021 at 1:35 PM Weston Pace <weston.p...@gmail.com> wrote:

> > I am -0.5 in adding it without removing the [Large]Utf8Array / Binary / List
>
> I'm not sure about dropping List.
>
> Is SequenceView semantically equivalent to List / FixedSizeList? In other words, is SequenceView a nested type? The document seems to suggest it is, but the use case you described does not. For example, in C++ compute today you cannot add List<INT32> + List<INT32>, but I think you would want to be able to add SequenceView<INT32> + SequenceView<INT32>. Also, the size of a List<INT32> is the # of lists and not the # of items; for a SequenceView I think the size of the array would be the number of items. I would also consider it a semantic change to go from Struct{"x": INT32} to Struct{"x": List<INT32>}.
>
> From the use case it sounds more like SequenceView would be similar to dictionary and RLE, a different encoding for existing arrays. However, it is possible I am misreading things.
>
> On Wed, Dec 15, 2021 at 10:49 AM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> >
> > Hi,
> >
> > Thanks a lot for this initiative and the write-up.
> >
> > I did a small bench for the sequence view and added a graph to the document as evidence of what Wes is writing wrt the performance of "selection / take / filter".
> >
> > Big +1 in replacing our current representation of variable-sized arrays by the "sequence view". atm I am -0.5 in adding it without removing the [Large]Utf8Array / Binary / List, as I see the advantages as sufficiently large to break compatibility and deprecate the previous representations (and do not enjoy maintaining multiple similar representations that solve very similar problems).
> >
> > Likewise, +1 for the RLE and -0.5 for the constant array, as the latter seems redundant to me (it is an RLE).
> >
> > Wrt the string view: I would like to run some benches on that too. Could someone clarify what the "good cases" are for that one?
> >
> > More generally, I second the point made by Antoine: there is already some fragmentation over the types in the official implementations (see [1]), and we do not even have a common integration test suite for the C data interface. One approach to this dimension is to *deprecate* representations, which goes in the direction mentioned above.
> >
> > Wrt design, we could consider a separate enum for the RLE vs. plain encoding, as they are not really semantic types (the dictionary is also not a semantic type, but it is represented as one in at least the Rust implementation, unfortunately).
> >
> > Wrt the Rust impl in particular, I do not think that the String View poses a problem - Rust can lay out data according to the C representation. Here [2] is the corresponding Rust code of the struct in the doc (generated via Rust's bindgen [3]).
> >
> > Thanks again for this, looking very much forward to it!
> >
> > [1] https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1546
> > [2] https://github.com/DataEngineeringLabs/arrow-string-view/blob/main/src/string_view.rs
> > [3] https://rust-lang.github.io/rust-bindgen/command-line-usage.html
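[Since several replies turn on what a "sequence view" actually is, here is a guessed sketch of its per-row entry, inferred only from the thread's mentions of views and non-monotonic offsets; the proposal document is the authoritative definition and may differ.]

    #include <cstdint>

    // Each row of a sequence view is a fixed-size (offset, length)
    // pair into a shared child array. Offsets need not be monotonic,
    // which is what would allow out-of-order construction and
    // O(# records) take/filter, at the cost of less predictable
    // linear traversal (see Antoine's prefetching point below).
    struct SequenceViewEntry {
      int32_t offset;  // start index into the shared child array
      int32_t length;  // number of child elements in this row
    };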
> > On Wed, Dec 15, 2021 at 3:15 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > Ultimately, the problem comes down to providing a means of O(# records) selection (take, filter) performance and memory use for non-numeric data (strings, arrays, maps, etc.).
> > >
> > > DuckDB and Velox are two projects which have designed themselves to be very nearly Arrow-compatible but have implemented alternative memory layouts to achieve O(# records) selections on all data types. I am proposing to adopt these innovations as additional memory layouts in Arrow, with a target of zero-copy across the C ABI — how exactly they are translated to the IPC format seems less of an immediate benefit than enabling the in-memory performance/memory-use optimization, since query engines can accelerate performance with faster selections. If there are alternative proposals to achieve O(# records) time and space complexity for selection operations, let's definitely look at them.
> > >
> > > On Tue, Dec 14, 2021 at 8:02 PM Weston Pace <weston.p...@gmail.com> wrote:
> > > >
> > > > Would it be simpler to change the spec so that child arrays can be chunked? This might reduce the data type growth and make the intent clearer.
> > > >
> > > > This will add another dimension to performance analysis. We pretty regularly get issues/tickets from users who have unknowingly created Parquet files with poor row group resolution (e.g. 50 rows per row group) and experience rotten performance as a result. I suspect something similar could happen here. It sounds like arrays will naturally subdivide over time. Users might start seeing poor performance without realizing that the root cause is their 1 million element array having been split into 10,000 allocations of 100 elements. However, I suspect this is something that could be managed with visibility and recompaction utilities.
> > > >
> > > > On Tue, Dec 14, 2021 at 1:22 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > > hi folks,
> > > > >
> > > > > A few things in the general discussion, before certain things will have to be split off into their own dedicated discussions.
> > > > >
> > > > > It seems that I didn't do a very good job of motivating the "sequence view" type. Let me take a step back and discuss one of the problems these new memory layouts are solving.
> > > > >
> > > > > In Arrow currently, selection operations ("take", "filter", or indirect sort — the equivalent of arr.take(argsort(something_else)) if you're coming from NumPy) have time complexity proportional to the number of records for primitive types, and proportional to max(# records, memory size) for nested types.
> > > > >
> > > > > So, for example:
> > > > >
> > > > > * Take(arr, indices) has O(# records) complexity for primitive types and does O(# records) memory allocation
> > > > > * Take(arr, indices) has O(max(# records, size of memory buffers / child arrays)) complexity for strings and nested types and does O(size of memory buffers) memory allocation
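[To make the second bullet concrete: a minimal sketch, assuming the conventional offsets-plus-data string layout and ignoring validity bitmaps; StringArray and TakeStrings are illustrative names, not Arrow's actual API.]

    #include <cstdint>
    #include <string>
    #include <vector>

    struct StringArray {
      std::vector<int32_t> offsets;  // length + 1 entries
      std::string data;              // concatenated character data
    };

    StringArray TakeStrings(const StringArray& arr,
                            const std::vector<int64_t>& indices) {
      StringArray out;
      out.offsets.reserve(indices.size() + 1);
      out.offsets.push_back(0);
      for (int64_t i : indices) {
        const int32_t begin = arr.offsets[i];
        const int32_t length = arr.offsets[i + 1] - begin;
        // Every selected value's bytes are copied into a fresh buffer,
        // so the cost scales with total string bytes, not row count.
        out.data.append(arr.data, begin, length);
        out.offsets.push_back(static_cast<int32_t>(out.data.size()));
      }
      return out;
    }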
> > > > > This means that columnar query engines that leverage selections can incur heavy costs in both processing time and memory use when doing selections on non-primitive array data. Selections may arise from filtering or sorting or other operations.
> > > > >
> > > > > The "string view" and "sequence view" memory layouts in this document do not have this problem. When using these for strings and nested data, they have the same time complexity and memory allocation behavior for selections as primitive types, and the "child" memory buffers do not have to be manipulated or rebuilt at all. This yields significant performance benefits and reduced memory use.
> > > > >
> > > > > Additionally, the string view and sequence view layouts solve the problem of out-of-order construction. As has been pointed out, one way to work around this issue at present is to use "chunked arrays". However, this means that you cannot ever use thread parallelism in the construction of non-chunked outputs with nested data (for example, in expression evaluation) — if a nested array forms part of a record batch, then either you must stick to single-threaded execution or use thread parallelism to subdivide even the non-nested fields of the record batch, to obtain equal-sized arrays across all fields. For example, if you had a record batch with 32K rows and wanted to parallelize execution of a projection using 4 threads, you would need to divide all fields into chunks of 8K each prior to beginning to produce outputs. This is fairly inflexible.
> > > > >
> > > > > As another motivating example, consider a parallel selection operation (e.g. "take" or "filter") on a nested array. Currently it is not possible to parallelize at all because of the in-order construction requirement.
> > > > >
> > > > > I don't expect you to just trust me — here is an example:
> > > > >
> > > > > https://gist.github.com/wesm/25fc7b877f913c7e4449117178302646
> > > > >
> > > > > In this example, I use Take to permute 1M doubles and 1M strings of 50 bytes each:
> > > > >
> > > > > * Doubles: 2.45 ms (new memory allocated: 8,000,000 bytes)
> > > > > * Strings: 39.6 ms (new memory allocated: 54,000,000 bytes)
> > > > >
> > > > > The performance ratio is 16x even though the memory ratio is only ~7x. With the "StringView" data type, only 16,000,000 bytes of new memory would need to be allocated, and the performance should be only 2-4x slower than the doubles case (because we only need to relocate a bunch of 16-byte structs) instead of 16x slower.
> > > > >
> > > > > I hope you can see now that this can be a rather serious resource utilization issue, both in processing time and memory use. I will update the document to explain this better and work on responding to some of the other comments.
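[For contrast, a sketch of the same take over 16-byte string views. The struct loosely follows the one described in the proposal document, but the field names and TakeViews are illustrative assumptions, not the document's exact definition.]

    #include <cstdint>
    #include <vector>

    // 16-byte view: small strings live inline; longer strings point
    // into a shared data buffer that selections never copy.
    struct StringView {
      int32_t length;
      union {
        char inlined[12];         // whole string, when length <= 12
        struct {
          char prefix[4];         // first 4 bytes, kept inline
          int32_t buffer_index;   // which shared data buffer
          int32_t offset;         // position within that buffer
        } ref;                    // used when length > 12
      };
    };
    static_assert(sizeof(StringView) == 16, "views must be 16 bytes");

    std::vector<StringView> TakeViews(const std::vector<StringView>& views,
                                      const std::vector<int64_t>& indices) {
      std::vector<StringView> out;
      out.reserve(indices.size());
      for (int64_t i : indices) {
        out.push_back(views[i]);  // relocates 16 bytes, nothing else
      }
      return out;
    }

Under such a layout, permuting the 1M strings above allocates 1M x 16 = 16,000,000 bytes of views, which is where the figure in the email comes from.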
> > > > > Wes
> > > > >
> > > > > On Tue, Dec 14, 2021 at 5:08 AM Antoine Pitrou <anto...@python.org> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I think my main concern is how we can prevent the community from fragmenting too much over supported encodings. The more complex the encodings, the less likely they are to be supported by all main implementations. We see this in Parquet, where the efficient "delta" encodings have only just received support in Parquet C++, and even then only on the read side.
> > > > > >
> > > > > > There is an additional subtlety in that Arrow is not a storage mechanism but represents data in memory, so the pieces doing computation have to be adapted to the new encodings, for example the entire library of computation kernels in Arrow C++ (of course, an easy but inefficient adaptation is to always unpack to an already supported layout).
> > > > > >
> > > > > > As an anecdote, the Arrow C++ kernels are supposed to accept a selection vector to filter their physical inputs, but none actually supports it. I think we should be wary of adding ambitious new features that might never get an actual implementation.
> > > > > >
> > > > > > On the details of the proposed encodings:
> > > > > >
> > > > > > - I hope we can avoid storing raw pointers instead of offsets into a separate buffer; I understand the flexibility argument for pointers, but they will also make data transfer more complicated
> > > > > >
> > > > > > - Constant arrays are a special case of RLE arrays, and I'm not sure doing both is really useful
> > > > > >
> > > > > > - I don't really understand the concrete use case for the weird "sequence view" layout; I'll note that non-monotonic offsets can make linear traversal less efficient, since the CPU won't automatically prefetch data for you
> > > > > >
> > > > > > - The proposed RLE encoding seems inefficient; usually, RLE encodings try hard to minimize the size overhead of RLE sequences, such that they become beneficial even for very short repeated runs
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > > On 10/12/2021 at 20:28, Wes McKinney wrote:
> > > > > > >
> > > > > > > This topic may provoke strong reactions, but, given that Arrow is approaching its 6-year anniversary, I think this is an important discussion about how we can thoughtfully expand the Arrow specifications to support next-generation columnar data processing. I have recently been motivated by interactions with CWI's DuckDB and Meta's Velox open source projects and the innovations they've made around data representation, which provide beneficial features above and beyond what we already have in Arrow. For example, they have a 16-byte "string view" data type that enables buffer memory reuse, faster "false" comparisons on strings unequal in the first 4 bytes, and inline small strings.
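[A sketch of the fast "false" comparison just mentioned, reusing the StringView struct from the sketch above plus #include <cstring>; PrefixesMatch is an assumed helper, not part of either project's API, and it assumes unused inline bytes are zeroed.]

    // Both representations keep the first 4 characters inline in the
    // view, so unequal strings are usually rejected without touching
    // any out-of-line character buffer.
    bool PrefixesMatch(const StringView& a, const StringView& b) {
      if (a.length != b.length) return false;  // cheap length check first
      return std::memcmp(a.inlined, b.inlined, 4) == 0;
    }
    // A prefix match only means "possibly equal": the remaining bytes
    // still need a full comparison, but mismatches never touch buffers.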
> > > > > > > Both the Rust and C++ query engine efforts could potentially benefit from this (I am not sure about the memory safety implications in Rust; comments around this would be helpful).
> > > > > > >
> > > > > > > I wrote a document to start a discussion about a few new ways to represent data that may help with building Arrow-native/Arrow-compatible query engines:
> > > > > > >
> > > > > > > https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
> > > > > > >
> > > > > > > Each of these potential additions would eventually need to be split off into an independent effort with associated additions to the columnar specification, IPC format, C ABI, integration tests, and so on.
> > > > > > >
> > > > > > > The document is open to anyone to comment, but if anyone would like edit access please feel free to request it, and I look forward to the discussion.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Wes