Hi,

Thanks a lot for this initiative and the write-up.

I did a small bench for the sequence view and added a graph to the document
as evidence of what Wes writes regarding the performance of "selection /
take / filter".
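As a rough illustration of the asymmetry being benchmarked (a hypothetical
sketch with names of my own, not the actual arrow-rs/arrow2 code): with the
current offsets-based layout, "take" on strings must rebuild the values
buffer byte by byte, while "take" on fixed-width data moves one fixed-size
element per index.

```rust
// Sketch: why selection is cheap for fixed-width data but expensive
// for offsets-based strings. All names here are illustrative.

fn take_fixed(values: &[f64], indices: &[usize]) -> Vec<f64> {
    // O(#indices) work and allocation: one copy per selected element.
    indices.iter().map(|&i| values[i]).collect()
}

fn take_strings(
    offsets: &[usize],
    values: &[u8],
    indices: &[usize],
) -> (Vec<usize>, Vec<u8>) {
    // O(#indices + selected bytes): the values buffer must be rebuilt,
    // so cost scales with the total size of the selected strings.
    let mut new_offsets = Vec::with_capacity(indices.len() + 1);
    let mut new_values = Vec::new();
    new_offsets.push(0);
    for &i in indices {
        new_values.extend_from_slice(&values[offsets[i]..offsets[i + 1]]);
        new_offsets.push(new_values.len());
    }
    (new_offsets, new_values)
}

fn main() {
    println!("{:?}", take_fixed(&[1.0, 2.5, 3.5], &[2, 0])); // [3.5, 1.0]
    let offsets = vec![0, 5, 10];
    let values = b"helloworld".to_vec();
    let (o, v) = take_strings(&offsets, &values, &[1, 0]);
    println!("{:?} {}", o, String::from_utf8(v).unwrap()); // [0, 5, 10] worldhello
}
```

With a view layout, the string case would instead move one 16-byte view
struct per index, like the fixed-width case.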

Big +1 on replacing our current representation of variable-sized arrays
with the "sequence view". At the moment I am -0.5 on adding it without
removing [Large]Utf8Array / Binary / List, as I see the advantages as
sufficiently large to justify breaking compatibility and deprecating the
previous representations (and I do not enjoy maintaining multiple similar
representations that solve very similar problems).

Likewise, +1 for the RLE and -0.5 for the constant array, as the latter
seems redundant to me (it is an RLE with a single run).
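To make the redundancy concrete, here is a minimal sketch (names and
structure are mine, not from the proposal) of a constant array expressed as
an RLE array with exactly one run:

```rust
/// Illustrative RLE array: run_ends[k] is the exclusive end index of
/// run k, and values[k] is the value repeated over that run.
struct RleArray<T> {
    run_ends: Vec<u32>,
    values: Vec<T>,
}

/// A constant array of length `len` is just an RLE array with one run.
fn constant<T>(value: T, len: u32) -> RleArray<T> {
    RleArray { run_ends: vec![len], values: vec![value] }
}

fn get<T: Copy>(arr: &RleArray<T>, i: u32) -> T {
    // Binary search for the run containing logical index i.
    let run = arr.run_ends.partition_point(|&end| end <= i);
    arr.values[run]
}

fn main() {
    let c = constant(7i64, 1_000_000);
    println!("{} {}", get(&c, 0), get(&c, 999_999)); // 7 7
}
```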

Regarding the string view: I would like to run some benches on that too.
Could someone clarify what the "good cases" are for that one?

More generally, I second the point made by Antoine: there is already some
fragmentation over the types in the official implementations (see [1]), and
we do not even have a common integration test suite for the C Data
Interface. One approach to this problem is to *deprecate*
representations, which goes in the direction mentioned above.

Regarding the design, we could consider a separate enum for the RLE vs
plain encoding, as they are not really semantic types (the dictionary is
also not a semantic type but it is represented as one in at least the Rust
implementation, unfortunately).

Regarding the Rust implementation in particular, I do not think that the
string view poses a problem: Rust can lay out structs according to the C
representation. Here [2] is the corresponding Rust code of the struct in
the doc (generated via Rust's bindgen [3]).
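For reference, a minimal hand-written sketch of such a #[repr(C)] struct
(field names are illustrative and mine; [2] has the actual bindgen output):

```rust
use std::mem::size_of;

/// Hypothetical 16-byte string view: a 4-byte length, a 4-byte prefix
/// for cheap "false" comparisons, and 8 bytes that hold either the
/// remaining inline bytes (for strings of at most 12 bytes total) or a
/// pointer/offset into a shared data buffer.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct StringView {
    pub len: u32,
    pub prefix: [u8; 4],
    pub payload: u64,
}

fn main() {
    // repr(C) guarantees C-compatible field ordering and padding, so
    // this struct matches the layout bindgen derives from the C header.
    println!("{}", size_of::<StringView>()); // prints "16"
}
```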

Thanks again for this, looking very much forward to it!

[1]
https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1546
[2]
https://github.com/DataEngineeringLabs/arrow-string-view/blob/main/src/string_view.rs
[3] https://rust-lang.github.io/rust-bindgen/command-line-usage.html


On Wed, Dec 15, 2021 at 3:15 AM Wes McKinney <wesmck...@gmail.com> wrote:

> Ultimately, the problem comes down to providing a means of O(#
> records) selection (take, filter) performance and memory use for
> non-numeric data (strings, arrays, maps, etc.).
>
> DuckDB and Velox are two projects which have designed themselves to be
> very nearly Arrow-compatible but have implemented alternative memory
> layouts to achieve O(# records) selections on all data types. I am
> proposing to adopt these innovations as additional memory layouts in
> Arrow with a target of zero-copy across the C ABI — how exactly they
> are translated to the IPC format seems less of an immediate benefit
> than enabling the in-memory performance/memory use optimization since
> query engines can accelerate performance with faster selections. If
> there are some alternative proposals to achieve O(# records) time and
> space complexity for selection operations, let's definitely look at
> them.
>
>
> On Tue, Dec 14, 2021 at 8:02 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > Would it be simpler to change the spec so that child arrays can be
> > chunked?  This might reduce the data type growth and make the intent
> > more clear.
> >
> > This will add another dimension to performance analysis.  We pretty
> > regularly get issues/tickets from users that have unknowingly created
> > parquet files with poor row group resolution (e.g. 50 rows per row
> > group) and experience rotten performance as a result.  I suspect
> > something similar could happen here.  It sounds like arrays will
> > naturally subdivide over time.  Users might start seeing poor
> > performance without realizing the root cause is because their 1
> > million element array has been split into 10,000 allocations of 100
> > elements.  However, I suspect this is something that could be managed
> > with visibility and recompaction utilities.
> >
> >
> > On Tue, Dec 14, 2021 at 1:22 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > >
> > > hi folks,
> > >
> > > A few things in the general discussion, before certain things will
> > > have to be split off into their own dedicated discussions.
> > >
> > > It seems that I didn't do a very good job of motivating the "sequence
> > > view" type. Let me take a step back and discuss one of the problems
> > > these new memory layouts are solving.
> > >
> > > In Arrow currently, selection operations ("take", "filter", or
> > > indirect sort — the equivalent of arr.take(argsort(something_else)) if
> > > you're coming from NumPy) have time complexity proportional to the
> > > number of records for primitive types and complexity proportional to
> > > the greater of max(# records, memory size) for nested types.
> > >
> > > So, for example:
> > >
> > > * Take(arr, indices) has O(# records) complexity for primitive types
> > > and does O(# records) memory allocation
> > > * Take(arr, indices) has O(max(# records, size of memory buffers /
> > > child arrays)) complexity for strings and nested types and does O(size
> > > of memory buffers) memory allocation
> > >
> > > This means that columnar query engines that leverage selections can
> > > experience heavy costs both in time complexity and memory use when
> > > doing selections on non-primitive array data. Selections may arise
> > > from filtering or sorting or other operations.
> > >
> > > The "String view" and "Sequence view" memory layouts in this document
> > > do not have this problem. When using these for strings and nested
> > > data, they have the same time complexity and memory allocation
> > > behavior for selections as primitive types, and the "child" memory
> > > buffers do not have to be manipulated or rebuilt at all. This has
> > > significant performance benefits and reduced memory use.
> > >
> > > Additionally, the string view and sequence view layouts solve the
> > > problem of out-of-order construction. As has been pointed out, one way
> > > to work around this issue at present is to use "chunked arrays".
> > > However, this means that you cannot ever use thread parallelism in the
> > > construction of non-chunked outputs with nested data (for example, in
> > > expression evaluation) — if a nested array forms part of a record
> > > batch, then either you must stick to single-threaded execution or use
> > > thread parallelism to subdivide even the other fields of the record
> > > batch that are non-nested to obtain equal-sized arrays across all
> > > fields. For example, if you had a record batch with 32K rows and
> > > wanted to parallelize execution of a projection using 4 threads — you
> > > would need to divide all fields into chunks of 8K each prior to
> > > beginning to produce outputs. This is fairly inflexible.
> > >
> > > As another motivating example, consider a parallel selection operation
> > > (e.g. "take" or "filter") on a nested array. Currently it is not
> > > possible to parallelize at all because of the in-order construction
> > > requirement.
> > >
> > > I don't expect you to just trust me — here is an example:
> > >
> > > https://gist.github.com/wesm/25fc7b877f913c7e4449117178302646
> > >
> > > In this example, I use Take to permute 1M doubles and 1M strings with
> > > 50 bytes each
> > >
> > > * Doubles: 2.45ms (new memory allocated: 8000000)
> > > * Strings: 39.6ms (new memory allocated: 54000000)
> > >
> > > The performance ratio is 16x even though the memory ratio is only ~7x.
> > > With the "StringView" data type, only 16000000 bytes of new memory
> > > would need to be allocated, and the performance should be only 2-4x
> > > slower than the doubles case (because we only need to relocate a bunch
> > > of 16-byte structs) instead of 16x slower.
> > >
> > > I hope you can see now that this can be a rather serious resource
> > > utilization issue, both in processing time and memory use. I will
> > > update the document to explain this better and work on responding to
> > > some of the other comments.
> > >
> > > Wes
> > >
> > > On Tue, Dec 14, 2021 at 5:08 AM Antoine Pitrou <anto...@python.org>
> wrote:
> > > >
> > > >
> > > > Hello,
> > > >
> > > > I think my main concern is how we can prevent the community from
> > > > fragmenting too much over supported encodings.  The more complex the
> > > > encodings, the less likely they are to be supported by all main
> > > > implementations.  We see this in Parquet where the efficient "delta"
> > > > encodings have just received support in Parquet C++, and even, only
> on
> > > > the read side.
> > > >
> > > > There is an additional subtlety in that Arrow is not a storage
> mechanism
> > > > but it represents data in memory, so pieces doing computation have
> to be
> > > > adapted to the new encodings, for example the entire library of
> > > > computation kernels in Arrow C++ (of course, an easy but inefficient
> > > > adaptation is to always unpack to an already supported layout).
> > > >
> > > > As an anecdote, the Arrow C++ kernels are supposed to accept a
> selection
> > > > vector to filter their physical inputs, but none actually supports
> it.
> > > > I think we should be wary of adding ambitious new features that might
> > > > never get an actual implementation.
> > > >
> > > >
> > > > On the detail of the proposed encodings:
> > > >
> > > > - I hope we can avoid storing raw pointers instead of offsets into a
> > > > separate buffer; I understand the flexibility argument for pointers
> but
> > > > it will also make data transfer more complicated
> > > >
> > > > - Constant arrays are a special case of RLE arrays and I'm not sure
> > > > doing both is really useful
> > > >
> > > > - I don't really understand the concrete use case for the weird
> > > > "sequence view" layout; I'll note that non-monotonic offsets can make
> > > > linear traversal less efficient, since the CPU won't automatically
> > > > prefetch data for you
> > > >
> > > > - The proposed RLE encoding seems inefficient; usually, RLE encodings
> > > > try hard to minimize the size overhead of RLE sequences, such that
> they
> > > > become beneficial even for very short repeated runs
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >
> > > >
> > > > Le 10/12/2021 à 20:28, Wes McKinney a écrit :
> > > > >
> > > > > This topic may provoke some debate, but, given that Arrow is approaching its
> > > > > 6-year anniversary, I think this is an important discussion about
> how
> > > > > we can thoughtfully expand the Arrow specifications to support
> > > > > next-generation columnar data processing. In recent times, I have
> been
> > > > > motivated by recent interactions with CWI's DuckDB and Meta's Velox
> > > > > open source projects and the innovations they've made around data
> > > > > representation providing beneficial features above and beyond what
> we
> > > > > have already in Arrow. For example, they have a 16-byte "string
> view"
> > > > > data type that enables buffer memory reuse, faster "false"
> comparisons
> > > > > on strings unequal in the first 4 bytes, and inline small strings.
> > > > > Both the Rust and C++ query engine efforts could potentially
> benefit
> > > > > from this (not sure about the memory safety implications in Rust,
> > > > > comments around this would be helpful).
> > > > >
> > > > > I wrote a document to start a discussion about a few new ways to
> > > > > represent data that may help with building
> > > > > Arrow-native/Arrow-compatible query engines:
> > > > >
> > > > >
> https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
> > > > >
> > > > > Each of these potential additions would need to be eventually split
> > > > > off into independent efforts with associated additions to the
> columnar
> > > > > specification, IPC format, C ABI, integration tests, and so on.
> > > > >
> > > > > The document is open to anyone to comment but if anyone would like
> > > > > edit access please feel free to request and I look forward to the
> > > > > discussion.
> > > > >
> > > > > Thanks,
> > > > > Wes
> > > > >
>
