> > Big +1 in replacing our current representation of variable-sized arrays by the "sequence view". atm I am -0.5 in adding it without removing the [Large]Utf8Array / Binary / List, as I see the advantages as sufficiently large to break compatibility and deprecate the previous representations (and do not enjoy maintaining multiple similar representations that solve very similar problems).
I don't think we should be deprecating anything in favor of these at a format level. There are many production systems built on the current encodings, and we have publicly documented that such changes will be a rare occurrence [1]. If we do adopt the new encodings, I think it is fine if most systems only want to deal with them at system boundaries and convert to one of the existing encodings.

[1] https://arrow.apache.org/docs/format/Versioning.html#long-term-stability

On Wed, Dec 15, 2021 at 1:35 PM Weston Pace <weston.p...@gmail.com> wrote:

> > I am -0.5 in adding it without removing the [Large]Utf8Array / Binary / List
>
> I'm not sure about dropping List.
>
> Is SequenceView semantically equivalent to List / FixedSizeList? In other words, is SequenceView a nested type? The document seems to suggest it is, but the use case you described does not. For example, in C++ compute today you cannot add List<INT32> + List<INT32>, but I think you would want to be able to add SequenceView<INT32> + SequenceView<INT32>. Also, the size of a List<INT32> is the # of lists and not the # of items; for a SequenceView I think the size of the array would be the number of items. I would also consider it a semantic change to go from Struct{"x": INT32} to Struct{"x": List<INT32>}.
>
> From the use case it sounds more like SequenceView would be similar to dictionary and RLE, a different encoding for existing arrays. However, it is possible I am misreading things.
>
> On Wed, Dec 15, 2021 at 10:49 AM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> >
> > Hi,
> >
> > Thanks a lot for this initiative and the write-up.
> >
> > I did a small bench for the sequence view and added a graph to the document as evidence of what Wes is writing wrt the performance of "selection / take / filter".
> >
> > Big +1 in replacing our current representation of variable-sized arrays by the "sequence view". atm I am -0.5 in adding it without removing the [Large]Utf8Array / Binary / List, as I see the advantages as sufficiently large to break compatibility and deprecate the previous representations (and do not enjoy maintaining multiple similar representations that solve very similar problems).
> >
> > Likewise, +1 for the RLE and -0.5 for the constant array, as the latter seems redundant to me (it is an RLE).
> >
> > Wrt the string view: I would like to run some benches on that too. Could someone clarify what the "good cases" are for that one?
> >
> > More generally, I second the point made by Antoine: there is already some fragmentation over the types in the official implementations (see [1]), and we do not even have a common integration test suite for the C data interface. One approach to this dimension is to *deprecate* representations, which goes in the direction mentioned above.
> >
> > Wrt design, we could consider a separate enum for the RLE vs. plain encoding, as they are not really semantic types (the dictionary is also not a semantic type, but it is represented as one in at least the Rust implementation, unfortunately).
> >
> > Wrt the Rust impl in particular, I do not think that the String View poses a problem - Rust can lay out data according to the C representation. Here [2] is the corresponding Rust code of the struct in the doc (generated via Rust's bindgen [3]).
> >
> > Thanks again for this, looking very much forward to it!
> >
> > [1] https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1546
> > [2] https://github.com/DataEngineeringLabs/arrow-string-view/blob/main/src/string_view.rs
> > [3] https://rust-lang.github.io/rust-bindgen/command-line-usage.html
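[Since several replies turn on what a "sequence view" actually is, here is a guessed sketch of its per-row entry, inferred only from the thread's mentions of views and non-monotonic offsets; the proposal document is the authoritative definition and may differ.]

    #include <cstdint>

    // Each row of a sequence view is a fixed-size (offset, length)
    // pair into a shared child array. Offsets need not be monotonic,
    // which is what would allow out-of-order construction and
    // O(# records) take/filter, at the cost of less predictable
    // linear traversal (see Antoine's prefetching point below).
    struct SequenceViewEntry {
      int32_t offset;  // start index into the shared child array
      int32_t length;  // number of child elements in this row
    };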
> > On Wed, Dec 15, 2021 at 3:15 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > Ultimately, the problem comes down to providing a means of O(# records) selection (take, filter) performance and memory use for non-numeric data (strings, arrays, maps, etc.).
> > >
> > > DuckDB and Velox are two projects which have designed themselves to be very nearly Arrow-compatible but have implemented alternative memory layouts to achieve O(# records) selections on all data types. I am proposing to adopt these innovations as additional memory layouts in Arrow, with a target of zero-copy across the C ABI — how exactly they are translated to the IPC format seems less of an immediate benefit than enabling the in-memory performance/memory-use optimization, since query engines can accelerate performance with faster selections. If there are alternative proposals to achieve O(# records) time and space complexity for selection operations, let's definitely look at them.
> > >
> > > On Tue, Dec 14, 2021 at 8:02 PM Weston Pace <weston.p...@gmail.com> wrote:
> > > >
> > > > Would it be simpler to change the spec so that child arrays can be chunked? This might reduce the data type growth and make the intent clearer.
> > > >
> > > > This will add another dimension to performance analysis. We pretty regularly get issues/tickets from users who have unknowingly created Parquet files with poor row group resolution (e.g. 50 rows per row group) and experience rotten performance as a result. I suspect something similar could happen here. It sounds like arrays will naturally subdivide over time. Users might start seeing poor performance without realizing that the root cause is their 1 million element array having been split into 10,000 allocations of 100 elements. However, I suspect this is something that could be managed with visibility and recompaction utilities.
> > > >
> > > > On Tue, Dec 14, 2021 at 1:22 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > > hi folks,
> > > > >
> > > > > A few things in the general discussion, before certain things will have to be split off into their own dedicated discussions.
> > > > >
> > > > > It seems that I didn't do a very good job of motivating the "sequence view" type. Let me take a step back and discuss one of the problems these new memory layouts are solving.
> > > > >
> > > > > In Arrow currently, selection operations ("take", "filter", or indirect sort — the equivalent of arr.take(argsort(something_else)) if you're coming from NumPy) have time complexity proportional to the number of records for primitive types, and proportional to max(# records, memory size) for nested types.
> > > > >
> > > > > So, for example:
> > > > >
> > > > > * Take(arr, indices) has O(# records) complexity for primitive types and does O(# records) memory allocation
> > > > > * Take(arr, indices) has O(max(# records, size of memory buffers / child arrays)) complexity for strings and nested types and does O(size of memory buffers) memory allocation
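[To make the second bullet concrete: a minimal sketch, assuming the conventional offsets-plus-data string layout and ignoring validity bitmaps; StringArray and TakeStrings are illustrative names, not Arrow's actual API.]

    #include <cstdint>
    #include <string>
    #include <vector>

    struct StringArray {
      std::vector<int32_t> offsets;  // length + 1 entries
      std::string data;              // concatenated character data
    };

    StringArray TakeStrings(const StringArray& arr,
                            const std::vector<int64_t>& indices) {
      StringArray out;
      out.offsets.reserve(indices.size() + 1);
      out.offsets.push_back(0);
      for (int64_t i : indices) {
        const int32_t begin = arr.offsets[i];
        const int32_t length = arr.offsets[i + 1] - begin;
        // Every selected value's bytes are copied into a fresh buffer,
        // so the cost scales with total string bytes, not row count.
        out.data.append(arr.data, begin, length);
        out.offsets.push_back(static_cast<int32_t>(out.data.size()));
      }
      return out;
    }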
> > > > > This means that columnar query engines that leverage selections can incur heavy costs in both processing time and memory use when doing selections on non-primitive array data. Selections may arise from filtering or sorting or other operations.
> > > > >
> > > > > The "string view" and "sequence view" memory layouts in this document do not have this problem. When using these for strings and nested data, they have the same time complexity and memory allocation behavior for selections as primitive types, and the "child" memory buffers do not have to be manipulated or rebuilt at all. This yields significant performance benefits and reduced memory use.
> > > > >
> > > > > Additionally, the string view and sequence view layouts solve the problem of out-of-order construction. As has been pointed out, one way to work around this issue at present is to use "chunked arrays". However, this means that you cannot ever use thread parallelism in the construction of non-chunked outputs with nested data (for example, in expression evaluation) — if a nested array forms part of a record batch, then either you must stick to single-threaded execution or use thread parallelism to subdivide even the non-nested fields of the record batch, to obtain equal-sized arrays across all fields. For example, if you had a record batch with 32K rows and wanted to parallelize execution of a projection using 4 threads, you would need to divide all fields into chunks of 8K each prior to beginning to produce outputs. This is fairly inflexible.
> > > > >
> > > > > As another motivating example, consider a parallel selection operation (e.g. "take" or "filter") on a nested array. Currently it is not possible to parallelize at all because of the in-order construction requirement.
> > > > >
> > > > > I don't expect you to just trust me — here is an example:
> > > > >
> > > > > https://gist.github.com/wesm/25fc7b877f913c7e4449117178302646
> > > > >
> > > > > In this example, I use Take to permute 1M doubles and 1M strings of 50 bytes each:
> > > > >
> > > > > * Doubles: 2.45 ms (new memory allocated: 8,000,000 bytes)
> > > > > * Strings: 39.6 ms (new memory allocated: 54,000,000 bytes)
> > > > >
> > > > > The performance ratio is 16x even though the memory ratio is only ~7x. With the "StringView" data type, only 16,000,000 bytes of new memory would need to be allocated, and the performance should be only 2-4x slower than the doubles case (because we only need to relocate a bunch of 16-byte structs) instead of 16x slower.
> > > > >
> > > > > I hope you can see now that this can be a rather serious resource utilization issue, both in processing time and memory use. I will update the document to explain this better and work on responding to some of the other comments.
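[For contrast, a sketch of the same take over 16-byte string views. The struct loosely follows the one described in the proposal document, but the field names and TakeViews are illustrative assumptions, not the document's exact definition.]

    #include <cstdint>
    #include <vector>

    // 16-byte view: small strings live inline; longer strings point
    // into a shared data buffer that selections never copy.
    struct StringView {
      int32_t length;
      union {
        char inlined[12];         // whole string, when length <= 12
        struct {
          char prefix[4];         // first 4 bytes, kept inline
          int32_t buffer_index;   // which shared data buffer
          int32_t offset;         // position within that buffer
        } ref;                    // used when length > 12
      };
    };
    static_assert(sizeof(StringView) == 16, "views must be 16 bytes");

    std::vector<StringView> TakeViews(const std::vector<StringView>& views,
                                      const std::vector<int64_t>& indices) {
      std::vector<StringView> out;
      out.reserve(indices.size());
      for (int64_t i : indices) {
        out.push_back(views[i]);  // relocates 16 bytes, nothing else
      }
      return out;
    }

Under such a layout, permuting the 1M strings above allocates 1M x 16 = 16,000,000 bytes of views, which is where the figure in the email comes from.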
> > > > > Wes
> > > > >
> > > > > On Tue, Dec 14, 2021 at 5:08 AM Antoine Pitrou <anto...@python.org> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I think my main concern is how we can prevent the community from fragmenting too much over supported encodings. The more complex the encodings, the less likely they are to be supported by all main implementations. We see this in Parquet, where the efficient "delta" encodings have only just received support in Parquet C++, and even then only on the read side.
> > > > > >
> > > > > > There is an additional subtlety in that Arrow is not a storage mechanism but represents data in memory, so the pieces doing computation have to be adapted to the new encodings, for example the entire library of computation kernels in Arrow C++ (of course, an easy but inefficient adaptation is to always unpack to an already supported layout).
> > > > > >
> > > > > > As an anecdote, the Arrow C++ kernels are supposed to accept a selection vector to filter their physical inputs, but none actually supports it. I think we should be wary of adding ambitious new features that might never get an actual implementation.
> > > > > >
> > > > > > On the details of the proposed encodings:
> > > > > >
> > > > > > - I hope we can avoid storing raw pointers instead of offsets into a separate buffer; I understand the flexibility argument for pointers, but they will also make data transfer more complicated
> > > > > >
> > > > > > - Constant arrays are a special case of RLE arrays, and I'm not sure doing both is really useful
> > > > > >
> > > > > > - I don't really understand the concrete use case for the weird "sequence view" layout; I'll note that non-monotonic offsets can make linear traversal less efficient, since the CPU won't automatically prefetch data for you
> > > > > >
> > > > > > - The proposed RLE encoding seems inefficient; usually, RLE encodings try hard to minimize the size overhead of RLE sequences, such that they become beneficial even for very short repeated runs
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > > On 10/12/2021 at 20:28, Wes McKinney wrote:
> > > > > > >
> > > > > > > This topic may provoke strong reactions, but, given that Arrow is approaching its 6-year anniversary, I think this is an important discussion about how we can thoughtfully expand the Arrow specifications to support next-generation columnar data processing. I have recently been motivated by interactions with CWI's DuckDB and Meta's Velox open source projects and the innovations they've made around data representation, which provide beneficial features above and beyond what we already have in Arrow. For example, they have a 16-byte "string view" data type that enables buffer memory reuse, faster "false" comparisons on strings unequal in the first 4 bytes, and inline small strings.
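[A sketch of the fast "false" comparison just mentioned, reusing the StringView struct from the sketch above plus #include <cstring>; PrefixesMatch is an assumed helper, not part of either project's API, and it assumes unused inline bytes are zeroed.]

    // Both representations keep the first 4 characters inline in the
    // view, so unequal strings are usually rejected without touching
    // any out-of-line character buffer.
    bool PrefixesMatch(const StringView& a, const StringView& b) {
      if (a.length != b.length) return false;  // cheap length check first
      return std::memcmp(a.inlined, b.inlined, 4) == 0;
    }
    // A prefix match only means "possibly equal": the remaining bytes
    // still need a full comparison, but mismatches never touch buffers.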
> > > > > > > Both the Rust and C++ query engine efforts could potentially benefit from this (I am not sure about the memory safety implications in Rust; comments around this would be helpful).
> > > > > > >
> > > > > > > I wrote a document to start a discussion about a few new ways to represent data that may help with building Arrow-native/Arrow-compatible query engines:
> > > > > > >
> > > > > > > https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
> > > > > > >
> > > > > > > Each of these potential additions would eventually need to be split off into an independent effort with associated additions to the columnar specification, IPC format, C ABI, integration tests, and so on.
> > > > > > >
> > > > > > > The document is open to anyone to comment, but if anyone would like edit access please feel free to request it, and I look forward to the discussion.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Wes