"1. Is it so those projects could use the arrow compute kernels instead of their own?"
This is the case we met. We offloaded the Spark JVM SQL engine to a native library (gazelle-plugin), which is based on the Arrow compute kernels, but a customer asked whether Velox could be used as the native backend instead, because that would let them share the same native library between Spark and Presto. Velox has no Java API, so we would like to reuse our Arrow implementations in Spark; that way Spark only needs to depend on the Arrow columnar format, which is already there. Today we need to convert between the Velox format and the Arrow format on each JNI call, which is expensive when common types like strings differ significantly. If Arrow and Velox could use the same format for common data types, it would benefit both communities; even if not exactly the same, a tolerable conversion overhead would still be workable. Backward compatibility in the libraries will be a big challenge, since all Arrow libraries would need to check the Arrow format version.

Thanks
Binwei

-----Original Message-----
From: Andrew Lamb <al...@influxdata.com>
Sent: Thursday, December 16, 2021 21:40
To: dev <dev@arrow.apache.org>
Cc: Micah Kornfield <emkornfi...@gmail.com>
Subject: Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

> DuckDB and Velox are two projects which have designed themselves to be very nearly Arrow-compatible but have implemented alternative memory layouts to achieve O(# records) selections on all data types. I am proposing to adopt these innovations as additional memory layouts in Arrow with a target of zero-copy across the C ABI

I have probably missed this, and if so I apologize. What is the use case for sharing data to/from DuckDB and Velox to the C/C++ implementation of Arrow via the C ABI?

1. Is it so those projects could use the arrow compute kernels instead of their own?
2. Enable high speed parquet writing directly from output produced by those engines?
3. Some other reason?

I think a key part of Arrow's strength is its interoperability between languages and its serializability. For more specialized needs in query engines, perhaps using specialized implementations such as StringView with raw pointers and converting to/from Arrow at the boundaries could be reasonable.

Andrew

On Wed, Dec 15, 2021 at 7:40 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Wed, Dec 15, 2021 at 6:22 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > In any case, having memory layouts that support O(# records) selections on strings and nested data will greatly benefit some data processing systems built on Arrow.
> >
> > Wes, something that still isn't clear to me: are we proposing these new encodings for ONLY the C ABI, or do we want to plumb them or their analogues through the IPC/File layout as well?
> >
> > In my mind, I think plumbing RLE through the IPC would definitely make sense. The other ones might or might not make sense given the overhead needed to sanitize them properly.
>
> Regarding the IPC wire format:
>
> * String view: not necessary, but an IPC serialization format where only the non-inline data is written out to a data buffer (and swizzling pointers to offsets) could be created.
The unused "inline" > portions for short strings would need to be zeroed out to not leak > bytes > * Sequence view: would be simple to write to IPC without serialization > (just two buffers: one with offsets and another with sizes, and a > normal child) > * RLE / constant / all-null: yes > > So it seems the string views are where there is uncertainty around > whether having some degree of IPC support is useful. We might need to > do some experiments to quantify the performance benefits (for example, > when transmitting such data over Flight). > > > Thanks, > > Micah > > > > On Wed, Dec 15, 2021 at 2:43 PM Wes McKinney <wesmck...@gmail.com> > wrote: > >> > >> On Wed, Dec 15, 2021 at 3:56 PM Micah Kornfield > >> <emkornfi...@gmail.com> > wrote: > >> > > >> > > > >> > > Big +1 in replacing our current representation of > >> > > variable-sized > arrays by > >> > > the "sequence view". atm I am -0.5 in adding it without > >> > > removing the [Large]Utf8Array / Binary / List, as I see the > >> > > advantages as > sufficiently > >> > > large to break compatibility and deprecate the previous > representations > >> > > (and do not enjoy maintaining multiple similar representations > >> > > that > solve > >> > > very similar problems). > >> > > >> > > >> > I don't think we should be deprecating anything in favor of these > >> > at a format level. There are many production systems built on > >> > the current encodings, and we've publicly documented such changes > >> > as a rare > occurrence > >> > [1]. If we do adopt the new encodings I think it is fine if most > systems > >> > only want to deal with them on system boundaries and convert to > >> > one > of the > >> > new encodings. > >> > >> I agree with Micah's perspective here. We don't want to disrupt any > >> production systems that have been built in the last 6 years. There > >> are many of them at this point. There might be systems that would > >> choose not to use these new encodings at all, and if they encounter > >> them at the C ABI, they could choose to convert to one of the > >> existing representations. > >> > >> For example, in the C ABI, one might choose: > >> > >> * UTF8 StringView -> convert to existing Arrow string layout > >> * List/Map SequenceView -> convert to existing Arrow list layout > >> * RLE / Constant View of anything -> unroll to non-RLE / Constant > >> Arrow > layout > >> > >> In a sense, giving systems the freedom to elect which > >> representation makes sense for them but not requiring them to > >> *emit* these new representations at the C ABI boundaries (beyond > >> potentially implementing the conversion functions if they choose > >> not to implement them at all). Converting from this StringView or > >> SequenceView representation to one of the existing memory layouts > >> is not complicated so I do not think this is too burdensome. > >> > >> In any case, having memory layouts that support O(# records) > >> selections on strings and nested data will greatly benefit some > >> data processing systems built on Arrow. If we don't have something > >> like this "natively" in Arrow, then you are forcing such systems to > >> serialize when pushing or pulling data from other Arrow-based systems. > >> Rather than have every system implement some solution for this > >> problem in a slightly different way, by establishing a standard in > >> Arrow we can increase the likelihood that more systems will use a > >> common representation that can be zero-copied (if that's what one > >> wants) through the C ABI. 
> >> > >> > [1] > https://arrow.apache.org/docs/format/Versioning.html#long-term-stabili > ty > >> > > >> > On Wed, Dec 15, 2021 at 1:35 PM Weston Pace > >> > <weston.p...@gmail.com> > wrote: > >> > > >> > > > I am -0.5 in adding it without removing the [Large]Utf8Array > >> > > > / Binary / List > >> > > > >> > > I'm not sure about dropping List. > >> > > > >> > > Is SequenceView semantically equivalent to List / > >> > > FixedSizeList? In other words, is SequenceView a nested type? > >> > > The document seems to suggest it is but the use case you > >> > > described does not. For example, in the C++ compute today you > >> > > cannot add List<INT32> + List<INT32> > but > >> > > I think you would want to be able to add SequenceView<INT32> + > >> > > SequenceView<INT32>. Also, the size of a List<INT32> is the # > >> > > of lists and not the # of items. For a SequenceView I think > >> > > the size > of > >> > > the array would be the number of items. I would also consider > >> > > it a semantic change to go from Struct{"x": INT32} to Struct{"X": > >> > > List<INT32>}. > >> > > > >> > > From the use case it sounds more like SequenceView would be > >> > > similar > to > >> > > dictionary and RLE, a different encoding for existing arrays. > >> > > However, it is possible I am misreading things. > >> > > > >> > > On Wed, Dec 15, 2021 at 10:49 AM Jorge Cardoso Leitão > >> > > <jorgecarlei...@gmail.com> wrote: > >> > > > > >> > > > Hi, > >> > > > > >> > > > Thanks a lot for this initiative and the write up. > >> > > > > >> > > > I did a small bench for the sequence view and added a graph > >> > > > to the > >> > > document > >> > > > for evidence of what Wes is writing wrt to performance of > "selection / > >> > > take > >> > > > / filter". > >> > > > > >> > > > Big +1 in replacing our current representation of > >> > > > variable-sized > arrays > >> > > by > >> > > > the "sequence view". atm I am -0.5 in adding it without > >> > > > removing > the > >> > > > [Large]Utf8Array / Binary / List, as I see the advantages as > sufficiently > >> > > > large to break compatibility and deprecate the previous > representations > >> > > > (and do not enjoy maintaining multiple similar > >> > > > representations > that solve > >> > > > very similar problems). > >> > > > > >> > > > Likewise, +1 for the RLE and -0.5 for the constant array, as > >> > > > the > latter > >> > > > seems redundant to me (it is an RLE). > >> > > > > >> > > > Wrt to the string view: would like to run some benches on > >> > > > that > too. Could > >> > > > someone clarify what are the "good cases" for that one? > >> > > > > >> > > > More generally, I second the point made by Antoine: there is > already some > >> > > > fragmentation over the types in the official implementations > >> > > > (see > [1]), > >> > > and > >> > > > we do not even have a common integration test suite for the c > >> > > > data interface. One approach to this dimension is to > >> > > > *deprecate* representations, which goes into the direction mentioned > >> > > > above. > >> > > > > >> > > > Wrt to design, we could consider a separate enum for the RLE > >> > > > vs > plain > >> > > > encoding, as they are not really semantic types (the > >> > > > dictionary > is also > >> > > not > >> > > > a semantic type but it is represented as one in at least the > >> > > > Rust implementation, unfortunately). 
> > > > > > On Wed, Dec 15, 2021 at 3:15 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > >
> > > > > > > Ultimately, the problem comes down to providing a means of O(# records) selection (take, filter) performance and memory use for non-numeric data (strings, arrays, maps, etc.).
> > > > > > >
> > > > > > > DuckDB and Velox are two projects which have designed themselves to be very nearly Arrow-compatible but have implemented alternative memory layouts to achieve O(# records) selections on all data types. I am proposing to adopt these innovations as additional memory layouts in Arrow with a target of zero-copy across the C ABI — how exactly they are translated to the IPC format seems less of an immediate benefit than enabling the in-memory performance/memory-use optimization, since query engines can accelerate performance with faster selections. If there are some alternative proposals to achieve O(# records) time and space complexity for selection operations, let's definitely look at them.
> > > > > > >
> > > > > > > On Tue, Dec 14, 2021 at 8:02 PM Weston Pace <weston.p...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Would it be simpler to change the spec so that child arrays can be chunked? This might reduce the data type growth and make the intent more clear.
> > > > > > > >
> > > > > > > > This will add another dimension to performance analysis. We pretty regularly get issues/tickets from users that have unknowingly created parquet files with poor row group resolution (e.g. 50 rows per row group) and experience rotten performance as a result. I suspect something similar could happen here. It sounds like arrays will naturally subdivide over time. Users might start seeing poor performance without realizing the root cause is that their 1 million element array has been split into 10,000 allocations of 100 elements. However, I suspect this is something that could be managed with visibility and recompaction utilities.
> > > > > > > > On Tue, Dec 14, 2021 at 1:22 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > hi folks,
> > > > > > > > >
> > > > > > > > > A few things in the general discussion, before certain things will have to be split off into their own dedicated discussions.
> > > > > > > > >
> > > > > > > > > It seems that I didn't do a very good job of motivating the "sequence view" type. Let me take a step back and discuss one of the problems these new memory layouts are solving.
> > > > > > > > >
> > > > > > > > > In Arrow currently, selection operations ("take", "filter", or indirect sort — the equivalent of arr.take(argsort(something_else)) if you're coming from NumPy) have time complexity proportional to the number of records for primitive types, and complexity proportional to max(# records, memory size) for nested types.
> > > > > > > > >
> > > > > > > > > So, for example:
> > > > > > > > >
> > > > > > > > > * Take(arr, indices) has O(# records) complexity for primitive types and does O(# records) memory allocation
> > > > > > > > > * Take(arr, indices) has O(max(# records, size of memory buffers / child arrays)) complexity for strings and nested types and does O(size of memory buffers) memory allocation
> > > > > > > > >
> > > > > > > > > This means that columnar query engines that leverage selections can experience heavy costs both in time complexity and memory use when doing selections on non-primitive array data. Selections may arise from filtering or sorting or other operations.
> > > > > > > > >
> > > > > > > > > The "String view" and "Sequence view" memory layouts in this document do not have this problem. When using these for strings and nested data, they have the same time complexity and memory allocation behavior for selections as primitive types, and the "child" memory buffers do not have to be manipulated or rebuilt at all. This has significant performance benefits and reduced memory use.
> > > > > > > > >
> > > > > > > > > Additionally, the string view and sequence view layouts solve the problem of out-of-order construction. As has been pointed out, one way to work around this issue at present is to use "chunked arrays". However, this means that you cannot ever use thread parallelism in the construction of non-chunked outputs with nested data (for example, in expression evaluation) — if a nested array forms part of a record batch, then either you must stick to single-threaded execution or use thread parallelism to subdivide even the other fields of the record batch that are non-nested to obtain equal-sized arrays across all fields. For example, if you had a record batch with 32K rows and wanted to parallelize execution of a projection using 4 threads — you would need to divide all fields into chunks of 8K each prior to beginning to produce outputs. This is fairly inflexible.
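For illustration, here is a minimal C++ sketch of that out-of-order construction, using a hypothetical fixed-size View slot type (not Arrow's actual API). Because every output slot has a fixed size, threads can fill disjoint ranges of a single output array independently, with no cumulative offsets to coordinate:

    #include <cstddef>
    #include <cstdint>
    #include <thread>
    #include <vector>

    // Each thread writes its own range of the output, out of order;
    // offset-based layouts cannot do this, because offsets accumulate
    // from left to right and force in-order construction.
    template <typename View>
    void ParallelTake(const std::vector<View>& in,
                      const std::vector<int64_t>& indices,
                      std::vector<View>* out) {
      out->resize(indices.size());
      auto worker = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) {
          (*out)[i] = in[indices[i]];  // relocate one fixed-size view
        }
      };
      const std::size_t mid = indices.size() / 2;
      std::thread other(worker, 0, mid);  // first half on another thread
      worker(mid, indices.size());        // second half on this thread
      other.join();
    }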
> > > > > > > > > As another motivating example, consider a parallel selection operation (e.g. "take" or "filter") on a nested array. Currently it is not possible to parallelize at all because of the in-order construction requirement.
> > > > > > > > >
> > > > > > > > > I don't expect you to just trust me — here is an example:
> > > > > > > > >
> > > > > > > > > https://gist.github.com/wesm/25fc7b877f913c7e4449117178302646
> > > > > > > > >
> > > > > > > > > In this example, I use Take to permute 1M doubles and 1M strings of 50 bytes each:
> > > > > > > > >
> > > > > > > > > * Doubles: 2.45ms (new memory allocated: 8000000)
> > > > > > > > > * Strings: 39.6ms (new memory allocated: 54000000)
> > > > > > > > >
> > > > > > > > > The performance ratio is 16x even though the memory ratio is only ~7x. With the "StringView" data type, only 16000000 bytes of new memory would need to be allocated, and the performance should be only 2-4x slower than the doubles case (because we only need to relocate a bunch of 16-byte structs) instead of 16x slower.
> > > > > > > > >
> > > > > > > > > I hope you can see now that this can be a rather serious resource utilization issue, both in processing time and memory use. I will update the document to explain this better and work on responding to some of the other comments.
> > > > > > > > >
> > > > > > > > > Wes
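To make the arithmetic behind those figures explicit (the breakdown is inferred from the numbers above, not stated in the gist): permuting 1M doubles allocates 1M x 8 bytes = 8,000,000 bytes; permuting 1M 50-byte strings with the current layout rebuilds both the character data and the offsets, roughly 1M x 50 bytes + ~1M x 4-byte offsets = 54,000,000 bytes; with string views, only the 16-byte view structs move, 1M x 16 bytes = 16,000,000 bytes.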
> > > > > > > > > On Tue, Dec 14, 2021 at 5:08 AM Antoine Pitrou <anto...@python.org> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > I think my main concern is how we can prevent the community from fragmenting too much over supported encodings. The more complex the encodings, the less likely they are to be supported by all main implementations. We see this in Parquet, where the efficient "delta" encodings have only just received support in Parquet C++, and even then, only on the read side.
> > > > > > > > > >
> > > > > > > > > > There is an additional subtlety in that Arrow is not a storage mechanism but represents data in memory, so pieces doing computation have to be adapted to the new encodings, for example the entire library of computation kernels in Arrow C++ (of course, an easy but inefficient adaptation is to always unpack to an already supported layout).
> > > > > > > > > >
> > > > > > > > > > As an anecdote, the Arrow C++ kernels are supposed to accept a selection vector to filter their physical inputs, but none actually supports it. I think we should be wary of adding ambitious new features that might never get an actual implementation.
> > > > > > > > > >
> > > > > > > > > > On the detail of the proposed encodings:
> > > > > > > > > >
> > > > > > > > > > - I hope we can avoid storing raw pointers instead of offsets into a separate buffer; I understand the flexibility argument for pointers, but it will also make data transfer more complicated
> > > > > > > > > >
> > > > > > > > > > - Constant arrays are a special case of RLE arrays, and I'm not sure doing both is really useful
> > > > > > > > > >
> > > > > > > > > > - I don't really understand the concrete use case for the weird "sequence view" layout; I'll note that non-monotonic offsets can make linear traversal less efficient, since the CPU won't automatically prefetch data for you
> > > > > > > > > >
> > > > > > > > > > - The proposed RLE encoding seems inefficient; usually, RLE encodings try hard to minimize the size overhead of RLE sequences, such that they become beneficial even for very short repeated runs
> > > > > > > > > >
> > > > > > > > > > Regards
> > > > > > > > > >
> > > > > > > > > > Antoine.
> > > > > > > > > >
> > > > > > > > > > Le 10/12/2021 à 20:28, Wes McKinney a écrit :
> > > > > > > > > > >
> > > > > > > > > > > This topic may provoke debate, but, given that Arrow is approaching its 6-year anniversary, I think this is an important discussion about how we can thoughtfully expand the Arrow specifications to support next-generation columnar data processing. In recent times, I have been motivated by interactions with CWI's DuckDB and Meta's Velox open source projects and the innovations they've made around data representation, providing beneficial features above and beyond what we already have in Arrow. For example, they have a 16-byte "string view" data type that enables buffer memory reuse, faster "false" comparisons on strings unequal in the first 4 bytes, and inline small strings. Both the Rust and C++ query engine efforts could potentially benefit from this (not sure about the memory safety implications in Rust; comments around this would be helpful).
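For illustration, here is a minimal C++ sketch of such a 16-byte view, following the Velox/DuckDB-style description above (the field names and the 12-byte inline threshold are assumptions for the example, not a finalized Arrow layout):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    struct StringView {
      int32_t length;
      char prefix[4];       // first 4 bytes of the string, zero-padded
      union {
        char inlined[8];    // strings up to 12 bytes live fully inline,
                            // contiguous with the prefix
        const char* data;   // longer strings point into a shared buffer,
                            // which is what enables buffer memory reuse
      };
    };

    // Strings that differ in length or in the first 4 bytes compare
    // unequal without ever touching the out-of-line buffer.
    inline bool Equal(const StringView& a, const StringView& b) {
      if (a.length != b.length) return false;
      if (std::memcmp(a.prefix, b.prefix, 4) != 0) return false;  // fast "false"
      const char* ap = a.length <= 12 ? a.prefix : a.data;
      const char* bp = b.length <= 12 ? b.prefix : b.data;
      return std::memcmp(ap, bp, static_cast<std::size_t>(a.length)) == 0;
    }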
> > > > > > > > > > > I wrote a document to start a discussion about a few new ways to represent data that may help with building Arrow-native/Arrow-compatible query engines:
> > > > > > > > > > >
> > > > > > > > > > > https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
> > > > > > > > > > >
> > > > > > > > > > > Each of these potential additions would need to be eventually split off into independent efforts with associated additions to the columnar specification, IPC format, C ABI, integration tests, and so on.
> > > > > > > > > > >
> > > > > > > > > > > The document is open to anyone to comment, but if anyone would like edit access please feel free to request it, and I look forward to the discussion.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Wes