I marked the C++ implementation PR ready for review today and will soon be working on the Go implementation.
https://github.com/apache/arrow/pull/35345 Note that differently from Velox's ArrayVector, the Arrow implementation (ListView) also features a 64-bit version (LargeListView) to be symmetrical with the existing List and LargeList types. -- Felipe On Tue, Aug 22, 2023 at 12:09 AM Andrew Lamb <al...@influxdata.com> wrote: > I am convinced that the benefits of the ArrayViews as discussed on this > thread, despite the inconvenience of two similar formats (ListArray and > ArrayView) and equivalent formats, are enough to add it to the Arrow spec. > > It is my opinion we should add ArrowView to the Arrow format, and really > push forward the notion of composable data management. I think it is a > compelling vision and one Arrow can very much help to create. > > Thank you for your well worded summary Pedro and for everyone else who has > shared their opinion on this issue. > > Andrew > > On Thu, Aug 17, 2023 at 11:58 AM Pedro Eugenio Rocha Pedreira > <pedro...@meta.com.invalid> wrote: > > > Hi all, > > > > Getting back to this thread as I realize there were a few unanswered > > questions. Adding a bit more context on the rationale and usage of > > ArrayViews in Velox, and the importance to standardize it: > > > > re: Why do we need it? > > > > We use ArrayViews for two main reasons. First, for efficient execution of > > conditionals (IF/SWITCHs) that generate list types (array/maps). In a > > vectorized engine, there are two ways to execute conditionals. One will > > always first generate the condition bitmap that decides which branch each > > row will take, otherwise you would pay the dispatch cost for each row. > Once > > the bitmap is generated, without ArrayView, one would write each branch > to > > its own buffer output (one containing the "then" rows, and one with the > > "else" rows), then at the end conduct an additional step to merge these > two > > into a third output buffer, taking the condition bitmap into account. > This > > is analogous to a "scatter" operation. > > > > With ArrayViews one can write them to their final location with a single > > pass over each branch, removing the need for the final step. This is > always > > faster as it removes the need for a last step. As raised before, the > actual > > perf improvement will depend on the data types, their average sizes, cost > > of evaluating conditions, branches, buffer sizes and many other factors. > > One can play with these factors and tweak a microbenchmark to make it > more > > of more or less evident (as convenient), but the fact is that it will > > always be faster. We have found extensive empirical data in real > workloads > > to support these claims. We have found that it is common to find > workloads > > with large nested conditional statements rearranging maps for feature > > engineering workloads. Often, they happen right after table scans, so the > > additional copies and cache misses can really add up. > > > > The second reason is convenience for slicing array/maps. One can think of > > array_slice(), array_trim(), and similar functions (and their map > > counterparts). With ArrayViews as output they can be implemented > zero-copy > > by only adjusting the offsets and lengths, without ever touching (or even > > having to decode) the base elements. This is also faster on every case. > > > > For these reasons, I would argue that the ArrayView format is the > > state-of-the-art representation for any modern vectorized query engine > that > > really cares about performance (like Velox, DuckDB, and similar). If > > there's no Arrow counterpart for it, they (we) will continue diverging > from > > Arrow, and we might end up missing the mark of standardization. > > > > re: Where do we need it? > > > > An interesting point raised is that this could be seen as an > > implementation detail inside these engines, and that we might not need it > > at the engine boundary. While this is a fair way of looking at it, we've > > found two use cases where this presents a hurdle: > > > > First, in the Gluten project that enables Velox to be integrated within > > Spark, for example, the communication between Velox and Spark java is > done > > via JNI. In this case, the very data coming from a table scan + > projection, > > which in many cases will be represented using an ArrayView in Velox, > needs > > to be efficiently shipped to the Java-land. Without ArrayView support, > this > > communication will not be done via Arrow as we would rather specialize > than > > incur in an extra copy. The same for other bridges between systems, like > a > > DuckDB<->Velox present in Velox, which for the same reason (in addition > to > > historical lack of StringView) is not built using Arrow today. > > > > The second, as mentioned by Will, is UDF support. For these use cases, it > > would be very convenient to support UDF packages in different languages > > written based on the Arrow format. However, that means again that data > > coming from a Velox plan could be using the ArrayView format. Once more, > > requiring an extra copy on that step will end up pushing developers to > > build a more specialized API instead. > > > > In general, I recognize the need for stability and simplicity in the > > format, but wonder if inflexibility given the evolving state-of-the-art > > could lead to further divergence and obsolescence. Weston's proposal of > > canonical alternative layouts seems like a fine way to avoid this > pitfall, > > IMHO. > > > > re: Which libraries need it? > > > > One last point to be made is that though it is useful to have ArrayView > > support in the Arrow C++ library, most of the systems I described do not > > use it and rather have their own implementation. The most important > piece, > > in our view, is the unified and agreed upon memory layout, and a way to > > push this through the stable C ABI. > > > > re: why not dictionaries? > > > > The suggested dictionary over variable-sized arrays approach can be used > > out-of-the-box and indeed gives you some flexibility, but there are two > > shortcomings: (a) like Felipe pointed out, it incurs in an extra > > indirection (cache miss) in order to fetch the array, (b) it still > doesn't > > provide the same flexibility to slice the array/map (you can reorder > > arrays, but not slice them). > > > > I hope this adds a bit more context on the usage and relevance of this > > format in modern vectorized engines. For further reading, our VLDB paper > > has a section explaining in more detail the reasoning behind our arrow > > divergence [0] in Section 4.2.1. As a proud supporter of composability in > > data management [1], I really hope we can find ways to converge our > > standards and build more composable data stacks. > > > > [0] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf > > [1] - https://dl.acm.org/doi/10.14778/3603581.3603604 > > > > Best, > > -- > > Pedro Pedreira > > >