> Can't implementations add support as needed? I assume that the "depending on what support [it] aspires to" implies this, but if a feature isn't used in a community then it can leave it unimplemented. On the flip side, if it is used in a community (e.g. C++) is there no way to upstream it without the support of every community?
I think that is something that is more tolerable for something like REE or dictionary support which is purely additive (e.g. JS and C# don't support unions yet and can get around to it when it is important). The challenge for this kind of "alternative layout" is that you start to get a situation where some implementations choose "option A" and others choose "option B" and it's not clearly a case of "this is a feature we haven't added support for yet". On Wed, Jun 14, 2023 at 2:01 PM Antoine Pitrou <anto...@python.org> wrote: > > So each community would have its own version of the Arrow format? > > > Le 14/06/2023 à 22:47, Aldrin a écrit : > > > Arrow has at least 7 native "official" implementations... 5 bindings > > on C++... and likely other implementations (like arrow2 in rust) > > > >> I think it is worth remembering that depending on what level of support > > ListView aspires to, such an addition could require non trivial changes > to > > many / all of those implementations (and the APIs they expose). > > > > Can't implementations add support as needed? I assume that the > > "depending on what support [it] aspires to" implies this, but if a > > feature isn't used in a community then it can leave it unimplemented. On > > the flip side, if it is used in a community (e.g. C++) is there no way > > to upstream it without the support of every community? > > > > > > > > Sent from Proton Mail for iOS > > > > > > On Wed, Jun 14, 2023 at 13:06, Raphael Taylor-Davies > > <r.taylordav...@googlemail.com.INVALID <mailto:On Wed, Jun 14, 2023 at > > 13:06, Raphael Taylor-Davies <<a href=>> wrote: > >> Even something relatively straightforward becomes a huge implementation > >> effort when multiplied by a large number of codebases, users and > >> datasets. Parquet is a great source of historical examples of the > >> challenges of incremental changes that don't meaningfully unlock new > >> use-cases. To take just one, Int96 was deprecated almost a decade ago, > >> in favour of some additional metadata over an existing physical layout, > >> and yet Int96 is still to the best of my knowledge used by Spark by > >> default. > >> > >> That's not to say that I think the arrow specification should ossify and > >> we should never change it, but I'm not hugely enthusiastic about adding > >> encodings that are only incremental improvements over existing > encodings. > >> > >> I therefore wonder if there are some new use-cases I am missing that > >> would be unlocked by this change, and that wouldn't be supported by the > >> dictionary proposal? Perhaps you could elaborate here? Whilst I do agree > >> using dictionaries as proposed is perhaps a less elegant solution, I > >> don't see anything inherently wrong with it, and if it ain't broke we > >> really shouldn't be trying to fix it. > >> > >> Kind Regards, > >> > >> Raphael Taylor-Davies > >> > >> On 14 June 2023 17:52:52 BST, Felipe Oliveira Carvalho > >> <felipe...@gmail.com> wrote: > >> > >> General approach to alternative formats aside, in the specific case > >> of ListView, I think the implementation complexity is being > >> overestimated in these discussions. The C++ Arrow implementation > >> shares a lot of code between List and LargeList. And with some > >> tweaks, I'm able to share that common infrastructure for ListView as > >> well. [1] ListView is similar to list: it doesn't require offsets to > >> be sorted and adds an extra buffer containing sizes. For symmetry > >> with the List and LargeList types (FixedSizeList not included), I'm > >> going to propose we add a LargeListView. That is not part of the > >> draft implementation yet, but seems like an obvious thing to have > >> now that I implemented the `if_else` specialization. [2] David Li > >> asked about this above and I can confirm now that 64-bit version of > >> ListView (LargeListView) is in the plans. Trying to avoid > >> re-implementing some kernels is not a good goal to chase, IMO, > >> because kernels need tweaks to take advantage of the format. [1] > >> https://github.com/apache/arrow/pull/35345 [2] > >> https://github.com/felipecrv/arrow/commits/list_view_backup -- > >> Felipe On Wed, Jun 14, 2023 at 12:08 PM Weston Pace > >> <weston.p...@gmail.com> wrote: > >> > >> perhaps we could support this use-case as a canonical > >> extension type over dictionary encoded, variable-sized arrays > >> > >> I believe this suggestion is valid and could be used to solve > >> the if-else case. The algorithm, if I understand it, would be > >> roughly: ``` // Note: Simple pseudocode, vectorization left as > >> exercise for the reader auto indices_builder = ... auto > >> list_builder = ... indices_builder.resize(batch.length); Array > >> condition_mask = condition.EvaluateBatch(batch); for row_index > >> in selected_rows(condition_mask): indices_builder[row_index] = > >> list_builder.CurrentLength(); > >> list_builder.Append(if_expr.EvaluateRow(batch, row_index)) for > >> row_index in unselected_rows(condition_mask): > >> indices_builder[row_index] = list_builder.CurrentLength(); > >> list_builder.Append(else_expr.EvaluateRow(batch, row_index)) > >> return DictionaryArray(indices_builder.Finish(), > >> list_builder.Finish()) ``` I also agree this is a slightly > >> awkward use of dictionaries (e.g. the dictionary would have the > >> same size as the # of indices) and perhaps not the most > >> intuitive way to solve the problem. My gut reaction is simply > >> "an improved if/else kernel is not, alone, enough justification > >> for a new layout" and yet... I think we are seeing the start (I > >> hope) of a trend where Arrow is not just used "between systems" > >> (e.g. to shuttle data from one place to another, or between a > >> query engine and a visualization tool) but also "within systems" > >> (e.g. UDFs, bespoke file formats and temporary tables, between > >> workers in a distributed query engine). When arrow is used > >> "within systems" I think both the number of bespoke formats and > >> the significance of conversion cost increases. For example, it's > >> easy to say that Velox should convert at the boundary as data > >> leaves Velox. But what if Velox (or datafusion or ...) were to > >> define an interface for UDFs. Would we want to use Arrow there > >> (e.g. the C data interface is a good fit)? If so, wouldn't the > >> conversion cost be more significant? > >> > >> Also, I'm very lukewarm towards the concept of "alternative > >> layouts" suggested somewhere else in this thread. It does > >> not seem a good choice to complexify the Arrow format that > >> much. > >> > >> I think, in my opinion, this depends on how many of these > >> alternative layouts exist. If there are just a few, then I > >> agree, we should just adopt them as formal first-class layouts. > >> If there are many, then I think it will be too much complexity > >> in Arrow to have all the different choices. Or, we could say > >> there are many, but the alternatives don't belong in Arrow at > >> all. In that case I think it's the same question as the above > >> paragraph, "do we want Arrow to be used within systems? Or just > >> between systems?" On Wed, Jun 14, 2023 at 2:07 AM Antoine Pitrou > >> <anto...@python.org> wrote: > >> > >> I agree that ListView cannot be an extension type, given > >> that it features a new layout, and therefore cannot > >> reasonably be backed by an existing storage type (AFAICT). > >> Also, I'm very lukewarm towards the concept of "alternative > >> layouts" suggested somewhere else in this thread. It does > >> not seem a good choice to complexify the Arrow format that > >> much. Regards Antoine. Le 07/06/2023 à 00:21, Felipe > >> Oliveira Carvalho a écrit : > >> > >> +1 on what Ian said. And as I write kernels for this new > >> format, I’m learning that it’s > >> > >> possible > >> > >> to re-use the common infrastructure used by List and > >> LargeList to > >> > >> implement > >> > >> the ListView related features with some adjustments. IMO > >> having this format as a second-class citizen would more > >> likely complicate things because it would make this > >> unification harder. — Felipe On Tue, 6 Jun 2023 at 18:45 > >> Ian Cook <ianmc...@apache.org> wrote: > >> > >> To clarify why we cannot simply propose adding > >> ListView as a new “canonical extension type”: The > >> extension type mechanism in Arrow depends on the > >> underlying data being organized in an existing Arrow > >> layout—that way an implementation that does not > >> support the extension type can still handle the > >> underlying data. But ListView is a wholly new > >> layout. I strongly agree with Weston’s idea that it > >> is a good time for Arrow to introduce the notion of > >> “canonical alternative layouts.” Taken together, I > >> think that canonical extension types and canonical > >> alternative layouts could serve as an “incubator” > >> for proposed new representations. For example, if a > >> proposed canonical alternative layout ends up being > >> broadly adopted, then that will serve as a signal > >> that we should consider adding it as a primary > >> layout in the core spec. It seems to me that most > >> projects that are implementing Arrow today are not > >> aiming to provide complete coverage of Arrow; rather > >> they are adopting Arrow because of its role as a > >> standard and they are implementing only as much of > >> the Arrow standard as they require to achieve some > >> goal. I believe that such projects are important > >> Arrow stakeholders, and I believe that this proposed > >> notion of canonical alternative layouts will serve > >> them well and will create efficiencies by > >> standardizing implementations around a shared set of > >> alternatives. However I think that the documentation > >> for canonical alternative layouts should strongly > >> encourage implementers to default to using the > >> primary layouts defined in the core spec and only > >> use alternative layouts in cases where the primary > >> layouts do not meet their needs. On Sat, May 27, > >> 2023 at 7:44 PM Micah Kornfield < > >> > >> emkornfi...@gmail.com> > >> > >> wrote: > >> > >> This sounds reasonable to me but my main concern > >> is, I'm not sure > >> > >> there > >> > >> is > >> > >> a great mechanism to enforce canonical layouts > >> don't somehow become > >> > >> default > >> > >> (or the only implementation). Even for these new > >> layouts, I think it might be worth rethinking > >> > >> binding > >> > >> a > >> > >> layout into the schema versus having a different > >> concept of encoding > >> > >> (and > >> > >> changing some of the corresponding data > >> structures). On Mon, May 22, 2023 at 10:37 AM > >> Weston Pace <weston.p...@gmail.com> > >> > >> wrote: > >> > >> Trying to settle on one option is a > >> fruitless endeavor. Each type > >> > >> has > >> > >> pros > >> > >> and cons. I would also predict that the > >> largest existing usage of > >> > >> Arrow is > >> > >> shuttling data from one system to another. > >> The newly proposed > >> > >> format > >> > >> doesn't appear to have any significant > >> advantage for that use case > >> > >> (if > >> > >> anything, the existing format is arguably > >> better as it is more > >> > >> compact). > >> > >> I am very biased towards historical > >> precedent and avoiding breaking changes. We > >> have "canonical extension types", perhaps it > >> is time for > >> > >> "canonical > >> > >> alternative layouts". We could define it as > >> such: * There are one or more primary > >> layouts * Existing layouts are automatically > >> considered primary layouts, > >> > >> even if > >> > >> they wouldn't have been primary layouts > >> initially (e.g. large list) * A new layout, > >> if it is semantically equivalent to another, is > >> > >> considered > >> > >> an alternative layout * An alternative > >> layout still has the same requirements for > >> > >> adoption > >> > >> (two > >> > >> implementations and a vote) * An > >> implementation should not feel pressured to > >> rush and > >> > >> implement > >> > >> the > >> > >> new layout. It would be good if they > >> contribute in the discussion and > >> > >> consider the > >> > >> layout and vote if they feel it would be an > >> acceptable design. * We can define and vote > >> and approve as many canonical alternative > >> > >> layouts > >> > >> as we want: * A canonical alternative layout > >> should, at a minimum, have some reasonable > >> justification, such as improved performance for > >> > >> algorithm X > >> > >> * Arrow implementations MUST support the > >> primary layouts * An Arrow implementation > >> MAY support a canonical alternative, > >> > >> however: > >> > >> * An Arrow implementation MUST first support > >> the primary layout * An Arrow implementation > >> MUST support conversion to/from the > >> > >> primary > >> > >> and canonical layout * An Arrow > >> implementation's APIs MUST only provide data > >> in the alternative layout if it is > >> explicitly asked for (e.g. schema inference > >> > >> should > >> > >> prefer the primary layout). * We can still > >> vote for new primary layouts (e.g. promoting a > >> > >> canonical > >> > >> alternative) but, in these votes we don't > >> only consider the value (e.g. performance) of > >> > >> the > >> > >> layout > >> > >> but also the interoperability. In other > >> words, a layout can only become a primary > >> layout if > >> > >> there > >> > >> is > >> > >> significant evidence that most > >> implementations plan to adopt it. This lets > >> us evolve support for new layouts more > >> naturally. We can generally assume that > >> users will not, initially, be aware of these > >> alternative layouts. However, everything > >> will just work. They may > >> > >> start > >> > >> to see a performance penalty stemming from a > >> lack of support for > >> > >> these > >> > >> layouts. If this performance penalty becomes > >> significant then they > >> > >> will > >> > >> discover it and become aware of the problem. > >> They can then ask > >> > >> whatever > >> > >> library they are using to add support for > >> the alternative layout. > >> > >> As > >> > >> enough users find a need for it then > >> libraries will add support. Eventually, > >> enough libraries will support it that we can > >> adopt it > >> > >> as a > >> > >> primary layout. Also, it allows libraries to > >> adopt alternative layouts more > >> > >> aggressively if > >> > >> they would like while still hopefully > >> ensuring that we eventually > >> > >> all > >> > >> converge on the same implementation of the > >> alternative layout. On Mon, May 22, 2023 at > >> 9:35 AM Will Jones <will.jones...@gmail.com > >> > >> wrote: > >> > >> Hello Arrow devs, I don't understand why > >> we would start deprecating features in the > >> > >> Arrow > >> > >> format. Even starting this talk > >> might already be a bad idea > >> > >> PR-wise. > >> > >> I agree we don't want to make breaking > >> changes to the Arrow format. > >> > >> But > >> > >> several maintainers have already stated > >> they have no interest in maintaining > >> both list types with full compute > >> functionality [1][2], > >> > >> so I > >> > >> think it's very likely one list type or > >> the other will be implicitly preferred > >> in the ecosystem if this data type was > >> added. > >> > >> If > >> > >> that's the case, I'd prefer that we > >> agreed as a community which one > >> > >> should > >> > >> be preferred. Maybe that's not the best > >> path; it's just one way for > >> > >> us to > >> > >> balance stability, maintenance burden, > >> and relevance. Can someone help distill > >> down the primary rationale and usecase for > >> > >> adding ArrayView to the Arrow Spec? > >> > >> Looking back at that old thread, I think > >> one of the main > >> > >> motivations > >> > >> is > >> > >> to > >> > >> try to prevent query engine implementers > >> from feeling there is a > >> > >> tradeoff > >> > >> between having state-of-the-art > >> performance and being Arrow-native. > >> > >> For > >> > >> some of the new array types, we had both > >> Velox and DuckDB use them, > >> > >> so it > >> > >> was reasonable to expect they were > >> innovations that might > >> > >> proliferate. > >> > >> I'm > >> > >> not sure if the ArrayView is part of > >> that. From Wes earlier [3]: The idea is > >> that in a world of data and query > >> federation (for > >> > >> example, > >> > >> consider [1] where Arrow is being > >> used as a data federation layer > >> > >> with > >> > >> many > >> > >> query engines), we want to increase > >> the amount of data in-flight > >> > >> and > >> > >> in-memory that is in Arrow format. > >> So if query engines are having > >> > >> to > >> > >> depart > >> > >> substantially from the Arrow format > >> to get performance, then this > >> > >> creates a > >> > >> potential lose-lose situation: * > >> Depart from Arrow: get better > >> > >> performance > >> > >> but pay serialization costs to read > >> and write Arrow (the > >> > >> performance > >> > >> and > >> > >> resource utilization benefits > >> outweigh the serialization costs). > >> > >> This > >> > >> puts > >> > >> additional pressure on query engines > >> to build specialized > >> > >> components > >> > >> for > >> > >> solving problems rather than making > >> use of off-the-shelf > >> > >> components > >> > >> that > >> > >> use Arrow. This has knock-on effects > >> on ecosystem fragmentation. * > >> > >> Or > >> > >> use > >> > >> Arrow, and accept suboptimal query > >> processing performance > >> > >> Will mentions one usecase is Velox > >> consuming python UDF output, > >> > >> which > >> > >> seems > >> > >> to be mostly about how fast Velox > >> can consume this format, not how > >> > >> fast > >> > >> it > >> > >> can be written. Are there other > >> usecases? > >> > >> To be clear, I don't know if that's the > >> use case they want. That's > >> > >> just > >> > >> me > >> > >> speculating. I still have some questions > >> myself: 1. Is this array type currently > >> only used in Velox? (not DuckDB > >> > >> like > >> > >> some > >> > >> of the other new types?) What evidence > >> do we have that it will > >> > >> become > >> > >> used > >> > >> outside of Velox? 2. We already have > >> three list types: list, large list (64-bit > >> > >> offsets), > >> > >> and > >> > >> fixed size list. Do we think we will > >> only want a view version of > >> > >> the > >> > >> 32-bit > >> > >> offset variable length list? Or are we > >> potentially talking about > >> > >> view > >> > >> variants for all three? Best, Will Jones > >> [1] > >> > >> https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx > >> > >> [2] > >> > >> https://lists.apache.org/thread/cc4w3vs3foj1fmpq9x888k51so60ftr3 > >> > >> [3] > >> > >> https://lists.apache.org/thread/mk2yn62y6l8qtngcs1vg2qtwlxzbrt8t > >> > >> On Mon, May 22, 2023 at 3:48 AM Andrew > >> Lamb <al...@influxdata.com> > >> > >> wrote: > >> > >> Can someone help distill down the > >> primary rationale and usecase > >> > >> for > >> > >> adding ArrayView to the Arrow Spec? > >> From the above discussions, the > >> stated rationale seems to be fast > >> (zero-copy) interchange with Velox. > >> This thread has qualitatively > >> enumerated the benefits of > >> > >> (offset+len) > >> > >> encoding over the existing Arrow > >> ListArray (offets) approach, but > >> > >> I > >> > >> haven't > >> > >> seen any performance measurements > >> that might help us to gauge the > >> > >> tradeoff > >> > >> in additional complexity vs runtime > >> overhead. Will mentions one usecase > >> is Velox consuming python UDF output, > >> > >> which > >> > >> seems > >> > >> to be mostly about how fast Velox > >> can consume this format, not how > >> > >> fast > >> > >> it > >> > >> can be written. Are there other > >> usecases? Do we have numbers showing > >> how much overhead converting to /from > >> > >> Velox's > >> > >> internal representation and the > >> existing ListArray adds? Has > >> > >> anyone in > >> > >> Velox land considered adding faster > >> support for Arrow style > >> > >> ListArray > >> > >> encoding? Andrew On Mon, May 22, > >> 2023 at 4:38 AM Antoine Pitrou < > >> > >> anto...@python.org > >> > >> wrote: > >> > >> Hi, I don't understand why we > >> would start deprecating features > >> in the > >> > >> Arrow > >> > >> format. Even starting this talk > >> might already be a bad idea > >> > >> PR-wise. > >> > >> As for implementing conversions > >> at the I/O boundary, it's a > >> > >> reasonably > >> > >> policy, but it still requires > >> work by implementors and it's not > >> > >> granted > >> > >> that all consumers of the Arrow > >> format will grow such > >> conversions if/when we add > >> non-trivial types such as > >> ListView or StringView. Regards > >> Antoine. Le 22/05/2023 à 00:39, > >> Will Jones a écrit : > >> > >> One more thing: Looking back > >> on the previous discussion[1] > >> > >> (which > >> > >> Weston > >> > >> pointed out in his earlier > >> message), Jorge suggested > >> that the > >> > >> old > >> > >> list > >> > >> types might be deprecated in > >> favor of view variants [2]. > >> Others > >> > >> were > >> > >> worried that it might > >> undermine the perception > >> that the Arrow > >> > >> format > >> > >> is > >> > >> stable. I think it might be > >> worth thinking about "soft > >> > >> deprecating" > >> > >> the > >> > >> old > >> > >> list type: suggesting new > >> implementations prefer the list > >> > >> view, but > >> > >> reassuring that > >> implementations should > >> support the old format, > >> > >> even > >> > >> if > >> > >> they > >> > >> just convert to the new > >> format. To be clear, this > >> wouldn't > >> > >> mean we > >> > >> plan > >> > >> to > >> > >> create breaking changes in > >> the format; but if we ever > >> did for > >> > >> other > >> > >> reasons, the old list type > >> might go. Arrow compute > >> libraries could choose > >> either format for compute > >> > >> support, > >> > >> and > >> > >> plan to do conversion at the > >> boundaries. Libraries that use > >> > >> the new > >> > >> type > >> > >> will have cheap conversion > >> when reading the old type. > >> Meanwhile > >> > >> those > >> > >> that > >> > >> are still on the old type > >> will have some incentive to > >> move > >> > >> towards > >> > >> the > >> > >> new > >> > >> one, since that conversion > >> will not be as efficient. [1] > >> > >> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > >> > >> > >> [2] > >> > >> https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx > >> > >> > >> On Sun, May 21, 2023 at > >> 3:07 PM Will Jones < > >> > >> will.jones...@gmail.com> > >> > >> wrote: > >> > >> Hello, I think Sasha > >> brings up a good point, > >> that the advantages of > >> > >> this > >> > >> format > >> > >> seem to be primarily > >> about query processing. > >> Other encodings > >> > >> like > >> > >> REE > >> > >> and > >> > >> dictionary have > >> space-saving advantages > >> that justify them > >> > >> simply > >> > >> in > >> > >> terms > >> > >> of space efficiency > >> (although they have > >> query processing > >> > >> advantages > >> > >> as > >> > >> well). As discussed, > >> most use cases are > >> already well served by > >> > >> existing > >> > >> list types and > >> dictionary encoding. I > >> agree that there are > >> cases where transferring > >> this type > >> > >> without > >> > >> conversion would be > >> ideal. One use case I > >> can think of is if > >> > >> Velox > >> > >> wants to > >> > >> be able to take > >> Arrow-based UDFs > >> (possibly written with > >> > >> PyArrow, > >> > >> for > >> > >> example) that operate on > >> this column type and > >> therefore wants > >> > >> zero-copy > >> > >> exchange over the C Data > >> Interface. One big > >> question I have: we > >> already have three list > >> types: > >> > >> list, > >> > >> large > >> > >> list (64-bit offsets), > >> and fixed size list. Do > >> we think we > >> > >> will > >> > >> only > >> > >> want a > >> > >> view version of the > >> 32-bit offset variable > >> length list? Or > >> > >> are we > >> > >> potentially talking > >> about view variants for > >> all three? Best, Will > >> Jones On Sun, May 21, > >> 2023 at 2:19 PM Felipe > >> Oliveira Carvalho < > >> felipe...@gmail.com> wrote: > >> > >> The benefit of > >> having a memory > >> format that’s > >> friendly to > >> > >> non-deterministic > >> > >> order writes is > >> unlocked by the > >> transport and > >> processing of > >> > >> the > >> > >> data > >> > >> being > >> > >> agnostic to the > >> physical order as > >> much as possible. > >> Requiring a > >> conversion could > >> cancel out that > >> benefit. But it > >> > >> can > >> > >> be a > >> > >> provisory step for > >> compatibility > >> between systems that > >> don’t > >> > >> understand > >> > >> the > >> > >> format yet. This is > >> similar to the > >> situation with > >> compression > >> > >> schemes > >> > >> like > >> > >> run-end encoding — > >> the goal is > >> processing the > >> compressed data > >> > >> directly > >> > >> without an expansion > >> step whenever > >> possible. This is > >> why having it as > >> part of the open > >> Arrow format is so > >> > >> important: > >> > >> everyone can agree > >> on a format that’s > >> friendly to parallel > >> > >> and/or > >> > >> vectorized compute > >> kernels without > >> introducing multiple > >> > >> incompatible > >> > >> formats to the > >> ecosystem and > >> without imposing a > >> conversion > >> > >> step > >> > >> between > >> > >> the different > >> systems. — Felipe On > >> Sat, 20 May 2023 at > >> 20:04 Aldrin > >> > >> <octalene....@pm.me.invalid> > >> > >> wrote: > >> > >> I don't feel > >> like this > >> representation > >> is necessarily a > >> > >> detail of > >> > >> the > >> > >> query > >> > >> engine, but I am > >> also not sure > >> why this > >> representation > >> would > >> > >> have > >> > >> to > >> > >> be > >> > >> converted to a > >> non-view format > >> when > >> serializing. > >> Could you > >> > >> clarify > >> > >> that? My > >> > >> impression is > >> that this > >> representation > >> could be used for > >> > >> persistence > >> > >> or > >> > >> data transfer, > >> though it can be > >> more complex to > >> guarantee > >> > >> the > >> > >> portion > >> > >> of > >> > >> the buffer that > >> an index points > >> to is also > >> present in > >> > >> memory. > >> > >> Sent from Proton > >> Mail for iOS On > >> Sat, May 20, > >> 2023 at 15:00, > >> Sasha Krassovsky < > >> > >> krassovskysa...@gmail.com > >> > >> > >> <On+Sat,+May+20,+2023+at+15:00,+Sasha+Krassovsky+%3C%3Ca+href=>> > >> > >> > >> wrote: > >> > >> Hi everyone, I > >> understand that > >> there are > >> numerous > >> benefits to this > >> > >> representation > >> > >> during query > >> processing, but > >> would it be fair > >> to say that > >> > >> this > >> > >> is > >> > >> an > >> > >> implementation > >> detail of the > >> query engine? > >> Query engines > >> > >> don’t > >> > >> necessarily > >> > >> need to conform > >> to the Arrow > >> format > >> internally, only at > >> > >> ingest/egress > >> > >> points, and > >> performing a > >> conversion from > >> the non-view to > >> > >> view > >> > >> format > >> > >> seems > >> > >> like it would be > >> very cheap > >> (though I > >> understand not > >> > >> necessarily > >> > >> the > >> > >> other > >> > >> way around, but > >> you’d need to do > >> that anyway if > >> you’re > >> > >> serializing). > >> > >> Sasha Krassovsky > >> > >> 20 мая 2023 > >> г., в 13:00, > >> Will Jones < > >> > >> will.jones...@gmail.com> > >> > >> написал(а): > >> > >> Thanks for > >> sharing > >> these > >> details, > >> Pedro. The > >> conditional > >> > >> branches > >> > >> argument > >> > >> makes a lot > >> of sense to > >> me. The > >> tensors > >> point brings > >> up some > >> interesting > >> issues. For > >> > >> now, > >> > >> we've > >> > >> defined > >> > >> our only > >> tensor > >> extension > >> type to be > >> built on a > >> fixed size > >> > >> list. > >> > >> If a > >> > >> use > >> > >> case of this > >> might be > >> manipulating > >> tensors with > >> zero copy, > >> > >> perhaps > >> > >> that > >> > >> suggests > >> that we want > >> a fixed size > >> list > >> variant? In > >> > >> addition, > >> > >> would > >> > >> we > >> > >> have > >> > >> to define > >> another > >> extension > >> type to be a > >> ListView > >> variant? > >> > >> Or > >> > >> would > >> > >> we > >> > >> want > >> > >> to think > >> about making > >> extension > >> types > >> somehow > >> valid across > >> > >> various > >> > >> encodings of > >> the same > >> "logical type"? > >> > >> On Fri, > >> May 19, > >> 2023 at > >> 1:59 PM > >> Pedro > >> Eugenio > >> Rocha > >> > >> Pedreira > >> > >> <pedro...@meta.com.invalid> > >> wrote: > >> Hi all, > >> This is > >> Pedro > >> from the > >> Velox > >> team at > >> Meta. > >> This is my > >> > >> first > >> > >> time > >> > >> here, > >> > >> so > >> > >> nice to > >> e-meet > >> you! > >> Adding > >> to what > >> Felipe > >> said, > >> the main > >> reason > >> we created > >> > >> “ListView” > >> > >> (though > >> > >> we just > >> call > >> them > >> ArrayVector/MapVector > >> in > >> Velox) > >> is that, > >> > >> along > >> > >> with > >> > >> StringViews > >> for > >> strings, > >> they > >> allow us > >> to write > >> any > >> > >> columnar > >> > >> buffer > >> > >> out-or-order, > >> regardless > >> of their > >> types or > >> encodings. > >> > >> This is > >> > >> naturally > >> > >> doable > >> for all > >> primitive > >> types > >> (fixed-size), > >> but not for > >> > >> types > >> > >> that > >> > >> don’t > >> > >> have > >> fixed > >> size and > >> are > >> required > >> to be > >> contiguous. > >> The > >> > >> StringView > >> > >> and > >> > >> ListView > >> formats > >> allow us > >> to keep > >> this > >> invariant > >> in > >> Velox. > >> Being > >> able to > >> write > >> vectors > >> out-of-order > >> is > >> useful when > >> > >> executing > >> > >> conditionals > >> like > >> IF/SWITCH > >> statements, > >> which are > >> > >> pervasive > >> > >> among > >> > >> our > >> > >> workloads. > >> To fully > >> vectorize > >> it, one > >> first > >> evaluates > >> the > >> > >> expression, > >> > >> then > >> > >> generate > >> a bitmap > >> containing > >> which > >> rows > >> take the > >> THEN and > >> > >> which > >> > >> take > >> > >> the > >> > >> ELSE > >> branch. > >> Then you > >> populate > >> all rows > >> that > >> match the > >> > >> first > >> > >> branch > >> > >> by > >> > >> evaluating > >> the THEN > >> expression > >> in a > >> vectorized > >> > >> (branch-less > >> > >> and > >> > >> cache > >> > >> friendly) > >> way, and > >> subsequently > >> the ELSE > >> branch. > >> If you > >> > >> can’t > >> > >> write > >> > >> them > >> > >> out-of-order, > >> you > >> would > >> either > >> have a > >> big > >> branch > >> per row > >> > >> dispatching > >> > >> to > >> > >> the > >> > >> right > >> expression > >> (slow), > >> or > >> populate > >> two > >> distinct > >> vectors > >> > >> then > >> > >> merging > >> > >> them > >> > >> at the > >> end > >> (potentially > >> even > >> slower). > >> How much > >> faster our > >> > >> approach > >> > >> is > >> > >> highly > >> depends > >> on the > >> buffer > >> sizes > >> and > >> expressions, > >> but we > >> > >> found > >> > >> it > >> > >> to > >> > >> be > >> > >> faster > >> enough > >> on > >> average > >> to > >> justify > >> us > >> extending > >> the > >> > >> underlying > >> > >> layout. > >> > >> With > >> that > >> said, > >> this is > >> all > >> within a > >> single > >> thread of > >> > >> execution. > >> > >> Parallelization > >> is done > >> by > >> feeding > >> each > >> thread > >> its own > >> > >> vector/data. > >> > >> As > >> > >> pointed > >> out in a > >> previous > >> message, > >> this > >> also > >> gives > >> you the > >> > >> flexibility > >> > >> to > >> > >> implement > >> cardinality > >> increasing/reducing > >> operations, > >> but > >> > >> we > >> > >> don’t > >> > >> use > >> > >> it > >> > >> for that > >> purpose. > >> Operations > >> like > >> filtering, > >> joining, > >> > >> unnesting > >> > >> and > >> > >> similar > >> > >> are done > >> by > >> wrapping > >> the > >> internal > >> vector > >> in a > >> dictionary, > >> > >> as > >> > >> these > >> > >> need > >> > >> to > >> > >> work not > >> only on > >> “ListViews” > >> but on > >> any data > >> types with > >> > >> any > >> > >> encoding. > >> > >> There > >> > >> are more > >> details > >> on > >> Section > >> 4.2.1 in > >> [1] > >> Beyond > >> this, it > >> also > >> gives > >> function/kernel > >> developers > >> more > >> > >> flexibility > >> > >> to > >> > >> implement > >> operations > >> that > >> manipulate > >> Arrays/Maps. > >> For > >> > >> example, > >> > >> operations > >> > >> that > >> slice > >> these > >> containers > >> can be > >> implemented > >> in a > >> > >> zero-copy > >> > >> manner > >> > >> by > >> > >> just > >> rearranging > >> the > >> lengths/offsets > >> indices, > >> without > >> ever > >> > >> touching > >> > >> the > >> > >> larger > >> internal > >> buffers. > >> This is > >> a > >> similar > >> motivation > >> as > >> > >> for > >> > >> StringView > >> > >> (think > >> of > >> substr(), > >> trim(), > >> and > >> similar). > >> One nice > >> last > >> > >> property > >> > >> is > >> > >> that > >> > >> this > >> layout > >> allows > >> for > >> overlapping > >> ranges. > >> This is > >> > >> something > >> > >> discussed > >> > >> with > >> > >> our ML > >> people > >> to allow > >> deduping > >> feature > >> values > >> in a tensor > >> > >> (which > >> > >> is > >> > >> fairly > >> > >> common), > >> but not > >> something > >> we have > >> leveraged > >> just > >> yet. [1] > >> - > >> https://vldb.org/pvldb/vol15/p3372-pedreira.pdf > >> Best, -- > >> Pedro > >> Pedreira > >> ------------------------------------------------------------------------ > >> From: > >> Felipe > >> Oliveira > >> Carvalho > >> <felipe...@gmail.com> > >> Sent: > >> Friday, > >> May 19, > >> 2023 > >> 10:01 AM > >> To: > >> dev@arrow.apache.org > >> <dev@arrow.apache.org> > >> Cc: > >> Pedro > >> Eugenio > >> Rocha > >> Pedreira > >> <pedro...@meta.com> > >> Subject: > >> Re: > >> [DISCUSS][Format] > >> Starting > >> the draft > >> > >> implementation > >> > >> of > >> > >> the > >> > >> ArrayView > >> array > >> format > >> +pedroerp > >> On Thu, > >> 11 May > >> 2023 at > >> 17: 51 > >> Raphael > >> > >> Taylor-Davies > >> > >> <r. > >> > >> taylordavies@ > >> googlemail. > >> com. > >> invalid> > >> wrote: > >> Hi All, > > >> > >> if > >> > >> we > >> > >> added > >> > >> this, do > >> we think > >> many > >> Arrow > >> and > >> query > > >> engine > >> > >> implementations > >> > >> (for > >> > >> example, > >> DataFusion) > >> will be > >> ZjQcmQRYFpfptBannerStart > >> This > >> Message > >> Is From > >> an > >> External > >> Sender > >> ZjQcmQRYFpfptBannerEnd > >> +pedroerp > >> On Thu, > >> 11 May > >> 2023 at > >> 17:51 > >> Raphael > >> Taylor-Davies > >> <r.taylordav...@googlemail.com.invalid> > >> wrote: > >> Hi All, > >> > >> if > >> we > >> added > >> this, > >> do > >> we > >> think > >> many > >> Arrow > >> and > >> query > >> engine > >> implementations > >> (for > >> example, > >> DataFusion) > >> will be > >> > >> eager > >> > >> to > >> > >> add > >> > >> full > >> > >> support > >> for > >> the > >> type, > >> including > >> compute > >> kernels? > >> Or are > >> > >> they > >> > >> likely > >> > >> to > >> > >> just > >> > >> convert > >> this > >> type > >> to > >> ListArray > >> at > >> import > >> boundaries? > >> > >> > >> I can't > >> speak > >> for > >> query > >> engines > >> in > >> general, > >> but at > >> least > >> > >> for > >> > >> arrow-rs > >> > >> and by > >> extension > >> DataFusion, > >> and > >> based on > >> my current > >> > >> understanding > >> > >> of > >> > >> the > >> use-cases > >> I would > >> be > >> rather > >> hesitant > >> to add > >> support > >> > >> to the > >> > >> kernels > >> > >> for this > >> array > >> type, > >> definitely > >> instead > >> favouring > >> > >> conversion > >> > >> at > >> > >> the > >> > >> edges. > >> We > >> already > >> have > >> issues > >> with the > >> amount > >> of code > >> > >> generation > >> > >> resulting > >> in > >> binary > >> bloat > >> and long > >> compile > >> times, > >> and I > >> > >> worry > >> > >> this > >> > >> would > >> > >> worsen > >> this > >> situation > >> whilst > >> not > >> really > >> providing > >> > >> compelling > >> > >> advantages > >> > >> for the > >> vast > >> majority > >> of > >> workloads > >> that > >> don't > >> interact > >> > >> with > >> > >> Velox. > >> > >> Whilst I > >> can > >> definitely > >> see that > >> the > >> ListView > >> > >> representation > >> > >> is > >> > >> probably > >> > >> a better > >> way to > >> represent > >> variable > >> length > >> lists > >> than what > >> > >> arrow > >> > >> settled > >> > >> upon, > >> I'm not > >> yet > >> convinced > >> it is > >> sufficiently > >> better to > >> > >> incentivise > >> > >> broad > >> ecosystem > >> adoption. > >> Kind > >> Regards, > >> Raphael > >> Taylor-Davies > >> > >> > >> On > >> 11/05/2023 > >> 21:20, > >> Will > >> Jones > >> wrote: > >> Hi > >> Felipe, > >> Thanks > >> for > >> the > >> additional > >> details. > >> > >> > >> Velox > >> kernels > >> benefit > >> from > >> being > >> able > >> to > >> append > >> data > >> to > >> > >> the > >> > >> array > >> > >> from > >> > >> different > >> threads > >> without > >> care > >> for > >> strict > >> ordering. > >> > >> > >> Only the > >> > >> offsets > >> > >> array > >> > >> has > >> to > >> be > >> written > >> according > >> to > >> logical > >> order > >> but > >> that > >> is > >> > >> potentially a > >> > >> much > >> > >> smaller > >> buffer > >> than > >> the > >> values > >> buffer. > >> > >> > >> It > >> still > >> seems > >> to > >> me > >> like > >> applications > >> are > >> still > >> pretty > >> > >> niche, > >> > >> as I > >> > >> suspect > >> > >> in > >> most > >> cases > >> the > >> benefits > >> are > >> outweighed > >> by > >> the > >> costs. > >> > >> The > >> > >> benefit > >> > >> here > >> > >> seems > >> pretty > >> limited: > >> if > >> you > >> are > >> trying > >> to > >> split > >> work > >> > >> between > >> > >> threads, > >> > >> usually > >> you > >> will > >> have > >> other > >> levels > >> such > >> as > >> array > >> chunks > >> > >> to > >> > >> parallelize. > >> > >> And > >> > >> if > >> you > >> have > >> an > >> incoming > >> stream > >> of > >> row > >> data, > >> you'll > >> want > >> > >> to > >> > >> append > >> > >> in > >> > >> predictable > >> order > >> to > >> match > >> the > >> order > >> of > >> the > >> other > >> > >> arrays. Am > >> > >> I > >> > >> missing > >> > >> something? > >> And, > >> IIUC, > >> the > >> cost > >> of > >> using > >> ListView > >> with > >> out-of-order > >> > >> > >> values > >> > >> over > >> > >> ListArray > >> is > >> you > >> lose > >> memory > >> locality; > >> the > >> values > >> of > >> > >> element > >> > >> 2 > >> > >> are > >> > >> no > >> > >> longer > >> adjacent > >> to > >> the > >> values > >> of > >> element > >> 1. > >> What > >> do you > >> > >> think > >> > >> about > >> > >> that > >> > >> tradeoff? > >> I > >> don't > >> mean > >> to > >> be > >> difficult > >> about > >> this. > >> I'm > >> excited > >> for > >> > >> both > >> > >> the > >> > >> REE > >> > >> and > >> > >> StringView > >> arrays, > >> but > >> this > >> one > >> I'm > >> not > >> so > >> sure > >> about > >> > >> yet. I > >> > >> suppose > >> > >> what I > >> > >> am > >> trying > >> to > >> ask > >> is, > >> if > >> we > >> added > >> this, > >> do > >> we > >> think > >> many > >> > >> Arrow > >> > >> and > >> > >> query > >> > >> engine > >> implementations > >> (for > >> example, > >> DataFusion) > >> will be > >> > >> eager > >> > >> to > >> > >> add > >> > >> full > >> > >> support > >> for > >> the > >> type, > >> including > >> compute > >> kernels? > >> Or are > >> > >> they > >> > >> likely > >> > >> to > >> > >> just > >> > >> convert > >> this > >> type > >> to > >> ListArray > >> at > >> import > >> boundaries? > >> Because > >> if > >> it > >> turns > >> out > >> to > >> be > >> the > >> latter, > >> then > >> we > >> might > >> > >> as > >> > >> well > >> > >> ask > >> > >> Velox > >> > >> to > >> export > >> this > >> type > >> as > >> ListArray > >> and > >> save > >> the > >> rest > >> of the > >> > >> ecosystem > >> > >> some > >> > >> work. > >> Best, > >> Will > >> Jones > >> On > >> Thu, > >> May > >> 11, > >> 2023 > >> at > >> 12:32 PM > >> Felipe > >> Oliveira > >> > >> > >> Carvalho < > >> > >> felipe...@gmail.com<mailto:felipe...@gmail.com > >> <mailto:felipe...@gmail.com>>> > >> wrote: > >> > >> Initial > >> reason > >> for > >> ListView > >> arrays > >> in > >> Arrow > >> is > >> zero-copy > >> > >> > >> compatibility > >> > >> with > >> > >> Velox > >> which > >> uses > >> this > >> format. > >> Velox > >> kernels > >> benefit > >> from > >> being > >> able > >> to > >> append > >> data > >> to > >> > >> the > >> > >> array > >> > >> from > >> > >> different > >> threads > >> without > >> care > >> for > >> strict > >> ordering. > >> > >> > >> Only the > >> > >> offsets > >> > >> array > >> > >> has > >> to > >> be > >> written > >> according > >> to > >> logical > >> order > >> but > >> that > >> is > >> > >> potentially a > >> > >> much > >> > >> smaller > >> buffer > >> than > >> the > >> values > >> buffer. > >> Acero > >> kernels > >> could > >> take > >> advantage > >> of > >> that > >> in > >> the > >> > >> future. > >> > >> In > >> implementing > >> ListViewArray/Type > >> I > >> was > >> able > >> to > >> reuse > >> > >> > >> some > >> > >> C++ > >> > >> templates > >> > >> used > >> for > >> ListArray > >> which > >> can > >> reduce > >> some > >> of > >> the > >> burden > >> > >> > >> on > >> > >> kernel > >> > >> implementations > >> that > >> aim > >> to > >> work > >> with > >> all > >> the > >> types. > >> I’m > >> can > >> fix > >> Acero > >> kernels > >> for > >> working > >> with > >> ListView. > >> > >> > >> This is > >> > >> similar > >> > >> to > >> > >> the > >> > >> work > >> I’ve > >> doing > >> in > >> kernels > >> dealing > >> with > >> run-end > >> encoded > >> > >> > >> arrays. > >> > >> — > >> Felipe > >> On > >> Wed, > >> 26 > >> Apr > >> 2023 > >> at > >> 01:03 > >> Will > >> Jones > >> < > >> > >> will.jones...@gmail.com > >> > >> <mailto:will.jones...@gmail.com > >> <mailto:will.jones...@gmail.com>>> > >> wrote: > >> > >> I > >> suppose > >> one > >> common > >> use > >> case > >> is > >> materializing > >> list > >> > >> > >> columns > >> > >> after > >> > >> some > >> > >> expanding > >> operation > >> like > >> a > >> join > >> or > >> unnest. > >> That's > >> a > >> > >> > >> case > >> > >> where > >> > >> I > >> > >> could > >> > >> imagine > >> a > >> lot > >> of > >> repetition > >> of > >> values. > >> Haven't > >> yet > >> > >> > >> thought > >> > >> of > >> > >> common > >> > >> cases > >> > >> > >> where > >> there > >> is > >> overlap > >> but > >> not > >> full > >> duplication, > >> but > >> am > >> > >> > >> eager > >> > >> to > >> > >> hear > >> > >> any. > >> > >> > >> The > >> dictionary > >> encoding > >> point > >> Raphael > >> makes > >> is > >> > >> > >> interesting, > >> > >> especially > >> > >> given > >> the > >> existence > >> of > >> LargeList > >> and > >> FixedSizeList. > >> For > >> > >> > >> many > >> > >> operations, > >> > >> it > >> > >> might > >> make > >> more > >> sense > >> to > >> just > >> compose > >> those > >> existing > >> > >> > >> types. > >> > >> IIUC > >> the > >> operations > >> that > >> would > >> be > >> unique > >> to > >> the > >> > >> > >> ArrayView > >> > >> are > >> > >> ones > >> > >> altering > >> > >> > >> the > >> shape. > >> One > >> could > >> truncate > >> each > >> array > >> to > >> a > >> certain > >> > >> > >> length > >> > >> cheaply > >> > >> simply > >> > >> > >> by > >> replacing > >> the > >> sizes > >> buffer. > >> Or > >> perhaps > >> there > >> are > >> > >> > >> interesting > >> > >> operations > >> > >> > >> on > >> tensors > >> that > >> would > >> benefit. > >> On > >> Tue, > >> Apr > >> 25, > >> 2023 > >> at > >> 7:47 PM > >> Raphael > >> Taylor-Davies > >> <r.taylordav...@googlemail.com.invalid> > >> wrote: > >> > >> > >> Unless > >> I > >> am > >> missing > >> something, > >> I > >> think > >> the > >> selection > >> > >> > >> use-case > >> > >> could > >> > >> be > >> > >> equally > >> well > >> served > >> by > >> a > >> dictionary-encoded > >> > >> > >> BinarArray/ListArray, > >> > >> and > >> > >> would > >> > >> > >> have > >> the > >> benefit > >> of > >> not > >> requiring > >> any > >> modifications > >> > >> > >> to the > >> > >> existing > >> > >> format > >> > >> > >> or > >> kernels. > >> The > >> major > >> additional > >> flexibility > >> of > >> the > >> proposed > >> > >> > >> encoding > >> > >> would > >> > >> be > >> > >> permitting > >> disjoint > >> or > >> overlapping > >> ranges, > >> are > >> these > >> > >> > >> common > >> > >> enough > >> > >> in > >> > >> practice > >> to > >> represent > >> a > >> meaningful > >> bottleneck? > >> On > >> 26 > >> April > >> 2023 > >> 01:40:14 > >> BST, > >> David > >> Li > >> < > >> > >> > >> lidav...@apache.org > >> > >> <mailto: <mailto:> > >> > >> lidav...@apache.org>> > >> wrote: > >> > >> Is > >> there > >> a > >> need > >> for > >> a > >> 64-bit > >> offsets > >> version > >> the > >> > >> > >> same way > >> > >> we > >> > >> have > >> > >> List > >> > >> and > >> LargeList? > >> > >> > >> And > >> just > >> to > >> be > >> clear, > >> the > >> difference > >> with > >> List > >> is > >> > >> > >> that > >> > >> the > >> > >> lists > >> > >> don't > >> > >> have > >> to > >> be > >> stored > >> in > >> their > >> logical > >> order > >> (or > >> in > >> other > >> > >> > >> words, > >> > >> offsets > >> > >> do > >> > >> not > >> > >> > >> have > >> to > >> be > >> nondecreasing > >> and > >> so > >> we > >> also > >> need > >> sizes)? > >> > >> > >> On > >> Wed, > >> Apr > >> 26, > >> 2023, > >> at > >> 09:37, > >> Weston > >> Pace > >> wrote: > >> > >> > >> For > >> context, > >> there > >> was > >> some > >> discussion > >> on > >> this > >> back > >> > >> > >> in > >> > >> [1]. > >> > >> At > >> > >> that > >> > >> time > >> > >> > >> this > >> was > >> called > >> "sequence > >> view" > >> but > >> I > >> do > >> not > >> like > >> > >> > >> that > >> > >> name. > >> > >> However, > >> > >> > >> array-view > >> array > >> is > >> a > >> little > >> confusing. > >> Given > >> this > >> > >> > >> is > >> > >> similar > >> > >> to > >> > >> list > >> > >> > >> can > >> > >> > >> we > >> go > >> with > >> list-view > >> array? > >> > >> > >> Thanks > >> for > >> the > >> introduction. > >> I'd > >> be > >> interested > >> to > >> > >> > >> hear > >> > >> about > >> > >> the > >> > >> applications > >> Velox > >> has > >> found > >> for > >> these > >> vectors, > >> > >> > >> and in > >> > >> what > >> > >> situations > >> > >> > >> they > >> > >> > >> are > >> useful. > >> This > >> could > >> be > >> contrasted > >> with > >> the > >> > >> > >> current > >> > >> ListArray > >> > >> implementations. > >> > >> > >> I > >> believe > >> one > >> significant > >> benefit > >> is > >> that > >> take > >> (and > >> > >> > >> by > >> > >> proxy, > >> > >> filter) > >> > >> > >> and > >> > >> > >> sort > >> are > >> O(# > >> of > >> items) > >> with > >> the > >> proposed > >> format > >> and > >> > >> > >> O(# > >> > >> of > >> > >> bytes) > >> > >> with > >> > >> > >> the > >> > >> > >> current > >> format. > >> Jorge > >> did > >> some > >> profiling > >> to > >> this > >> > >> > >> effect > >> > >> in > >> > >> [1]. > >> > >> [1] > >> > >> > >> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq< > >> > >> > >> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq> > >> > >> > >> On > >> Tue, > >> Apr > >> 25, > >> 2023 > >> at > >> 3:13 PM > >> Will > >> Jones > >> < > >> > >> > >> will.jones...@gmail.com > >> > >> > >> <mailto:will.jones...@gmail.com > >> <mailto:will.jones...@gmail.com>> > >> > >> > >> wrote: > >> > >> > >> Hi > >> Felipe, > >> Thanks > >> for > >> the > >> introduction. > >> I'd > >> be > >> interested > >> to > >> > >> > >> hear > >> > >> about > >> > >> the > >> > >> applications > >> Velox > >> has > >> found > >> for > >> these > >> vectors, > >> > >> > >> and in > >> > >> what > >> > >> situations > >> > >> > >> they > >> > >> > >> are > >> useful. > >> This > >> could > >> be > >> contrasted > >> with > >> the > >> > >> > >> current > >> > >> ListArray > >> > >> implementations. > >> IIUC > >> it > >> would > >> be > >> fairly > >> cheap > >> to > >> transform > >> a > >> > >> > >> ListArray > >> > >> to > >> > >> an > >> > >> ArrayView, > >> but > >> > >> > >> expensive > >> to > >> go > >> the > >> other > >> way. > >> Best, > >> Will > >> Jones > >> On > >> Tue, > >> Apr > >> 25, > >> 2023 > >> at > >> 3:00 PM > >> Felipe > >> Oliveira > >> > >> > >> Carvalho > >> > >> < > >> > >> felipe...@gmail.com<mailto:felipe...@gmail.com > >> <mailto:felipe...@gmail.com>>> > >> > >> > >> wrote: > >> > >> Hi > >> folks, > >> I > >> would > >> like > >> to > >> start > >> a > >> public > >> discussion > >> on > >> the > >> > >> > >> inclusion > >> > >> of a > >> > >> new > >> > >> array > >> > >> > >> format > >> to > >> Arrow > >> — > >> array-view > >> array. > >> The > >> name > >> is > >> > >> > >> also > >> > >> up > >> > >> for > >> > >> debate. > >> > >> > >> This > >> format > >> is > >> inspired > >> by > >> Velox's > >> ArrayVector > >> > >> > >> format > >> > >> [1]. > >> > >> Logically, > >> > >> > >> this > >> > >> > >> array > >> represents > >> an > >> array > >> of > >> arrays. > >> Each > >> element > >> > >> > >> is > >> > >> an > >> > >> array-view > >> > >> > >> (offset > >> > >> > >> and > >> size > >> pair) > >> that > >> points > >> to > >> a > >> range > >> within > >> a > >> > >> > >> nested > >> > >> "values" > >> > >> array > >> > >> > >> (called > >> "elements" > >> in > >> Velox > >> docs). > >> The > >> nested > >> > >> > >> array > >> > >> can > >> > >> be > >> > >> of > >> > >> any > >> > >> type, > >> > >> > >> which > >> makes > >> this > >> format > >> very > >> flexible > >> and > >> > >> > >> powerful. > >> > >> [image: > >> ../_images/array-vector.png] > >> < > >> > >> > >> https://facebookincubator.github.io/velox/_images/array-vector.png > >> > >> > >> < > >> > >> https://facebookincubator.github.io/velox/_images/array-vector.png > >> > >> I'm > >> currently > >> working > >> on > >> a > >> C++ > >> implementation > >> and > >> > >> > >> plan > >> > >> to > >> > >> work > >> > >> on > >> a > >> > >> Go > >> > >> > >> implementation > >> to > >> fulfill > >> the > >> two-implementations > >> > >> > >> requirement > >> > >> for > >> > >> format > >> > >> > >> changes. > >> The > >> draft > >> design: > >> - > >> 3 > >> buffers: > >> [validity_bitmap, > >> int32 > >> offsets > >> > >> > >> buffer, > >> > >> int32 > >> > >> sizes > >> > >> buffer] > >> > >> > >> - > >> 1 > >> child > >> array: > >> "values" > >> as > >> an > >> array > >> of > >> the > >> type > >> > >> > >> parameter > >> > >> validity_bitmap > >> is > >> used > >> to > >> differentiate > >> between > >> > >> > >> empty > >> > >> array > >> > >> views > >> > >> > >> (sizes[i] > >> == > >> 0) > >> and > >> NULL > >> array > >> views > >> > >> > >> (validity_bitmap[i] > >> > >> == > >> > >> 0). > >> > >> When > >> the > >> validity_bitmap[i] > >> is > >> 0, > >> both > >> sizes > >> and > >> > >> > >> offsets > >> > >> are > >> > >> undefined > >> > >> > >> (as > >> > >> > >> usual), > >> and > >> when > >> sizes[i] > >> == > >> 0, > >> offsets[i] > >> is > >> > >> > >> undefined. 0 > >> > >> is > >> > >> recommended > >> > >> > >> if > >> setting > >> a > >> value > >> is > >> not > >> an > >> issue > >> to > >> the > >> system > >> > >> > >> producing > >> > >> the > >> > >> arrays. > >> > >> > >> offsets > >> buffer > >> is > >> not > >> required > >> to > >> be > >> ordered > >> and > >> > >> > >> views > >> > >> don't > >> > >> have > >> > >> to > >> > >> > >> be > >> > >> > >> disjoint. > >> [1] > >> > >> > >> > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector > >> > >> > >> < > >> > >> > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector > >> > >> > >> Thanks, > >> Felipe > >> O. > >> Carvalho > >> > >