Since this is now the second time that the proposal of a new type has raised the "alternative layout" question, I'm going to start a new thread about that.
Neal

On Wed, Jul 12, 2023 at 11:37 AM Pedro Eugenio Rocha Pedreira <pedro...@meta.com.invalid> wrote:

> Hi all, this is Pedro from the Velox team at Meta. Chiming in here to add a bit more context from our side.
>
> > I'm not sure the problem here is a lack of understanding or maturity. In fact, it would be much easier if this was just a problem of education but it is not.
>
> Adding to what Weston said, when Velox started the intent was to have it built on top of vanilla Arrow, if not directly using the Arrow C++ library, then at least having an implementation conforming to the data layout. The fact that we decided to "extend" the format into what we internally call Velox Vectors was very much a deliberate decision. If Arrow had at the time supported StringView, ListView (discussed recently in a separate thread, related to supporting out-of-order writes), and more encodings (specifically RLE and Constant, which can now be done through the new REE), Velox would have used it from the beginning. The rationale behind these extensions has been discussed here, but there is more context in our paper from last year [0] (check Section 4.2).
>
> > I can't speak for all query engines, but at least in the case of DataFusion we exclusively use the Arrow format as the interchange format between operators, including for UDFs. We have found that for most operators operating directly on the Arrow format is sufficiently performant to not represent a query bottleneck.
>
> This always comes down to the workloads and operations you are optimizing for. We found that these extensions were crucial for our workloads, particularly for efficient vectorized expression evaluation. Other modern engines like DuckDB and Umbra had comparable findings and deviate from Arrow in similar ways.
>
> > Is Arrow meant to only be used in between systems (in this case query engines) or is it also meant to be used in between components of a query engine?
>
> Our finding was that if you really care about performance, Arrow today is (unfortunately) not sufficient to support the needs of a state-of-the-art execution engine. Raphael raised a good point above about Arrow potentially favoring interoperability over performance. In that case, I believe Arrow's usage would be restricted to communication between systems, which is about what we see in practice today.
>
> However, even at system boundaries this is becoming a hurdle. A recent example is a new project called Gluten [1], which integrates Velox into Spark (akin to Databricks' Photon). The JNI data communication between Velox and Java is done using the Arrow C ABI. We found that the incompatibilities that result in non-zero-copy transfer (back to StringViews) show up as enough of a bottleneck that the team is considering using Velox Vectors as the exchange format to get around this limitation. We argued with the Gluten team to stick with Arrow, in the hope that we would work with the community to get these extensions incorporated into the standard. Another example is a bridge built inside Velox that converts directly between the Velox and DuckDB formats; we don't use Arrow there for similar reasons. There are other similar examples between PyTorch's Python ecosystem and Velox.
>
> > I therefore also wonder about the possibility of always having a single backing buffer that stores the character data, including potentially a copy of the prefix.
> There are quite a few use cases for this within Velox and other modern engines, but AFAIK they boil down to two reasons:
>
> 1. When you're creating string buffers you usually don't know the size of each string beforehand. This allows us to allocate the buffer page-by-page as we go, then capture these buffers without having to reallocate and copy them. The simplest use case here is a vectorized function that generates a string as output.
> 2. It allows us to create StringView arrays without copying the underlying data. Use cases here are the receiving end of an exchange (shuffle), or a simple table scan. In the former, you can capture the pages you read directly from the socket without copying them (we internally use a format called IOBuf from folly, which is essentially a linked list of buffers). In the latter, the pages are usually held in a memory cache, so you can simply acquire pointers to them. All of this is to prevent unnecessary copies.
>
> [0] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> [1] - https://github.com/oap-project/gluten
>
> Best,
>
> --
> Pedro Pedreira
> ________________________________
> From: Weston Pace <weston.p...@gmail.com>
> Sent: Tuesday, July 11, 2023 8:42 AM
> To: dev@arrow.apache.org <dev@arrow.apache.org>
> Subject: Re: [DISCUSS][Format] Draft implementation of string view array format
>
> > I definitely hope that with time Arrow will penetrate deeper into these engines, perhaps in a similar manner to DataFusion, as opposed to primarily existing at the surface-level.
>
> I'm not sure the problem here is a lack of understanding or maturity. In fact, it would be much easier if this was just a problem of education but it is not. Velox is already using Arrow formats (internally, not just at the boundaries) throughout their system. Same with DuckDB. Both teams are very familiar with Arrow. They have just made the decision that the string format is not the correct format to use in a query engine for string data.
>
> I don't think there is a single best layout for all situations. I don't think we want to be in the game of declaring which layout is the correct layout for query engines. I suspect there will be cases where the appropriate layout is highly dependent on the workload.
>
> > I just wonder if inclusion in the primary standard is really the right place for them. Perhaps some extension mechanism might be the way to go here, potentially with some negotiation mechanism, I'm not really sure
>
> > I agree 100% that this sort of interoperability is what makes Arrow so compelling and something we should work very hard to preserve. This is the crux of my concern with standardising alternative layouts
>
> I did have a proposal for alternative layouts when we were discussing the array view layout.
> I'll repeat it here for completeness:
>
> * There are one or more primary layouts.
> * Existing layouts are automatically considered primary layouts, even if they wouldn't have been primary layouts initially (e.g. large list).
> * A new layout, if it is semantically equivalent to another, is considered an alternative layout.
> * An alternative layout still has the same requirements for adoption (two implementations and a vote).
> * An implementation should not feel pressured to rush and implement the new layout. It would be good if they contribute to the discussion, consider the layout, and vote if they feel it would be an acceptable design.
> * We can define, vote on, and approve as many canonical alternative layouts as we want:
>   * A canonical alternative layout should, at a minimum, have some reasonable justification, such as improved performance for algorithm X.
> * Arrow implementations MUST support the primary layouts.
> * An Arrow implementation MAY support a canonical alternative, however:
>   * An Arrow implementation MUST first support the primary layout.
>   * An Arrow implementation MUST support conversion to/from the primary and canonical layouts.
>   * An Arrow implementation's APIs MUST only provide data in the alternative layout if it is explicitly asked for (e.g. schema inference should prefer the primary layout).
> * We can still vote for new primary layouts (e.g. promoting a canonical alternative) but, in these votes, we don't only consider the value (e.g. performance) of the layout but also the interoperability. In other words, a layout can only become a primary layout if there is significant evidence that most implementations plan to adopt it.
>
> On Mon, Jul 10, 2023 at 9:49 AM Raphael Taylor-Davies <r.taylordav...@googlemail.com.invalid> wrote:
>
> > > For example, if someone (datafusion, velox, etc.) were to come up with a framework for UDFs then would batches be passed in and out of those UDFs in the Arrow format?
> >
> > Yes, I think the arrow format is a perfect fit for this.
> >
> > > Is Arrow meant to only be used in between systems (in this case query engines) or is it also meant to be used in between components of a query engine?
> >
> > I can't speak for all query engines, but at least in the case of DataFusion we exclusively use the Arrow format as the interchange format between operators, including for UDFs. We have found that for most operators operating directly on the Arrow format is sufficiently performant to not represent a query bottleneck. For others, such as joins, sorts and aggregates, we do make use of bespoke data structures and formats internally, e.g. hash tables, row formats, etc., but the operator's public APIs are still in terms of arrow RecordBatch. We have found this approach to perform very well, whilst also providing very good modularity and composability.
> >
> > In fact we are currently in the process of migrating the aggregation logic away from a bespoke mutable row representation to the Arrow model, and are already seeing significant performance improvements, not to mention a significant reduction in code complexity and improved composability [1].
> >
> > > If every engine has its own bespoke formats internally then it seems we are placing a limit on how far things can be decomposed.
> > Agreed, if engines choose to implement operations on bespoke formats, these operations will likely not be as interoperable as those implemented using Arrow. To what extent an engine favours their own format(s) over Arrow will be an engineering trade-off they will have to make, but DataFusion has found exclusively using Arrow as the interchange format between operators to work well.
> >
> > > There are now multiple implementations of a query engine and I think we are seeing just the edges of this query engine decomposition (e.g. using arrow-c++'s datasets to feed DuckDB or consuming a velox task as a record batch stream into a different system) and these sorts of challenges are in the forefront.
> >
> > I agree 100% that this sort of interoperability is what makes Arrow so compelling and something we should work very hard to preserve. This is the crux of my concern with standardising alternative layouts. I definitely hope that with time Arrow will penetrate deeper into these engines, perhaps in a similar manner to DataFusion, as opposed to primarily existing at the surface-level.
> >
> > [1]: https://github.com/apache/arrow-datafusion/pull/6800
> >
> > On 10/07/2023 11:38, Weston Pace wrote:
> >
> > > > The point I was trying to make, albeit very badly, was that these operations are typically implemented using some sort of row format [1] [2], and therefore their performance is not impacted by the array representations. I think it is both inevitable, and in fact something to be encouraged, that query engines will implement their own in-memory layouts and data structures outside of the arrow specification for specific operators, workloads, hardware, etc. This allows them to make trade-offs based on their specific application domain, whilst also ensuring that new ideas and approaches can continue to be incorporated and adopted in the broader ecosystem. However, to then seek to standardise these layouts seems to be both potentially unbounded scope creep, and also somewhat counter productive if the goal of standardisation is improved interoperability?
> > >
> > > FWIW, I believe these formats are very friendly for row representations as well, especially when stored as a payload (e.g. in a join).
> > >
> > > For your more general point though, I will ask the same question I asked on the ArrayView discussion:
> > >
> > > Is Arrow meant to only be used in between systems (in this case query engines) or is it also meant to be used in between components of a query engine?
> > >
> > > For example, if someone (datafusion, velox, etc.) were to come up with a framework for UDFs then would batches be passed in and out of those UDFs in the Arrow format? If every engine has its own bespoke formats internally then it seems we are placing a limit on how far things can be decomposed. From the C++ perspective, I would personally like to see Arrow be usable within components. There are now multiple implementations of a query engine and I think we are seeing just the edges of this query engine decomposition (e.g. using arrow-c++'s datasets to feed DuckDB or consuming a velox task as a record batch stream into a different system) and these sorts of challenges are in the forefront.
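A minimal sketch of the RecordBatch-boundary pattern described above, in Arrow C++. The function and the column transformation are hypothetical; `arrow::RecordBatch`, `arrow::compute::CallFunction`, and the stock "utf8_upper" kernel are real:

```cpp
#include <memory>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Hypothetical UDF: upper-cases one string column and returns a new batch.
// The boundary is plain arrow::RecordBatch, so any engine that speaks Arrow
// can call it regardless of the layouts it uses internally.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> UpperCaseColumn(
    const std::shared_ptr<arrow::RecordBatch>& batch, int column_index) {
  // "utf8_upper" is a stock Arrow compute kernel.
  ARROW_ASSIGN_OR_RAISE(
      arrow::Datum upper,
      arrow::compute::CallFunction("utf8_upper",
                                   {batch->column(column_index)}));
  std::vector<std::shared_ptr<arrow::Array>> columns = batch->columns();
  columns[column_index] = upper.make_array();
  // The schema is unchanged: utf8_upper maps string -> string.
  return arrow::RecordBatch::Make(batch->schema(), batch->num_rows(),
                                  std::move(columns));
}
```

The same shape works for an operator boundary: internal state can be anything, so long as the batches crossing the boundary are Arrow.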
> > > On Fri, Jul 7, 2023 at 7:53 AM Raphael Taylor-Davies <r.taylordav...@googlemail.com.invalid> wrote:
> > >
> > > > > Thus the approach you describe for validating an entire character buffer as UTF-8 then checking offsets will be just as valid for Utf8View arrays as for Utf8 arrays.
> > > >
> > > > The difference here is that it is perhaps expected for Utf8View to have gaps in the underlying data that are not referenced as part of any value, as I had understood this to be one of its benefits over the current encoding. I think it would therefore be problematic to enforce that these gaps be UTF-8.
> > > >
> > > > > Furthermore unlike an explicit selection vector a kernel may decide to copy and densify dynamically if it detects that output is getting sparse or fragmented
> > > >
> > > > I don't see why you couldn't do something similar to materialize a sparse selection vector; if anything, being able to centralise this logic outside specific kernels would be advantageous.
> > > >
> > > > > Specifically sorting and equality comparison benefit significantly from the prefix comparison fast path, so I'd anticipate that multi column sorting and aggregations would as well
> > > >
> > > > The point I was trying to make, albeit very badly, was that these operations are typically implemented using some sort of row format [1] [2], and therefore their performance is not impacted by the array representations. I think it is both inevitable, and in fact something to be encouraged, that query engines will implement their own in-memory layouts and data structures outside of the arrow specification for specific operators, workloads, hardware, etc. This allows them to make trade-offs based on their specific application domain, whilst also ensuring that new ideas and approaches can continue to be incorporated and adopted in the broader ecosystem. However, to then seek to standardise these layouts seems to be both potentially unbounded scope creep, and also somewhat counter productive if the goal of standardisation is improved interoperability? I fully expect in the next 5 years someone will come up with an even better way to encode strings for some particular workload or hardware; do we then incorporate that as well?
> > > >
> > > > I guess it boils down to what matters to people more, interoperability or best-in-class performance? Currently I think it is fair to say both arrow and parquet favour interoperability over performance, aiming to provide good enough performance broadly on the same order of magnitude as a custom solution. I personally think this is the right engineering trade-off, but appreciate opinions may differ. Ultimately I just really want arrow to avoid the situation parquet has found itself in, where the specification has both far outstripped the ability of the implementations to keep pace, whilst simultaneously having standardised approaches for things like delta encoding that are now considered extremely sub-optimal for modern hardware.
> > > >
> > > > That all being said, I'm not against adding support for these arrays if others are already onboard, I just wonder if inclusion in the primary standard is really the right place for them. Perhaps some extension mechanism might be the way to go here, potentially with some negotiation mechanism, I'm not really sure... I will continue to think on this.
> > > >
> > > > Kind Regards,
> > > >
> > > > Raphael
> > > >
> > > > [1]: https://duckdb.org/2021/08/27/external-sorting.html#binary-string-comparison
> > > > [2]: https://docs.rs/arrow-row/latest/arrow_row/
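The row formats cited in [1] and [2] work by encoding multi-column sort keys into bytes whose memcmp order matches the logical order, which is why the array layout matters less for these operators. A toy sketch of that trick (hypothetical code, not either library's actual encoding):

```cpp
#include <cstdint>
#include <string>

// Toy normalized-key encoding: an (int32, string) sort key is flattened
// into bytes such that byte-wise comparison equals the desired row order.
// Real row formats (DuckDB's, arrow-row) also handle nulls, floats, and
// variable-length data carefully; this only shows the core idea.
std::string EncodeKey(int32_t a, const std::string& b) {
  std::string out;
  // Flip the sign bit so two's-complement ints compare correctly as
  // unsigned bytes, then store big-endian.
  uint32_t ua = static_cast<uint32_t>(a) ^ 0x80000000u;
  for (int shift = 24; shift >= 0; shift -= 8) {
    out.push_back(static_cast<char>((ua >> shift) & 0xFF));
  }
  out += b;
  // Terminate with 0x00 so "ab" sorts before "abc" (assumes the string
  // itself contains no NUL bytes).
  out.push_back('\0');
  return out;
}

// Multi-column sorting now reduces to sorting the encoded keys, e.g.:
//   std::sort(keys.begin(), keys.end());
```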
> > > > On 06/07/2023 17:47, Benjamin Kietzman wrote:
> > > >
> > > > > @Andrew:
> > > > >
> > > > > Restricting these arrays to a single buffer will severely decrease their utility. Since the character data is stored in multiple character buffers, writing a Utf8View array can proceed without resizing allocations, which is a major overhead when writing Utf8 arrays. Furthermore, since the character buffers have no restrictions on their size, it's straightforward to reuse an existing buffer as a character buffer rather than always allocating a new one. In the case of creating an array which shares a lot of data with another (for example, appending some strings) we can reuse most of the character buffers from the original. Finally, Utf8View is well adapted for efficiently wrapping non-arrow string data for ingestion by a kernel, even if the string data's full extent is not known ahead of time and is spread across multiple non-contiguous buffers.
> > > > >
> > > > > @Raphael:
> > > > >
> > > > > > branch on access
> > > > >
> > > > > The branch-on-access is unavoidable, since a primary feature of the Utf8View format is keeping short strings inline in the fixed width portion of the data. It's worth noting that the inline prefix allows skipping the branch entirely for common cases of comparison, for example when the strings to be compared differ within the first 4 bytes.
> > > > >
> > > > > In benchmarking (for example while building a hash table) I have not observed that this branch overly pessimizes access. Although I can't guarantee every Utf8View array will be more efficient than any Utf8 array, it is certainly faster for many relevant cases. Specifically, sorting and equality comparison benefit significantly from the prefix comparison fast path, so I'd anticipate that multi column sorting and aggregations would as well. If there are any other benchmarks which would help to justify Utf8View in your mind, I'd be happy to try writing them.
> > > > >
> > > > > > UTF-8 validation for StringArray can be done very efficiently by first verifying the entire buffer, and then verifying the offsets correspond to the start of a UTF-8 codepoint
> > > > >
> > > > > For non-inlined strings, the character buffers do always contain the entire string's data and not just the last `len - 4` bytes. Thus the approach you describe for validating an entire character buffer as UTF-8 then checking offsets will be just as valid for Utf8View arrays as for Utf8 arrays.
> > > > >
> > > > > > it does seem inconsistent to use unsigned types
> > > > >
> > > > > It is indeed more typical for the arrow format to use signed integers for offsets and other quantities. In this case there is prior art in other engines with which we can remain compatible by using unsigned integers instead.
> > > > > Since this is only a break with convention within the format and shouldn't be difficult for any implementation to accommodate, I would argue that it's worthwhile to avoid pushing change onto existing implementers.
> > > > >
> > > > > > I presume that StringView will behave similarly to dictionaries in that the selection kernels will not recompute the underlying value buffers.
> > > > >
> > > > > The Utf8View format itself is not prescriptive of selection operations on the array; kernels are free to reuse character buffers (which produces an implicit selection vector) or to recompute them. Furthermore, unlike an explicit selection vector, a kernel may decide to copy and densify dynamically if it detects that output is getting sparse or fragmented. It's also worth noting that, unlike an explicit selection vector, a Utf8View array (however sparse or fragmented) will still benefit from the prefix comparison fast path.
> > > > >
> > > > > Sincerely,
> > > > > Ben Kietzman
> > > > >
> > > > > On Sun, Jul 2, 2023 at 8:01 AM Raphael Taylor-Davies <r.taylordav...@googlemail.com.invalid> wrote:
> > > > >
> > > > > > > I would be interested in hearing some input from the Rust community.
> > > > > >
> > > > > > A couple of thoughts:
> > > > > >
> > > > > > The variable number of buffers would definitely pose some challenges for the Rust implementation; the closest thing we currently have is possibly UnionArray, but even then the number of buffers is still determined statically by the DataType. I therefore also wonder about the possibility of always having a single backing buffer that stores the character data, including potentially a copy of the prefix. This would also avoid forcing a branch on access, which I would have expected to hurt performance for some kernels quite significantly.
> > > > > >
> > > > > > Whilst not really a concern for Rust, which supports unsigned types, it does seem inconsistent to use unsigned types where the rest of the format encourages the use of signed offsets, etc.
> > > > > >
> > > > > > It isn't clearly specified whether a null should have a valid set of offsets, etc. I think it is an important property of the current array layouts that, with the exception of dictionaries, the data in null slots is arbitrary, i.e. can take any value, but not undefined. This allows for separate handling of the null mask and values, which can be important for some kernels and APIs.
> > > > > >
> > > > > > More an observation than an issue, but UTF-8 validation for StringArray can be done very efficiently by first verifying the entire buffer, and then verifying the offsets correspond to the start of a UTF-8 codepoint. This same approach would not be possible for StringView, which would need to verify individual values and would therefore be significantly more expensive. As it is UB for a Rust string to contain non-UTF-8 data, this validation is perhaps more important for Rust than for other languages.
> > > > > >
> > > > > > I presume that StringView will behave similarly to dictionaries in that the selection kernels will not recompute the underlying value buffers.
> > > > > > I think this is fine, but it is perhaps worth noting this has caused confusion in the past, as people somewhat reasonably expect an array post-selection to have memory usage reflecting the smaller selection. This is then especially noticeable if the data is written out to IPC, and still contains data that was supposedly filtered out. My 2 cents is that explicit selection vectors are a less surprising way to defer selection than baking it into the array, but I also don't have any workloads where this is the major bottleneck so can't speak authoritatively here.
> > > > > >
> > > > > > Which leads on to my major concern with this proposal: it adds complexity and cognitive load to the specification and implementations, whilst not meaningfully improving the performance of the operators that I commonly encounter as performance bottlenecks, which are multi-column sorts and aggregations, or the expensive string operations such as matching or parsing. If we didn't already have a string representation I would be more onboard, but as it stands I'm definitely on the fence, especially given selection performance can be improved in less intrusive ways using dictionaries or selection vectors.
> > > > > >
> > > > > > Kind Regards,
> > > > > >
> > > > > > Raphael Taylor-Davies
> > > > > >
> > > > > > On 02/07/2023 11:46, Andrew Lamb wrote:
> > > > > >
> > > > > > > * This is the first layout where the number of buffers depends on the data and not the schema. I think this is the most architecturally significant fact.
> > > > > >
> > > > > > I have spent some time reading the initial proposal -- thank you for that. I now understand what Weston was saying about the "variable numbers of buffers". I wonder if you considered restricting such arrays to a single buffer (so as to make them more similar to other arrow array types that have a fixed number of buffers)?
> > > > > >
> > > > > > On Tue, Jun 20, 2023 at 11:33 AM Weston Pace <weston.p...@gmail.com> wrote:
> > > > > >
> > > > > > Before I say anything else I'll say that I am in favor of this new layout. There is some existing literature on the idea (e.g. Umbra) and your benchmarks show some nice improvements. Compared to some of the other layouts we've discussed recently (REE, list view) I do think this layout is more unique and fundamentally different. Perhaps most fundamentally different:
> > > > > >
> > > > > > * This is the first layout where the number of buffers depends on the data and not the schema. I think this is the most architecturally significant fact. It does require a (backwards compatible) change to the IPC format itself, beyond just adding new type codes. It also poses challenges in places where we've assumed there will be at most 3 buffers (e.g. in ArraySpan, though, as you have shown, we can work around this using a raw pointers representation internally in those spots).
> > > > > >
> > > > > > I think you've done some great work to integrate this well with Arrow-C++ and I'm convinced it can work. I would be interested in hearing some input from the Rust community.
> > > > > >
> > > > > > Ben, at one point there was some discussion that this might be a c-data only type. However, I believe that was based on the raw pointers representation. What you've proposed here, if I understand correctly, is an index + offsets representation and it is suitable for IPC, correct? (e.g. I see that you have changes and examples in the IPC reader/writer)
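For reference, the 16-byte view at the heart of this layout (as in the Utf8View proposal and the Umbra paper) looks roughly like the following; the field names are illustrative, not the normative spec:

```cpp
#include <cstdint>
#include <cstring>

// Non-normative sketch of the 16-byte Utf8View "view". Strings of 12 bytes
// or fewer live entirely inline; longer strings keep a 4-byte prefix plus
// a (buffer_index, offset) pair into one of the array's character buffers.
union StringView {
  struct {
    uint32_t length;  // <= 12
    char data[12];    // whole string, zero-padded
  } inlined;
  struct {
    uint32_t length;        // > 12
    char prefix[4];         // first 4 bytes, duplicated for fast compares
    uint32_t buffer_index;  // which character buffer holds the data
    uint32_t offset;        // where the string starts in that buffer
  } ref;
};

// The prefix fast path: if lengths or first-4-byte prefixes differ, the
// strings differ and no character buffer is touched. Bytes 4..8 hold the
// prefix in both union arms, and inline data is zero-padded, so this also
// holds for strings shorter than 4 bytes.
inline bool DefinitelyNotEqual(const StringView& a, const StringView& b) {
  if (a.inlined.length != b.inlined.length) return true;
  uint32_t pa, pb;
  std::memcpy(&pa, a.inlined.data, 4);
  std::memcpy(&pb, b.inlined.data, 4);
  return pa != pb;
}
```

The variable-length character buffers referenced by `buffer_index` are what make the buffer count data-dependent, which is the architectural point raised above.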
> > > > > > On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman <bengil...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Gang,
> > > > > >
> > > > > > I'm not sure what you mean, sorry if my answers are off base: Parquet's ByteArray will be unaffected by the addition of the string view type; all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays during serialization to parquet [1]. If you mean that encoding of arrow::Type::STRING_VIEW will not be as fast as encoding of equivalent arrow::Type::STRING, that's something I haven't benchmarked so I can't answer definitively. I would expect it to be faster than first converting STRING_VIEW->STRING then encoding to parquet; direct encoding avoids allocating and populating temporary buffers. Of course this only applies to cases where you need to encode an array of STRING_VIEW to parquet - encoding of STRING to parquet will be unaffected.
> > > > > >
> > > > > > Sincerely,
> > > > > > Ben
> > > > > >
> > > > > > [1] https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179
> > > > > >
> > > > > > On Thu, Jun 15, 2023 at 10:34 PM Gang Wu <ust...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Ben,
> > > > > >
> > > > > > The posted benchmark [1] looks pretty good to me. However, I want to raise a possible issue from the perspective of parquet-cpp. Parquet-cpp uses a customized parquet::ByteArray type [2] for string/binary; I would expect some regression of conversions between the parquet reader/writer and the proposed string view array, especially when some strings use the short form and others use the long form.
> > > > > >
> > > > > > [1] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> > > > > > [2] https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
> > > > > >
> > > > > > Best,
> > > > > > Gang
> > > > > >
> > > > > > On Fri, Jun 16, 2023 at 3:58 AM Will Jones <will.jones...@gmail.com> wrote:
> > > > > >
> > > > > > Cool. Thanks for doing that!
> > > > > >
> > > > > > On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman <bengil...@gmail.com> wrote:
> > > > > >
> > > > > > I've added https://github.com/apache/arrow/issues/36112 to track deduplication of buffers on write. I don't think it would require modification of the IPC format.
> > > > > >
> > > > > > Ben
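A sketch of the pointer-identity deduplication idea tracked in that issue (a hypothetical illustration of the approach, not the actual Arrow IPC writer):

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

#include <arrow/buffer.h>

// Hypothetical writer-side layout pass: buffers are keyed by their data
// pointer, so a buffer referenced by several arrays in one batch is
// scheduled for writing once and later references reuse its location.
struct BufferRef {
  int64_t offset;  // position within the IPC message body
  int64_t length;
};

class DedupLayout {
 public:
  BufferRef Add(const std::shared_ptr<arrow::Buffer>& buffer) {
    const uint8_t* key = buffer->data();
    auto it = seen_.find(key);
    if (it != seen_.end()) return it->second;  // already scheduled
    BufferRef ref{next_offset_, buffer->size()};
    // IPC bodies pad buffers; round up to 64 bytes here for illustration.
    next_offset_ += (buffer->size() + 63) / 64 * 64;
    seen_.emplace(key, ref);
    to_write_.push_back(buffer);
    return ref;
  }

  const std::vector<std::shared_ptr<arrow::Buffer>>& scheduled() const {
    return to_write_;
  }

 private:
  std::unordered_map<const uint8_t*, BufferRef> seen_;
  std::vector<std::shared_ptr<arrow::Buffer>> to_write_;
  int64_t next_offset_ = 0;
};
```

As noted below, the flatbuffers metadata already identifies buffers by offset and length, so emitting the same location twice would require no format change.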
> > > > > > On Thu, Jun 15, 2023 at 1:30 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> > > > > >
> > > > > > Based on my understanding, in theory a buffer *could* be shared within a batch since the flatbuffers message just uses an offset and length to identify the buffers. That said, I don't believe any current implementation actually does this or takes advantage of this in any meaningful way.
> > > > > >
> > > > > > --Matt
> > > > > >
> > > > > > On Thu, Jun 15, 2023 at 1:00 PM Will Jones <will.jones...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Ben,
> > > > > >
> > > > > > It's exciting to see this move along.
> > > > > >
> > > > > > > The buffers will be duplicated. If buffer duplication becomes a concern, I'd prefer to handle that in the ipc writer. Then buffers which are duplicated could be detected by checking pointer identity and written only once.
> > > > > >
> > > > > > Question: to be able to write a buffer only once and reference it in multiple arrays, does that require a change to the IPC format? Or is sharing buffers within the same batch already allowed in the IPC format?
> > > > > >
> > > > > > Best,
> > > > > > Will Jones
> > > > > >
> > > > > > On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <bengil...@gmail.com> wrote:
> > > > > >
> > > > > > Hello again all,
> > > > > >
> > > > > > The PR [1] to add string view to the format and the C++ implementation is hovering around passing CI and has been undrafted. Furthermore, there is now also a PR [2] to add string view to the Go implementation. Code review is underway for each PR and I'd like to move toward a vote for acceptance - are there any other preliminaries which I've neglected?
> > > > > >
> > > > > > To reiterate the answers to some past questions:
> > > > > > - Benchmarks are added in the C++ PR [1] to demonstrate the performance of conversion between the various string formats. In addition, there are some benchmarks which demonstrate the performance gains available with the new format [3].
> > > > > > - Adding string view to the C ABI is a natural follow up, but should be handled independently. An issue has been added to track that enhancement [4].
> > > > > > Sincerely,
> > > > > > Ben Kietzman
> > > > > >
> > > > > > [1] https://github.com/apache/arrow/pull/35628
> > > > > > [2] https://github.com/apache/arrow/pull/35769
> > > > > > [3] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> > > > > > [4] https://github.com/apache/arrow/issues/36099
> > > > > >
> > > > > > On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <bengil...@gmail.com> wrote:
> > > > > >
> > > > > > @Jacob
> > > > > >
> > > > > > > You mention benchmarks multiple times, are these results published somewhere?
> > > > > >
> > > > > > I benchmarked the performance of raw pointer vs index offset views in my PR to velox; I do intend to port them to my arrow PR but I haven't gotten there yet. Furthermore, it seemed less urgent to me since coexistence of the two types in the c++ implementation defers the question of how aggressively one should be preferred over the other.
> > > > > >
> > > > > > @Dewey
> > > > > >
> > > > > > > I don't see the C Data interface in the PR
> > > > > >
> > > > > > I have not addressed the C ABI in this PR. As you mention, it may be useful to transmit arrays with raw pointer views between implementations which allow them. I can address this in a follow up PR.
> > > > > >
> > > > > > @Will
> > > > > >
> > > > > > > If I understand correctly, multiple arrays can reference the same buffers in memory, but once they are written to IPC their data