Since this is now the second time that the proposal of a new type has
raised the "alternative layout" question, I'm going to start a new thread
about that.

Neal

On Wed, Jul 12, 2023 at 11:37 AM Pedro Eugenio Rocha Pedreira
<pedro...@meta.com.invalid> wrote:

> Hi all, this is Pedro from the Velox team at Meta. Chiming in here to add
> a bit more context from our side.
>
> > I'm not sure the problem here is a lack of understanding or maturity. In
> > fact, it would be much easier if this was just a problem of education but
> > it is not.
>
> Adding to what Weston said, when Velox started, the intent was to build it
> on top of vanilla Arrow: if not directly using the Arrow C++ library, then
> at least with an implementation conforming to the data layout. The fact
> that we decided to "extend" the format into what we internally call Velox
> Vectors was very much a deliberate decision. If Arrow had supported
> StringView, ListView (discussed recently in a separate thread, related to
> supporting out-of-order writes), and more encodings (specifically RLE and
> Constant, which can now be expressed through the new REE layout) at the
> time, Velox would have used it from the beginning. The rationale behind
> these extensions has been discussed here, but there is more context in our
> paper from last year [0] (see Section 4.2).
>
> > I can't speak for all query engines, but at least in the case of
> > DataFusion we exclusively use the Arrow format as the interchange format
> > between operators, including for UDFs. We have found that for most
> > operators operating directly on the Arrow format is sufficiently
> > performant to not represent a query bottleneck.
>
> This always comes down to the workloads and operations you are optimizing
> for. We found that these extensions were crucial for our workloads,
> particularly for efficient vectorized expression evaluation. Other modern
> engines like DuckDB and Umbra reached comparable findings and deviate from
> Arrow in similar ways.
>
> > Is Arrow meant to only be used in between systems (in this case query
> > engines) or is it also meant to be used in between components of a query
> > engine?
>
> Our findings were that if you really care about performance, Arrow today
> is (unfortunately) not sufficient to support the needs of a
> state-of-the-art execution engine. Raphael raised a good point above about
> Arrow potentially favoring interoperability over performance. In that case,
> I believe Arrow's usage would be restricted to communication between
> systems, which is about what we see in practice today.
>
> However, even at system boundaries this is becoming a hurdle. A recent
> example is a new project called Gluten [1], which integrates Velox into
> Spark (akin to Databricks' Photon). The JNI data communication between
> Velox and Java is done using the Arrow C ABI. We found that the
> incompatibilities that force non-zero-copy transfer (back to StringViews)
> show up as enough of a bottleneck that the team is considering using Velox
> Vectors as the exchange format to get around this limitation. We urged the
> Gluten team to stick with Arrow in the hope that we could work with the
> community to get these extensions incorporated into the standard. Another
> example is a bridge built inside Velox that converts between the Velox and
> DuckDB formats directly; we don't use Arrow there for similar reasons.
> There are other similar examples between PyTorch's pythonic ecosystem and
> Velox.
>
> > I therefore also wonder about the possibility of always having a single
> > backing buffer that stores the character data, including potentially a
> > copy of the prefix.
>
> There are quite a few use cases for this within Velox and other modern
> engines, but AFAIK they boil down to two reasons:
>
> 1. When you're creating string buffers, you usually don't know the size of
> each string beforehand. Multiple buffers allow us to allocate page by page
> as we go, then capture these buffers without having to reallocate and copy
> them. The simplest use case here is a vectorized function that generates a
> string as output.
> 2. It allows us to create StringView arrays without copying the underlying
> data. Use cases here are the receiving end of an exchange (shuffle), or a
> simple table scan. In the former, you can capture the pages you read
> directly from the socket without copying them (we internally use a format
> called IOBuf from folly, which is essentially a linked list of buffers).
> In the latter, the pages are usually held in a memory cache, so you can
> simply acquire pointers to these cached pages. All of that prevents
> unnecessary copies (a sketch of this zero-copy construction follows below).
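>
> To make the mechanics concrete, here is a minimal C++ sketch of how such a
> view could reference externally owned pages without copying. The 16-byte
> layout, field names, and MakeView helper are illustrative assumptions for
> this thread, not the exact Velox or Arrow definitions:
>
>     #include <cstdint>
>     #include <cstring>
>     #include <string_view>
>
>     // Assumed 16-byte view: strings of <= 12 bytes are stored inline;
>     // longer strings keep a 4-byte prefix plus the index of a shared
>     // character buffer (e.g. a captured IOBuf page) and an offset.
>     struct StringViewSketch {
>       uint32_t length;
>       union {
>         char inlined[12];
>         struct {
>           char prefix[4];
>           uint32_t buffer_index;  // which captured page holds the bytes
>           uint32_t offset;        // where in that page they start
>         } ref;
>       };
>     };
>
>     // Build a view over bytes that already live in page `buffer_index`
>     // at `offset`; no string data is copied for the long form.
>     StringViewSketch MakeView(std::string_view s, uint32_t buffer_index,
>                               uint32_t offset) {
>       StringViewSketch v;
>       std::memset(&v, 0, sizeof(v));
>       v.length = static_cast<uint32_t>(s.size());
>       if (s.size() <= sizeof(v.inlined)) {
>         std::memcpy(v.inlined, s.data(), s.size());
>       } else {
>         std::memcpy(v.ref.prefix, s.data(), sizeof(v.ref.prefix));
>         v.ref.buffer_index = buffer_index;
>         v.ref.offset = offset;
>       }
>       return v;
>     }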
>
>
> [0] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> [1] - https://github.com/oap-project/gluten
>
> Best,
>
> --
> Pedro Pedreira
> ________________________________
> From: Weston Pace <weston.p...@gmail.com>
> Sent: Tuesday, July 11, 2023 8:42 AM
> To: dev@arrow.apache.org <dev@arrow.apache.org>
> Subject: Re: [DISCUSS][Format] Draft implementation of string view array
> format
>
> > I definitely hope that with time Arrow will penetrate deeper into these
> > engines, perhaps in a similar manner to DataFusion, as opposed to
> > primarily existing at the surface-level.
>
> I'm not sure the problem here is a lack of understanding or maturity. In
> fact, it would be much easier if this were just a problem of education, but
> it is not. Velox is already using Arrow formats (internally, not just at
> the boundaries) throughout its system. The same is true of DuckDB. Both
> teams are very familiar with Arrow. They have simply decided that the
> current string layout is not the right format to use for string data in a
> query engine.
>
> I don't think there is a single best layout for all situations.  I don't
> think we want to be in the game of declaring which layout is the correct
> layout for query engines.  I suspect there will be cases where the
> appropriate layout is highly dependent on the workload.
>
> > I just wonder if inclusion in the primary
> > standard is really the right place for them. Perhaps some extension
> > mechanism might be the way to go here, potentially with some negotiation
> > mechanism, I'm not really sure
>
> > I agree 100% that this sort of interoperability is what makes Arrow so
> > compelling and something we should work very hard to preserve. This is
> > the crux of my concern with standardising alternative layouts
>
> I did have a proposal for alternative layouts when we were discussing the
> array view layout. I'll repeat it here for completeness:
>
>  * There are one or more primary layouts
>    * Existing layouts are automatically considered primary layouts, even
>      if they wouldn't have been primary layouts initially (e.g. large list)
>  * A new layout, if it is semantically equivalent to another, is
>    considered an alternative layout
>  * An alternative layout still has the same requirements for adoption
>    (two implementations and a vote)
>    * An implementation should not feel pressured to rush and implement the
>      new layout. It would be good if they contribute to the discussion,
>      consider the layout, and vote if they feel it would be an acceptable
>      design.
>  * We can define, vote on, and approve as many canonical alternative
>    layouts as we want:
>    * A canonical alternative layout should, at a minimum, have some
>      reasonable justification, such as improved performance for algorithm X
>  * Arrow implementations MUST support the primary layouts
>  * An Arrow implementation MAY support a canonical alternative, however:
>    * An Arrow implementation MUST first support the primary layout
>    * An Arrow implementation MUST support conversion to/from the primary
>      and canonical layouts
>    * An Arrow implementation's APIs MUST only provide data in the
>      alternative layout if it is explicitly asked for (e.g. schema
>      inference should prefer the primary layout).
>  * We can still vote for new primary layouts (e.g. promoting a canonical
>    alternative) but, in these votes, we don't only consider the value
>    (e.g. performance) of the layout but also the interoperability. In
>    other words, a layout can only become a primary layout if there is
>    significant evidence that most implementations plan to adopt it.
>
> On Mon, Jul 10, 2023 at 9:49 AM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
>
> > > For example, if someone (datafusion, velox, etc.) were to come up with
> > > a framework for UDFs then would batches be passed in and out of those
> > > UDFs in the Arrow format?
> >
> > Yes, I think the arrow format is a perfect fit for this.
> >
> > > Is Arrow meant to only be used in between systems (in this case query
> > > engines) or is it also meant to be used in between components of a
> > > query engine?
> >
> > I can't speak for all query engines, but at least in the case of
> > DataFusion we exclusively use the Arrow format as the interchange format
> > between operators, including for UDFs. We have found that for most
> > operators, operating directly on the Arrow format is sufficiently
> > performant not to represent a query bottleneck. For others, such as
> > joins, sorts and aggregates, we do make use of bespoke data structures
> > and formats internally, e.g. hash tables, row formats, etc., but the
> > operators' public APIs are still in terms of arrow RecordBatch. We have
> > found this approach to perform very well, whilst also providing very
> > good modularity and composability.
> >
> > In fact, we are currently in the process of migrating the aggregation
> > logic away from a bespoke mutable row representation to the Arrow model,
> > and are already seeing significant performance improvements, not to
> > mention a significant reduction in code complexity and improved
> > composability [1].
> >
> > > If every engine has its own bespoke formats internally then it seems
> > > we are placing a limit on how far things can be decomposed.
> >
> > Agreed, if engines choose to implement operations on bespoke formats,
> > these operations will likely not be as interoperable as those
> > implemented using Arrow. To what extent an engine favours its own
> > format(s) over Arrow will be an engineering trade-off it will have to
> > make, but DataFusion has found exclusively using Arrow as the
> > interchange format between operators to work well.
> >
> > > There are now multiple implementations of a query engine and I think
> > > we are seeing just the edges of this query engine decomposition (e.g.
> > > using arrow-c++'s datasets to feed DuckDb or consuming a velox task as
> > > a record batch stream into a different system) and these sorts of
> > > challenges are in the forefront.
> >
> > I agree 100% that this sort of interoperability is what makes Arrow so
> > compelling and something we should work very hard to preserve. This is
> > the crux of my concern with standardising alternative layouts. I
> > definitely hope that with time Arrow will penetrate deeper into these
> > engines, perhaps in a similar manner to DataFusion, as opposed to
> > primarily existing at the surface-level.
> >
> > [1]: https://github.com/apache/arrow-datafusion/pull/6800
> >
> > On 10/07/2023 11:38, Weston Pace wrote:
> > >> The point I was trying to make, albeit very badly, was that these
> > >> operations are typically implemented using some sort of row format
> > >> [1] [2], and therefore their performance is not impacted by the array
> > >> representations. I think it is both inevitable, and in fact something
> > >> to be encouraged, that query engines will implement their own
> > >> in-memory layouts and data structures outside of the arrow
> > >> specification for specific operators, workloads, hardware, etc...
> > >> This allows them to make trade-offs based on their specific
> > >> application domain, whilst also ensuring that new ideas and
> > >> approaches can continue to be incorporated and adopted in the broader
> > >> ecosystem. However, to then seek to standardise these layouts seems
> > >> to be both potentially unbounded scope creep, and also somewhat
> > >> counter productive if the goal of standardisation is improved
> > >> interoperability?
> > > FWIW, I believe these formats are very friendly for row representation
> > > as well, especially when stored as a payload (e.g. in a join).
> > >
> > > For your more general point though I will ask the same question I
> > > asked on the ArrayView discussion:
> > >
> > > Is Arrow meant to only be used in between systems (in this case query
> > > engines) or is it also meant to be used in between components of a
> > > query engine?
> > >
> > > For example, if someone (datafusion, velox, etc.) were to come up with
> > > a framework for UDFs then would batches be passed in and out of those
> > > UDFs in the Arrow format?  If every engine has its own bespoke formats
> > > internally then it seems we are placing a limit on how far things can
> > > be decomposed.  From the C++ perspective, I would personally like to
> > > see Arrow be usable within components.  There are now multiple
> > > implementations of a query engine and I think we are seeing just the
> > > edges of this query engine decomposition (e.g. using arrow-c++'s
> > > datasets to feed DuckDb or consuming a velox task as a record batch
> > > stream into a different system) and these sorts of challenges are in
> > > the forefront.
> > >
> > > On Fri, Jul 7, 2023 at 7:53 AM Raphael Taylor-Davies
> > > <r.taylordav...@googlemail.com.invalid> wrote:
> > >
> > >>> Thus the approach you describe for validating an entire character
> > >>> buffer as UTF-8 then checking offsets will be just as valid for
> > >>> Utf8View arrays as for Utf8 arrays.
> > >>
> > >> The difference here is that it is perhaps expected for Utf8View to
> > >> have gaps in the underlying data that are not referenced as part of
> > >> any value, as I had understood this to be one of its benefits over
> > >> the current encoding. I think it would therefore be problematic to
> > >> enforce these gaps be UTF-8.
> > >>
> > >>> Furthermore unlike an explicit selection vector a kernel may decide
> > >>> to copy and densify dynamically if it detects that output is getting
> > >>> sparse or fragmented
> > >>
> > >> I don't see why you couldn't do something similar to materialize a
> > >> sparse selection vector, if anything being able to centralise this
> > >> logic outside specific kernels would be advantageous.
> > >>
> > >>> Specifically sorting and equality comparison benefit significantly
> > >>> from the prefix comparison fast path, so I'd anticipate that multi
> > >>> column sorting and aggregations would as well
> > >>
> > >> The point I was trying to make, albeit very badly, was that these
> > >> operations are typically implemented using some sort of row format
> > >> [1] [2], and therefore their performance is not impacted by the array
> > >> representations. I think it is both inevitable, and in fact something
> > >> to be encouraged, that query engines will implement their own
> > >> in-memory layouts and data structures outside of the arrow
> > >> specification for specific operators, workloads, hardware, etc...
> > >> This allows them to make trade-offs based on their specific
> > >> application domain, whilst also ensuring that new ideas and
> > >> approaches can continue to be incorporated and adopted in the broader
> > >> ecosystem. However, to then seek to standardise these layouts seems
> > >> to be both potentially unbounded scope creep, and also somewhat
> > >> counter productive if the goal of standardisation is improved
> > >> interoperability? I fully expect in the next 5 years someone will
> > >> come up with an even better way to encode strings for some particular
> > >> workload or hardware, do we then incorporate that as well?
> > >>
> > >> I guess it boils down to what matters to people more, interoperability
> > >> or best-in-class performance? Currently I think it is fair to say both
> > >> arrow and parquet favour interoperability over performance, aiming to
> > >> provide good enough performance broadly on the same order of magnitude
> > >> as a custom solution. I personally think this is the right engineering
> > >> trade-off, but appreciate opinions may differ. Ultimately I just
> > >> really want arrow to avoid the situation parquet has found itself in,
> > >> where the specification has both far outstripped the ability of the
> > >> implementations to keep pace, whilst simultaneously having
> > >> standardised approaches for things like delta encoding that are now
> > >> considered extremely sub-optimal for modern hardware.
> > >>
> > >> That all being said I'm not against adding support for these arrays
> > >> if others are already onboard, I just wonder if inclusion in the
> > >> primary standard is really the right place for them. Perhaps some
> > >> extension mechanism might be the way to go here, potentially with
> > >> some negotiation mechanism, I'm not really sure... I will continue to
> > >> think on this
> > >>
> > >> Kind Regards,
> > >>
> > >> Raphael
> > >>
> > >> [1]: https://duckdb.org/2021/08/27/external-sorting.html#binary-string-comparison
> > >> [2]: https://docs.rs/arrow-row/latest/arrow_row/
> > >>
> > >> On 06/07/2023 17:47, Benjamin Kietzman wrote:
> > >>> @Andrew:
> > >>>
> > >>> Restricting these arrays to a single buffer will severely decrease
> > >>> their utility. Since the character data is stored in multiple
> > >>> character buffers, writing a Utf8View array can proceed without
> > >>> resizing allocations, which is a major overhead when writing Utf8
> > >>> arrays (see the sketch below). Furthermore, since the character
> > >>> buffers have no restrictions on their size, it's straightforward to
> > >>> reuse an existing buffer as a character buffer rather than always
> > >>> allocating a new one. In the case of creating an array which shares
> > >>> a lot of data with another (for example, appending some strings) we
> > >>> can reuse most of the character buffers from the original. Finally,
> > >>> Utf8View is well adapted for efficiently wrapping non-arrow string
> > >>> data for ingestion by a kernel, even if the string data's full
> > >>> extent is not known ahead of time and is spread across multiple
> > >>> non-contiguous buffers.
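> > >>>
> > >>> To illustrate the no-resize property, here is a rough C++ sketch
> > >>> (not the Arrow C++ builder API; the class, chunk size, and names are
> > >>> assumptions): when the current character buffer fills up, the
> > >>> builder simply starts a new one, so previously written bytes are
> > >>> never reallocated or copied.
> > >>>
> > >>>     #include <cstddef>
> > >>>     #include <cstring>
> > >>>     #include <memory>
> > >>>     #include <utility>
> > >>>     #include <vector>
> > >>>
> > >>>     // Appends string bytes into a list of fixed-size chunks and
> > >>>     // returns (buffer_index, offset) for building a view.
> > >>>     class ViewBuilderSketch {
> > >>>      public:
> > >>>       std::pair<std::size_t, std::size_t> Append(const char* s,
> > >>>                                                  std::size_t n) {
> > >>>         // Assumes n <= kChunkSize for brevity.
> > >>>         if (buffers_.empty() || used_ + n > kChunkSize) {
> > >>>           buffers_.push_back(std::make_unique<char[]>(kChunkSize));
> > >>>           used_ = 0;  // older chunks stay frozen in place
> > >>>         }
> > >>>         std::memcpy(buffers_.back().get() + used_, s, n);
> > >>>         std::size_t offset = used_;
> > >>>         used_ += n;
> > >>>         return {buffers_.size() - 1, offset};
> > >>>       }
> > >>>
> > >>>      private:
> > >>>       static constexpr std::size_t kChunkSize = 32 * 1024;
> > >>>       std::vector<std::unique_ptr<char[]>> buffers_;
> > >>>       std::size_t used_ = 0;
> > >>>     };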
> > >>>
> > >>> @Raphael:
> > >>>
> > >>>> branch on access
> > >>> The branch-on-access is unavoidable since a primary feature of the
> > >>> Utf8View format is keeping short strings inline in the fixed width
> > >>> portion of data. It's worth noting that the inline prefix allows
> > >>> skipping the branch entirely for common cases of comparison, for
> > >>> example when the strings to be compared differ within the first 4
> > >>> bytes.
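> > >>>
> > >>> As a rough sketch of that fast path (assuming a 16-byte view layout
> > >>> where bytes 0-3 hold the length and bytes 4-7 hold the first four
> > >>> characters, inline or as a prefix; this layout is an assumption for
> > >>> illustration):
> > >>>
> > >>>     #include <cstdint>
> > >>>     #include <cstring>
> > >>>
> > >>>     // Returns true when two strings are provably unequal from the
> > >>>     // fixed-width portion alone: unequal length or unequal first
> > >>>     // four bytes means no data buffer is ever dereferenced, so no
> > >>>     // branch on storage location is taken.
> > >>>     bool PrefixFastPathDiffers(const uint8_t* view_a,
> > >>>                                const uint8_t* view_b) {
> > >>>       uint64_t head_a, head_b;
> > >>>       std::memcpy(&head_a, view_a, sizeof(head_a));
> > >>>       std::memcpy(&head_b, view_b, sizeof(head_b));
> > >>>       return head_a != head_b;
> > >>>     }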
> > >>>
> > >>> In benchmarking (for example while building a hash table) I have not
> > >>> observed that this branch overly pessimizes access. Although I can't
> > >>> guarantee every Utf8View array will be more efficient than any Utf8
> > >>> array, it is certainly faster for many relevant cases. Specifically,
> > >>> sorting and equality comparison benefit significantly from the
> > >>> prefix comparison fast path, so I'd anticipate that multi column
> > >>> sorting and aggregations would as well. If there are any other
> > >>> benchmarks which would help to justify Utf8View in your mind, I'd be
> > >>> happy to try writing them.
> > >>>
> > >>>> UTF-8 validation for StringArray can be done very efficiently by
> > >>>> first verifying the entire buffer, and then verifying the offsets
> > >>>> correspond to the start of a UTF-8 codepoint
> > >>>
> > >>> For non-inlined strings, the character buffers do always contain the
> > >>> entire string's data and not just the last `len - 4` bytes. Thus the
> > >>> approach you describe for validating an entire character buffer as
> > >>> UTF-8 then checking offsets will be just as valid for Utf8View
> > >>> arrays as for Utf8 arrays.
> > >>>
> > >>>> it does seem inconsistent to use unsigned types
> > >>>
> > >>> It is indeed more typical for the arrow format to use signed integers
> > >>> for offsets and other quantities. In this case there is prior art in
> > >>> other engines with which we can remain compatible by using unsigned
> > >>> integers instead. Since this is only a break with convention within
> > >>> the format and shouldn't be difficult for any implementation to
> > >>> accommodate, I would argue that it's worthwhile to avoid pushing
> > >>> change onto existing implementers.
> > >>>
> > >>>> I presume that StringView will behave similarly to dictionaries in
> > >>>> that the selection kernels will not recompute the underlying value
> > >>>> buffers.
> > >>>
> > >>> The Utf8View format itself is not prescriptive of selection
> > >>> operations on the array; kernels are free to reuse character buffers
> > >>> (which produces an implicit selection vector) or to recompute them.
> > >>> Furthermore, unlike an explicit selection vector, a kernel may
> > >>> decide to copy and densify dynamically if it detects that output is
> > >>> getting sparse or fragmented. It's also worth noting that unlike an
> > >>> explicit selection vector a Utf8View array (however sparse or
> > >>> fragmented) will still benefit from the prefix comparison fast path.
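> > >>>
> > >>> One plausible densification trigger (purely illustrative, not
> > >>> mandated by the format) compares the bytes the views actually
> > >>> reference against the bytes the character buffers retain:
> > >>>
> > >>>     #include <cstdint>
> > >>>
> > >>>     // Densify (copy the referenced strings into one compact buffer)
> > >>>     // once most retained bytes are dead; the 25% threshold here is
> > >>>     // an arbitrary tuning assumption.
> > >>>     bool ShouldDensify(int64_t referenced_bytes, int64_t buffer_bytes) {
> > >>>       return buffer_bytes > 0 && referenced_bytes * 4 < buffer_bytes;
> > >>>     }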
> > >>>
> > >>> Sincerely,
> > >>> Ben Kietzman
> > >>>
> > >>> On Sun, Jul 2, 2023 at 8:01 AM Raphael Taylor-Davies
> > >>> <r.taylordav...@googlemail.com.invalid>  wrote:
> > >>>
> > >>>>> I would be interested in hearing some input from the Rust
> > >>>>> community.
> > >>>>
> > >>>> A couple of thoughts:
> > >>>>
> > >>>> The variable number of buffers would definitely pose some challenges
> > >>>> for the Rust implementation, the closest thing we currently have is
> > >>>> possibly UnionArray, but even then the number of buffers is still
> > >>>> determined statically by the DataType. I therefore also wonder
> > >>>> about the possibility of always having a single backing buffer that
> > >>>> stores the character data, including potentially a copy of the
> > >>>> prefix. This would also avoid forcing a branch on access, which I
> > >>>> would have expected to hurt performance for some kernels quite
> > >>>> significantly.
> > >>>>
> > >>>> Whilst not really a concern for Rust, which supports unsigned
> > >>>> types, it does seem inconsistent to use unsigned types where the
> > >>>> rest of the format encourages the use of signed offsets, etc...
> > >>>>
> > >>>> It isn't clearly specified whether a null should have a valid set
> > >>>> of offsets, etc... I think it is an important property of the
> > >>>> current array layouts that, with exception to dictionaries, the
> > >>>> data in null slots is arbitrary, i.e. can take any value, but not
> > >>>> undefined. This allows for separate handling of the null mask and
> > >>>> values, which can be important for some kernels and APIs.
> > >>>>
> > >>>> More an observation than an issue, but UTF-8 validation for
> > >>>> StringArray can be done very efficiently by first verifying the
> > >>>> entire buffer, and then verifying the offsets correspond to the
> > >>>> start of a UTF-8 codepoint. This same approach would not be
> > >>>> possible for StringView, which would need to verify individual
> > >>>> values and would therefore be significantly more expensive. As it
> > >>>> is UB for a Rust string to contain non-UTF-8 data, this validation
> > >>>> is perhaps more important for Rust than for other languages.
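> > >>>>
> > >>>> For reference, the two-step check described above amounts to
> > >>>> something like the following C++ sketch (IsValidUtf8 is an assumed
> > >>>> stand-in for any whole-buffer validator):
> > >>>>
> > >>>>     #include <cstdint>
> > >>>>     #include <vector>
> > >>>>
> > >>>>     bool IsValidUtf8(const uint8_t* data, int64_t len);  // assumed
> > >>>>
> > >>>>     bool ValidateStringArray(const uint8_t* data, int64_t data_len,
> > >>>>                              const std::vector<int32_t>& offsets) {
> > >>>>       // Step 1: one linear pass over the whole character buffer.
> > >>>>       if (!IsValidUtf8(data, data_len)) return false;
> > >>>>       // Step 2: each offset must land on a codepoint boundary; a
> > >>>>       // UTF-8 continuation byte matches the bit pattern 10xxxxxx.
> > >>>>       for (int32_t off : offsets) {
> > >>>>         if (off < data_len && (data[off] & 0xC0) == 0x80) {
> > >>>>           return false;
> > >>>>         }
> > >>>>       }
> > >>>>       return true;
> > >>>>     }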
> > >>>>
> > >>>> I presume that StringView will behave similarly to dictionaries in
> > >>>> that the selection kernels will not recompute the underlying value
> > >>>> buffers. I think this is fine, but it is perhaps worth noting this
> > >>>> has caused confusion in the past, as people somewhat reasonably
> > >>>> expect an array post-selection to have memory usage reflecting the
> > >>>> smaller selection. This is then especially noticeable if the data
> > >>>> is written out to IPC, and still contains data that was supposedly
> > >>>> filtered out. My 2 cents is that explicit selection vectors are a
> > >>>> less surprising way to defer selection than baking it into the
> > >>>> array, but I also don't have any workloads where this is the major
> > >>>> bottleneck so can't speak authoritatively here.
> > >>>>
> > >>>> Which leads on to my major concern with this proposal, that it adds
> > >>>> complexity and cognitive load to the specification and
> > >>>> implementations, whilst not meaningfully improving the performance
> > >>>> of the operators that I commonly encounter as performance
> > >>>> bottlenecks, which are multi-column sorts and aggregations, or the
> > >>>> expensive string operations such as matching or parsing. If we
> > >>>> didn't already have a string representation I would be more
> > >>>> onboard, but as it stands I'm definitely on the fence, especially
> > >>>> given selection performance can be improved in less intrusive ways
> > >>>> using dictionaries or selection vectors.
> > >>>>
> > >>>> Kind Regards,
> > >>>>
> > >>>> Raphael Taylor-Davies
> > >>>>
> > >>>> On 02/07/2023 11:46, Andrew Lamb wrote:
> > >>>>
> > >>>>> * This is the first layout where the number of buffers depends on
> > >>>>> the data and not the schema. I think this is the most
> > >>>>> architecturally significant fact.
> > >>>>
> > >>>> I have spent some time reading the initial proposal -- thank you
> > >>>> for that. I now understand what Weston was saying about the
> > >>>> "variable numbers of buffers". I wonder if you considered
> > >>>> restricting such arrays to a single buffer (so as to make them more
> > >>>> similar to other arrow array types that have a fixed number of
> > >>>> buffers)?
> > >>>>
> > >>>> On Tue, Jun 20, 2023 at 11:33 AM Weston Pace <weston.p...@gmail.com> wrote:
> > >>>>
> > >>>> Before I say anything else I'll say that I am in favor of this new
> > >>>> layout. There is some existing literature on the idea (e.g. Umbra)
> > >>>> and your benchmarks show some nice improvements. Compared to some
> > >>>> of the other layouts we've discussed recently (REE, list view) I do
> > >>>> think this layout is more unique and fundamentally different.
> > >>>> Perhaps most fundamentally different:
> > >>>>
> > >>>> * This is the first layout where the number of buffers depends on
> > >>>> the data and not the schema. I think this is the most
> > >>>> architecturally significant fact. It does require a (backwards
> > >>>> compatible) change to the IPC format itself, beyond just adding new
> > >>>> type codes. It also poses challenges in places where we've assumed
> > >>>> there will be at most 3 buffers (e.g. in ArraySpan, though, as you
> > >>>> have shown, we can work around this using a raw pointers
> > >>>> representation internally in those spots).
> > >>>>
> > >>>> I think you've done some great work to integrate this well with
> > >>>> Arrow-C++ and I'm convinced it can work. I would be interested in
> > >>>> hearing some input from the Rust community.
> > >>>>
> > >>>> Ben, at one point there was some discussion that this might be a
> > >>>> c-data only type. However, I believe that was based on the raw
> > >>>> pointers representation. What you've proposed here, if I understand
> > >>>> correctly, is an index + offsets representation and it is suitable
> > >>>> for IPC, correct? (e.g. I see that you have changes and examples in
> > >>>> the IPC reader/writer)
> > >>>>
> > >>>> On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman <bengil...@gmail.com> wrote:
> > >>>>
> > >>>> Hi Gang,
> > >>>>
> > >>>> I'm not sure what you mean, sorry if my answers are off base:
> > >>>> Parquet's ByteArray will be unaffected by the addition of the
> > >>>> string view type; all arrow strings (arrow::Type::STRING,
> > >>>> arrow::Type::LARGE_STRING, and with this patch
> > >>>> arrow::Type::STRING_VIEW) are converted to ByteArrays during
> > >>>> serialization to parquet [1].
> > >>>>
> > >>>> If you mean that encoding of arrow::Type::STRING_VIEW will not be
> > >>>> as fast as encoding of equivalent arrow::Type::STRING, that's
> > >>>> something I haven't benchmarked so I can't answer definitively. I
> > >>>> would expect it to be faster than first converting
> > >>>> STRING_VIEW->STRING then encoding to parquet; direct encoding
> > >>>> avoids allocating and populating temporary buffers. Of course this
> > >>>> only applies to cases where you need to encode an array of
> > >>>> STRING_VIEW to parquet - encoding of STRING to parquet will be
> > >>>> unaffected.
> > >>>>
> > >>>> Sincerely,
> > >>>> Ben
> > >>>>
> > >>>> [1] https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179
> > >>>>
> > >>>> On Thu, Jun 15, 2023 at 10:34 PM Gang Wu <ust...@gmail.com> wrote:
> > >>>> Hi Ben,
> > >>>>
> > >>>> The posted benchmark [1] looks pretty good to me. However, I want
> > >>>> to raise a possible issue from the perspective of parquet-cpp.
> > >>>> Parquet-cpp uses a customized parquet::ByteArray type [2] for
> > >>>> string/binary, so I would expect some regression of conversions
> > >>>> between the parquet reader/writer and the proposed string view
> > >>>> array, especially when some strings use the short form and others
> > >>>> use the long form.
> > >>>>
> > >>>> [1] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> > >>>> [2] https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
> > >>>>
> > >>>> Best,
> > >>>> Gang
> > >>>>
> > >>>> On Fri, Jun 16, 2023 at 3:58 AM Will Jones <will.jones...@gmail.com> wrote:
> > >>>>
> > >>>> Cool. Thanks for doing that!
> > >>>>
> > >>>> On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman <bengil...@gmail.com> wrote:
> > >>>>
> > >>>> I've added https://github.com/apache/arrow/issues/36112 to track
> > >>>> deduplication of buffers on write. I don't think it would require
> > >>>> modification of the IPC format.
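> > >>>>
> > >>>> (A minimal sketch of what such write-side deduplication could look
> > >>>> like, assuming the writer keys buffers by their data pointer; this
> > >>>> is illustrative, not the actual Arrow C++ IPC writer API:)
> > >>>>
> > >>>>     #include <cstdint>
> > >>>>     #include <unordered_map>
> > >>>>
> > >>>>     struct BufferRef { int64_t offset; int64_t length; };
> > >>>>
> > >>>>     class DedupBodyWriter {
> > >>>>      public:
> > >>>>       // Returns the body location of `data`, appending it only the
> > >>>>       // first time it is seen; repeated buffers reuse the same
> > >>>>       // (offset, length) entry in the flatbuffers metadata.
> > >>>>       BufferRef Write(const uint8_t* data, int64_t length) {
> > >>>>         auto it = written_.find(data);
> > >>>>         if (it != written_.end()) return it->second;
> > >>>>         BufferRef ref{next_offset_, length};
> > >>>>         // ... append `data` to the message body stream here ...
> > >>>>         next_offset_ += length;
> > >>>>         written_.emplace(data, ref);
> > >>>>         return ref;
> > >>>>       }
> > >>>>
> > >>>>      private:
> > >>>>       std::unordered_map<const uint8_t*, BufferRef> written_;
> > >>>>       int64_t next_offset_ = 0;
> > >>>>     };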
> > >>>>
> > >>>> Ben
> > >>>>
> > >>>> On Thu, Jun 15, 2023 at 1:30 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> > >>>>
> > >>>> Based on my understanding, in theory a buffer *could* be shared
> > >>>> within a batch since the flatbuffers message just uses an offset
> > >>>> and length to identify the buffers. That said, I don't believe any
> > >>>> current implementation actually does this or takes advantage of
> > >>>> this in any meaningful way.
> > >>>>
> > >>>> --Matt
> > >>>>
> > >>>> On Thu, Jun 15, 2023 at 1:00 PM Will Jones <will.jones...@gmail.com> wrote:
> > >>>>
> > >>>> Hi Ben,
> > >>>>
> > >>>> It's exciting to see this move along.
> > >>>>
> > >>>>> The buffers will be duplicated. If buffer duplication becomes a
> > >>>>> concern, I'd prefer to handle that in the IPC writer. Then buffers
> > >>>>> which are duplicated could be detected by checking pointer
> > >>>>> identity and written only once.
> > >>>>
> > >>>> Question: to be able to write a buffer only once and reference it
> > >>>> in multiple arrays, does that require a change to the IPC format?
> > >>>> Or is sharing buffers within the same batch already allowed in the
> > >>>> IPC format?
> > >>>>
> > >>>> Best,
> > >>>> Will Jones
> > >>>>
> > >>>> On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <bengil...@gmail.com> wrote:
> > >>>>
> > >>>> Hello again all,
> > >>>>
> > >>>> The PR [1] to add string view to the format and the C++
> > >>>> implementation is hovering around passing CI and has been
> > >>>> undrafted. Furthermore, there is now also a PR [2] to add string
> > >>>> view to the Go implementation. Code review is underway for each PR
> > >>>> and I'd like to move toward a vote for acceptance - are there any
> > >>>> other preliminaries which I've neglected?
> > >>>>
> > >>>> To reiterate the answers to some past questions:
> > >>>> - Benchmarks are added in the C++ PR [1] to demonstrate the
> > >>>>   performance of conversion between the various string formats. In
> > >>>>   addition, there are some benchmarks which demonstrate the
> > >>>>   performance gains available with the new format [3].
> > >>>> - Adding string view to the C ABI is a natural follow up, but
> > >>>>   should be handled independently. An issue has been added to track
> > >>>>   that enhancement [4].
> > >>>>
> > >>>> Sincerely,
> > >>>> Ben Kietzman
> > >>>>
> > >>>> [1] https://github.com/apache/arrow/pull/35628
> > >>>> [2] https://github.com/apache/arrow/pull/35769
> > >>>> [3] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> > >>>> [4] https://github.com/apache/arrow/issues/36099
> > >>>>
> > >>>> On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <bengil...@gmail.com> wrote:
> > >>>>
> > >>>> @Jacob
> > >>>>
> > >>>>> You mention benchmarks multiple times, are these results published
> > >>>>> somewhere?
> > >>>>
> > >>>> I benchmarked the performance of raw pointer vs index offset views
> > >>>> in my PR to velox. I do intend to port them to my arrow PR but I
> > >>>> haven't gotten there yet. Furthermore, it seemed less urgent to me
> > >>>> since coexistence of the two types in the c++ implementation defers
> > >>>> the question of how aggressively one should be preferred over the
> > >>>> other.
> > >>>>
> > >>>> @Dewey
> > >>>>
> > >>>>> I don't see the C Data interface in the PR
> > >>>>
> > >>>> I have not addressed the C ABI in this PR. As you mention, it may
> > >>>> be useful to transmit arrays with raw pointer views between
> > >>>> implementations which allow them. I can address this in a follow up
> > >>>> PR.
> > >>>>
> > >>>> @Will
> > >>>>
> > >>>>> If I understand correctly, multiple arrays can reference the same
> > >>>>> buffers in memory, but once they are written to IPC their data
> >
>
