Based on my understanding, in theory a buffer *could* be shared within a batch since the flatbuffers message just uses an offset and length to identify the buffers.
That said, I don't believe any current implementation actually does this or takes advantage of this in any meaningful way. --Matt On Thu, Jun 15, 2023 at 1:00 PM Will Jones <will.jones...@gmail.com> wrote: > Hi Ben, > > It's exciting to see this move along. > > The buffers will be duplicated. If buffer duplication is becomes a concern, > > I'd prefer to handle > > that in the ipc writer. Then buffers which are duplicated could be > detected > > by checking > > pointer identity and written only once. > > > Question: to be able to write buffer only once and reference in multiple > arrays, does that require a change to the IPC format? Or is sharing buffers > within the same batch already allowed in the IPC format? > > Best, > > Will Jones > > On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <bengil...@gmail.com> > wrote: > > > Hello again all, > > > > The PR [1] to add string view to the format and the C++ implementation is > > hovering around passing CI and has been undrafted. Furthermore, there is > > now also a PR [2] to add string view to the Go implementation. Code > review > > is underway for each PR and I'd like to move toward a vote for > acceptance- > > are there any other preliminaries which I've neglected? > > > > To reiterate the answers to some past questions: > > - Benchmarks are added in the C++ PR [1] to demonstrate the performance > of > > conversion between the various string formats. In addition, there are > > some benchmarks which demonstrate the performance gains available with > > the new format [3]. > > - Adding string view to the C ABI is a natural follow up, but should be > > handled independently. An issue has been added to track that > > enhancement [4]. > > > > Sincerely, > > Ben Kietzman > > > > [1] https://github.com/apache/arrow/pull/35628 > > [2] https://github.com/apache/arrow/pull/35769 > > [3] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 > > [4] https://github.com/apache/arrow/issues/36099 > > > > On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <bengil...@gmail.com> > > wrote: > > > > > @Jacob > > > > You mention benchmarks multiple times, are these results published > > > somewhere? > > > > > > I benchmarked the performance of raw pointer vs index offset views in > my > > > PR to velox, > > > I do intend to port them to my arrow PR but I haven't gotten there yet. > > > Furthermore, it > > > seemed less urgent to me since coexistence of the two types in the c++ > > > implementation > > > defers the question of how aggressively one should be preferred over > the > > > other. > > > > > > @Dewey > > > > I don't see the C Data interface in the PR > > > > > > I have not addressed the C ABI in this PR. As you mention, it may be > > > useful to transmit > > > arrays with raw pointer views between implementations which allow > them. I > > > can address > > > this in a follow up PR. > > > > > > @Will > > > > If I understand correctly, multiple arrays can reference the same > > buffers > > > > in memory, but once they are written to IPC their data buffers will > be > > > > duplicated. Is that right? > > > The buffers will be duplicated. If buffer duplication is becomes a > > > concern, I'd prefer to handle > > > that in the ipc writer. Then buffers which are duplicated could be > > > detected by checking > > > pointer identity and written only once. > > > > > > > > > On Wed, May 17, 2023 at 12:07 AM Will Jones <will.jones...@gmail.com> > > > wrote: > > > > > >> Hello Ben, > > >> > > >> Thanks for your work on this. I think this will be an excellent > addition > > >> to > > >> the format. > > >> > > >> If I understand correctly, multiple arrays can reference the same > > buffers > > >> in memory, but once they are written to IPC their data buffers will be > > >> duplicated. Is that right? > > >> > > >> Dictionary types have a special message so they can be reused across > > >> batches and even fields. Did we consider adding a similar message for > > >> string view buffers? > > >> > > >> One relevant use case I'm curious about is substring extraction. For > > >> example, if I have a column of URIs and I create columns where I've > > >> extracted substrings like the hostname, path, and a list column of > query > > >> parameters, I'd like for those latter columns to be views into the URI > > >> buffers, rather than full copies. > > >> > > >> However, I've never touched the IPC read code paths, so it's quite > > >> possible > > >> I'm overlooking something obvious. > > >> > > >> Best, > > >> > > >> Will Jones > > >> > > >> > > >> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington > > >> <de...@voltrondata.com.invalid> wrote: > > >> > > >> > Very cool! > > >> > > > >> > In addition to performance mentioned above, I could see this being > > >> > useful for the R bindings - we already have a global string pool > and a > > >> > mechanism for keeping a vector of them alive. > > >> > > > >> > I don't see the C Data interface in the PR although I may have > missed > > >> > it - is that a part of the proposal? It seems like it would be > > >> > possible to use raw pointers as long as they can be guaranteed to be > > >> > valid until the release callback is called? > > >> > > > >> > On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak > > >> > <ja...@voltrondata.com.invalid> wrote: > > >> > > > > >> > > Hello Everyone, > > >> > > I think keeping interoperability with the large ecosystem is a > very > > >> > > important goal for arrow so I am overall in favor of this > proposal! > > >> > > > > >> > > You mention benchmarks multiple times, are these results published > > >> > > somewhere? > > >> > > > > >> > > Thanks > > >> > > > > >> > > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman < > > >> bengil...@gmail.com> > > >> > > wrote: > > >> > > > > >> > > > Hello all, > > >> > > > > > >> > > > As previously discussed on this list [1], an > UmbraDB/DuckDB/Velox > > >> > > > compatible > > >> > > > "string view" type could bring several performance benefits to > > >> access > > >> > and > > >> > > > authoring of string data in the arrow format [2]. Additionally > > >> better > > >> > > > interoperability with engines already using this format could be > > >> > > > established. > > >> > > > > > >> > > > PR #0 [3] adds Utf8View and BinaryView types to the C++ > > >> implementation > > >> > and > > >> > > > to > > >> > > > the IPC format. For the purposes of IPC raw pointers are not > used. > > >> > Instead, > > >> > > > each view contains a pair of 32 bit unsigned integers which > encode > > >> the > > >> > > > index of > > >> > > > a character buffer (string view arrays may consist of a variable > > >> > number of > > >> > > > such buffers) and the offset of a view's data within that buffer > > >> > > > respectively. > > >> > > > Benefits of this substitution include: > > >> > > > - This makes explicit the guarantee that lifetime of all > character > > >> > data is > > >> > > > equal > > >> > > > to that of the array which views it, which is critical for > > >> confident > > >> > > > consumption across an interface boundary. > > >> > > > - As with other types in the arrow format, such arrays are > > >> > serializable and > > >> > > > venue agnostic; directly usable in shared memory without > > >> > modification. > > >> > > > - Indices and offsets are easily validated. > > >> > > > > > >> > > > Accessing the data requires some trivial pointer arithmetic, but > > in > > >> > > > benchmarking > > >> > > > this had negligible impact on sequential access and only minor > > >> impact > > >> > on > > >> > > > random > > >> > > > access. > > >> > > > > > >> > > > In the C++ implementation, raw pointer string views are > supported > > >> as an > > >> > > > extended > > >> > > > case of the Utf8View type: > `utf8_view(/*has_raw_pointers=*/true)`. > > >> > > > Branching on > > >> > > > this access pattern bit at the data type level has negligible > > >> impact on > > >> > > > performance since the branch resides outside any hot loops. > > Utility > > >> > > > functions > > >> > > > are provided for efficient (potentially in-place) conversion > > between > > >> > raw > > >> > > > pointer > > >> > > > and index offset views. For example, the C++ implementation > could > > >> zero > > >> > copy > > >> > > > a raw pointer array from Velox, filter it, then convert to > > >> > index/offset for > > >> > > > serialization. Other implementations may choose to accommodate > or > > >> > eschew > > >> > > > raw > > >> > > > pointer views as their communities direct. > > >> > > > > > >> > > > Where desirous in a rigorously controlled context this still > > enables > > >> > > > construction > > >> > > > and safe consumption of string view arrays which reference > memory > > >> not > > >> > > > directly bound to the lifetime of the array. I'm not sure when > or > > >> if we > > >> > > > would > > >> > > > find it useful to have arrays like this; I do not introduce any > in > > >> > [3]. I > > >> > > > mention > > >> > > > this possibility to highlight that if benchmarking demonstrates > > that > > >> > such > > >> > > > an > > >> > > > approach brings a significant performance benefit to some > > operation, > > >> > the > > >> > > > only > > >> > > > barrier to its adoption would be code review. Likewise if more > > >> > intensive > > >> > > > benchmarking determines that raw pointer views critically > > outperform > > >> > > > index/offset > > >> > > > views for real-world analytics tasks, prioritizing raw pointer > > >> string > > >> > views > > >> > > > for usage within the C++ implementation will be straightforward. > > >> > > > > > >> > > > See also the proposal to Velox that their string view vector be > > >> > refactored > > >> > > > in a similar vein [4]. > > >> > > > > > >> > > > Sincerely, > > >> > > > Ben Kietzman > > >> > > > > > >> > > > [1] > > >> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > > >> > > > [2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf > > >> > > > [3] https://github.com/apache/arrow/pull/35628 > > >> > > > [4] https://github.com/facebookincubator/velox/discussions/4362 > > >> > > > > > >> > > > >> > > > > > >