Hi Ben, It's exciting to see this move along.
The buffers will be duplicated. If buffer duplication is becomes a concern, > I'd prefer to handle > that in the ipc writer. Then buffers which are duplicated could be detected > by checking > pointer identity and written only once. Question: to be able to write buffer only once and reference in multiple arrays, does that require a change to the IPC format? Or is sharing buffers within the same batch already allowed in the IPC format? Best, Will Jones On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <bengil...@gmail.com> wrote: > Hello again all, > > The PR [1] to add string view to the format and the C++ implementation is > hovering around passing CI and has been undrafted. Furthermore, there is > now also a PR [2] to add string view to the Go implementation. Code review > is underway for each PR and I'd like to move toward a vote for acceptance- > are there any other preliminaries which I've neglected? > > To reiterate the answers to some past questions: > - Benchmarks are added in the C++ PR [1] to demonstrate the performance of > conversion between the various string formats. In addition, there are > some benchmarks which demonstrate the performance gains available with > the new format [3]. > - Adding string view to the C ABI is a natural follow up, but should be > handled independently. An issue has been added to track that > enhancement [4]. > > Sincerely, > Ben Kietzman > > [1] https://github.com/apache/arrow/pull/35628 > [2] https://github.com/apache/arrow/pull/35769 > [3] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 > [4] https://github.com/apache/arrow/issues/36099 > > On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <bengil...@gmail.com> > wrote: > > > @Jacob > > > You mention benchmarks multiple times, are these results published > > somewhere? > > > > I benchmarked the performance of raw pointer vs index offset views in my > > PR to velox, > > I do intend to port them to my arrow PR but I haven't gotten there yet. > > Furthermore, it > > seemed less urgent to me since coexistence of the two types in the c++ > > implementation > > defers the question of how aggressively one should be preferred over the > > other. > > > > @Dewey > > > I don't see the C Data interface in the PR > > > > I have not addressed the C ABI in this PR. As you mention, it may be > > useful to transmit > > arrays with raw pointer views between implementations which allow them. I > > can address > > this in a follow up PR. > > > > @Will > > > If I understand correctly, multiple arrays can reference the same > buffers > > > in memory, but once they are written to IPC their data buffers will be > > > duplicated. Is that right? > > The buffers will be duplicated. If buffer duplication is becomes a > > concern, I'd prefer to handle > > that in the ipc writer. Then buffers which are duplicated could be > > detected by checking > > pointer identity and written only once. > > > > > > On Wed, May 17, 2023 at 12:07 AM Will Jones <will.jones...@gmail.com> > > wrote: > > > >> Hello Ben, > >> > >> Thanks for your work on this. I think this will be an excellent addition > >> to > >> the format. > >> > >> If I understand correctly, multiple arrays can reference the same > buffers > >> in memory, but once they are written to IPC their data buffers will be > >> duplicated. Is that right? > >> > >> Dictionary types have a special message so they can be reused across > >> batches and even fields. Did we consider adding a similar message for > >> string view buffers? > >> > >> One relevant use case I'm curious about is substring extraction. For > >> example, if I have a column of URIs and I create columns where I've > >> extracted substrings like the hostname, path, and a list column of query > >> parameters, I'd like for those latter columns to be views into the URI > >> buffers, rather than full copies. > >> > >> However, I've never touched the IPC read code paths, so it's quite > >> possible > >> I'm overlooking something obvious. > >> > >> Best, > >> > >> Will Jones > >> > >> > >> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington > >> <de...@voltrondata.com.invalid> wrote: > >> > >> > Very cool! > >> > > >> > In addition to performance mentioned above, I could see this being > >> > useful for the R bindings - we already have a global string pool and a > >> > mechanism for keeping a vector of them alive. > >> > > >> > I don't see the C Data interface in the PR although I may have missed > >> > it - is that a part of the proposal? It seems like it would be > >> > possible to use raw pointers as long as they can be guaranteed to be > >> > valid until the release callback is called? > >> > > >> > On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak > >> > <ja...@voltrondata.com.invalid> wrote: > >> > > > >> > > Hello Everyone, > >> > > I think keeping interoperability with the large ecosystem is a very > >> > > important goal for arrow so I am overall in favor of this proposal! > >> > > > >> > > You mention benchmarks multiple times, are these results published > >> > > somewhere? > >> > > > >> > > Thanks > >> > > > >> > > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman < > >> bengil...@gmail.com> > >> > > wrote: > >> > > > >> > > > Hello all, > >> > > > > >> > > > As previously discussed on this list [1], an UmbraDB/DuckDB/Velox > >> > > > compatible > >> > > > "string view" type could bring several performance benefits to > >> access > >> > and > >> > > > authoring of string data in the arrow format [2]. Additionally > >> better > >> > > > interoperability with engines already using this format could be > >> > > > established. > >> > > > > >> > > > PR #0 [3] adds Utf8View and BinaryView types to the C++ > >> implementation > >> > and > >> > > > to > >> > > > the IPC format. For the purposes of IPC raw pointers are not used. > >> > Instead, > >> > > > each view contains a pair of 32 bit unsigned integers which encode > >> the > >> > > > index of > >> > > > a character buffer (string view arrays may consist of a variable > >> > number of > >> > > > such buffers) and the offset of a view's data within that buffer > >> > > > respectively. > >> > > > Benefits of this substitution include: > >> > > > - This makes explicit the guarantee that lifetime of all character > >> > data is > >> > > > equal > >> > > > to that of the array which views it, which is critical for > >> confident > >> > > > consumption across an interface boundary. > >> > > > - As with other types in the arrow format, such arrays are > >> > serializable and > >> > > > venue agnostic; directly usable in shared memory without > >> > modification. > >> > > > - Indices and offsets are easily validated. > >> > > > > >> > > > Accessing the data requires some trivial pointer arithmetic, but > in > >> > > > benchmarking > >> > > > this had negligible impact on sequential access and only minor > >> impact > >> > on > >> > > > random > >> > > > access. > >> > > > > >> > > > In the C++ implementation, raw pointer string views are supported > >> as an > >> > > > extended > >> > > > case of the Utf8View type: `utf8_view(/*has_raw_pointers=*/true)`. > >> > > > Branching on > >> > > > this access pattern bit at the data type level has negligible > >> impact on > >> > > > performance since the branch resides outside any hot loops. > Utility > >> > > > functions > >> > > > are provided for efficient (potentially in-place) conversion > between > >> > raw > >> > > > pointer > >> > > > and index offset views. For example, the C++ implementation could > >> zero > >> > copy > >> > > > a raw pointer array from Velox, filter it, then convert to > >> > index/offset for > >> > > > serialization. Other implementations may choose to accommodate or > >> > eschew > >> > > > raw > >> > > > pointer views as their communities direct. > >> > > > > >> > > > Where desirous in a rigorously controlled context this still > enables > >> > > > construction > >> > > > and safe consumption of string view arrays which reference memory > >> not > >> > > > directly bound to the lifetime of the array. I'm not sure when or > >> if we > >> > > > would > >> > > > find it useful to have arrays like this; I do not introduce any in > >> > [3]. I > >> > > > mention > >> > > > this possibility to highlight that if benchmarking demonstrates > that > >> > such > >> > > > an > >> > > > approach brings a significant performance benefit to some > operation, > >> > the > >> > > > only > >> > > > barrier to its adoption would be code review. Likewise if more > >> > intensive > >> > > > benchmarking determines that raw pointer views critically > outperform > >> > > > index/offset > >> > > > views for real-world analytics tasks, prioritizing raw pointer > >> string > >> > views > >> > > > for usage within the C++ implementation will be straightforward. > >> > > > > >> > > > See also the proposal to Velox that their string view vector be > >> > refactored > >> > > > in a similar vein [4]. > >> > > > > >> > > > Sincerely, > >> > > > Ben Kietzman > >> > > > > >> > > > [1] > >> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > >> > > > [2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf > >> > > > [3] https://github.com/apache/arrow/pull/35628 > >> > > > [4] https://github.com/facebookincubator/velox/discussions/4362 > >> > > > > >> > > >> > > >