Before I say anything else I'll say that I am in favor of this new layout. There is some existing literature on the idea (e.g. umbra) and your benchmarks show some nice improvements.
Compared to some of the other layouts we've discussed recently (REE, list veiw) I do think this layout is more unique and fundamentally different. Perhaps most fundamentally different: * This is the first layout where the number of buffers depends on the data and not the schema. I think this is the most architecturally significant fact. It does require a (backwards compatible) change to the IPC format itself, beyond just adding new type codes. It also poses challenges in places where we've assumed there will be at most 3 buffers (e.g. in ArraySpan, though, as you have shown, we can work around this using a raw pointers representation internally in those spots). I think you've done some great work to integrate this well with Arrow-C++ and I'm convinced it can work. I would be interested in hearing some input from the Rust community. Ben, at one point there was some discussion that this might be a c-data only type. However, I believe that was based on the raw pointers representation. What you've proposed here, if I understand correctly, is an index + offsets representation and it is suitable for IPC correct? (e.g. I see that you have changes and examples in the IPC reader/writer) On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman <bengil...@gmail.com> wrote: > Hi Gang, > > I'm not sure what you mean, sorry if my answers are off base: > > Parquet's ByteArray will be unaffected by the addition of the string view > type; > all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and > with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays > during serialization to parquet [1]. > > If you mean that encoding of arrow::Type::STRING_VIEW will not be as fast > as encoding of equivalent arrow::Type::STRING, that's something I haven't > benchmarked so I can't answer definitively. I would expect it to be faster > than > first converting STRING_VIEW->STRING then encoding to parquet; direct > encoding avoids allocating and populating temporary buffers. Of course this > only applies to cases where you need to encode an array of STRING_VIEW to > parquet- encoding of STRING to parquet will be unaffected. > > Sincerely, > Ben > > [1] > > https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179 > > On Thu, Jun 15, 2023 at 10:34 PM Gang Wu <ust...@gmail.com> wrote: > > > Hi Ben, > > > > The posted benchmark [1] looks pretty good to me. However, I want to > > raise a possible issue from the perspective of parquet-cpp. Parquet-cpp > > uses a customized parquet::ByteArray type [2] for string/binary, I would > > expect some regression of conversions between parquet reader/writer > > and the proposed string view array, especially when some strings use > > short form and others use long form. > > > > [1] > > > > > https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575 > > [2] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 > > > > Best, > > Gang > > > > On Fri, Jun 16, 2023 at 3:58 AM Will Jones <will.jones...@gmail.com> > > wrote: > > > > > Cool. Thanks for doing that! > > > > > > On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman <bengil...@gmail.com> > > > wrote: > > > > > > > I've added https://github.com/apache/arrow/issues/36112 to track > > > > deduplication of buffers on write. > > > > I don't think it would require modification of the IPC format. > > > > > > > > Ben > > > > > > > > On Thu, Jun 15, 2023 at 1:30 PM Matt Topol <zotthewiz...@gmail.com> > > > wrote: > > > > > > > > > Based on my understanding, in theory a buffer *could* be shared > > within > > > a > > > > > batch since the flatbuffers message just uses an offset and length > to > > > > > identify the buffers. > > > > > > > > > > That said, I don't believe any current implementation actually does > > > this > > > > or > > > > > takes advantage of this in any meaningful way. > > > > > > > > > > --Matt > > > > > > > > > > On Thu, Jun 15, 2023 at 1:00 PM Will Jones < > will.jones...@gmail.com> > > > > > wrote: > > > > > > > > > > > Hi Ben, > > > > > > > > > > > > It's exciting to see this move along. > > > > > > > > > > > > The buffers will be duplicated. If buffer duplication is becomes > a > > > > > concern, > > > > > > > I'd prefer to handle > > > > > > > that in the ipc writer. Then buffers which are duplicated could > > be > > > > > > detected > > > > > > > by checking > > > > > > > pointer identity and written only once. > > > > > > > > > > > > > > > > > > Question: to be able to write buffer only once and reference in > > > > multiple > > > > > > arrays, does that require a change to the IPC format? Or is > sharing > > > > > buffers > > > > > > within the same batch already allowed in the IPC format? > > > > > > > > > > > > Best, > > > > > > > > > > > > Will Jones > > > > > > > > > > > > On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman < > > > bengil...@gmail.com > > > > > > > > > > > wrote: > > > > > > > > > > > > > Hello again all, > > > > > > > > > > > > > > The PR [1] to add string view to the format and the C++ > > > > implementation > > > > > is > > > > > > > hovering around passing CI and has been undrafted. Furthermore, > > > there > > > > > is > > > > > > > now also a PR [2] to add string view to the Go implementation. > > Code > > > > > > review > > > > > > > is underway for each PR and I'd like to move toward a vote for > > > > > > acceptance- > > > > > > > are there any other preliminaries which I've neglected? > > > > > > > > > > > > > > To reiterate the answers to some past questions: > > > > > > > - Benchmarks are added in the C++ PR [1] to demonstrate the > > > > performance > > > > > > of > > > > > > > conversion between the various string formats. In addition, > > there > > > > are > > > > > > > some benchmarks which demonstrate the performance gains > > available > > > > > with > > > > > > > the new format [3]. > > > > > > > - Adding string view to the C ABI is a natural follow up, but > > > should > > > > be > > > > > > > handled independently. An issue has been added to track that > > > > > > > enhancement [4]. > > > > > > > > > > > > > > Sincerely, > > > > > > > Ben Kietzman > > > > > > > > > > > > > > [1] https://github.com/apache/arrow/pull/35628 > > > > > > > [2] https://github.com/apache/arrow/pull/35769 > > > > > > > [3] > > > > https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 > > > > > > > [4] https://github.com/apache/arrow/issues/36099 > > > > > > > > > > > > > > On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman < > > > > > bengil...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > @Jacob > > > > > > > > > You mention benchmarks multiple times, are these results > > > > published > > > > > > > > somewhere? > > > > > > > > > > > > > > > > I benchmarked the performance of raw pointer vs index offset > > > views > > > > in > > > > > > my > > > > > > > > PR to velox, > > > > > > > > I do intend to port them to my arrow PR but I haven't gotten > > > there > > > > > yet. > > > > > > > > Furthermore, it > > > > > > > > seemed less urgent to me since coexistence of the two types > in > > > the > > > > > c++ > > > > > > > > implementation > > > > > > > > defers the question of how aggressively one should be > preferred > > > > over > > > > > > the > > > > > > > > other. > > > > > > > > > > > > > > > > @Dewey > > > > > > > > > I don't see the C Data interface in the PR > > > > > > > > > > > > > > > > I have not addressed the C ABI in this PR. As you mention, it > > may > > > > be > > > > > > > > useful to transmit > > > > > > > > arrays with raw pointer views between implementations which > > allow > > > > > > them. I > > > > > > > > can address > > > > > > > > this in a follow up PR. > > > > > > > > > > > > > > > > @Will > > > > > > > > > If I understand correctly, multiple arrays can reference > the > > > same > > > > > > > buffers > > > > > > > > > in memory, but once they are written to IPC their data > > buffers > > > > will > > > > > > be > > > > > > > > > duplicated. Is that right? > > > > > > > > The buffers will be duplicated. If buffer duplication is > > becomes > > > a > > > > > > > > concern, I'd prefer to handle > > > > > > > > that in the ipc writer. Then buffers which are duplicated > could > > > be > > > > > > > > detected by checking > > > > > > > > pointer identity and written only once. > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 17, 2023 at 12:07 AM Will Jones < > > > > will.jones...@gmail.com > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > >> Hello Ben, > > > > > > > >> > > > > > > > >> Thanks for your work on this. I think this will be an > > excellent > > > > > > addition > > > > > > > >> to > > > > > > > >> the format. > > > > > > > >> > > > > > > > >> If I understand correctly, multiple arrays can reference the > > > same > > > > > > > buffers > > > > > > > >> in memory, but once they are written to IPC their data > buffers > > > > will > > > > > be > > > > > > > >> duplicated. Is that right? > > > > > > > >> > > > > > > > >> Dictionary types have a special message so they can be > reused > > > > across > > > > > > > >> batches and even fields. Did we consider adding a similar > > > message > > > > > for > > > > > > > >> string view buffers? > > > > > > > >> > > > > > > > >> One relevant use case I'm curious about is substring > > extraction. > > > > For > > > > > > > >> example, if I have a column of URIs and I create columns > where > > > > I've > > > > > > > >> extracted substrings like the hostname, path, and a list > > column > > > of > > > > > > query > > > > > > > >> parameters, I'd like for those latter columns to be views > into > > > the > > > > > URI > > > > > > > >> buffers, rather than full copies. > > > > > > > >> > > > > > > > >> However, I've never touched the IPC read code paths, so it's > > > quite > > > > > > > >> possible > > > > > > > >> I'm overlooking something obvious. > > > > > > > >> > > > > > > > >> Best, > > > > > > > >> > > > > > > > >> Will Jones > > > > > > > >> > > > > > > > >> > > > > > > > >> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington > > > > > > > >> <de...@voltrondata.com.invalid> wrote: > > > > > > > >> > > > > > > > >> > Very cool! > > > > > > > >> > > > > > > > > >> > In addition to performance mentioned above, I could see > this > > > > being > > > > > > > >> > useful for the R bindings - we already have a global > string > > > pool > > > > > > and a > > > > > > > >> > mechanism for keeping a vector of them alive. > > > > > > > >> > > > > > > > > >> > I don't see the C Data interface in the PR although I may > > have > > > > > > missed > > > > > > > >> > it - is that a part of the proposal? It seems like it > would > > be > > > > > > > >> > possible to use raw pointers as long as they can be > > guaranteed > > > > to > > > > > be > > > > > > > >> > valid until the release callback is called? > > > > > > > >> > > > > > > > > >> > On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak > > > > > > > >> > <ja...@voltrondata.com.invalid> wrote: > > > > > > > >> > > > > > > > > > >> > > Hello Everyone, > > > > > > > >> > > I think keeping interoperability with the large > ecosystem > > > is a > > > > > > very > > > > > > > >> > > important goal for arrow so I am overall in favor of > this > > > > > > proposal! > > > > > > > >> > > > > > > > > > >> > > You mention benchmarks multiple times, are these results > > > > > published > > > > > > > >> > > somewhere? > > > > > > > >> > > > > > > > > > >> > > Thanks > > > > > > > >> > > > > > > > > > >> > > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman < > > > > > > > >> bengil...@gmail.com> > > > > > > > >> > > wrote: > > > > > > > >> > > > > > > > > > >> > > > Hello all, > > > > > > > >> > > > > > > > > > > >> > > > As previously discussed on this list [1], an > > > > > > UmbraDB/DuckDB/Velox > > > > > > > >> > > > compatible > > > > > > > >> > > > "string view" type could bring several performance > > > benefits > > > > to > > > > > > > >> access > > > > > > > >> > and > > > > > > > >> > > > authoring of string data in the arrow format [2]. > > > > Additionally > > > > > > > >> better > > > > > > > >> > > > interoperability with engines already using this > format > > > > could > > > > > be > > > > > > > >> > > > established. > > > > > > > >> > > > > > > > > > > >> > > > PR #0 [3] adds Utf8View and BinaryView types to the > C++ > > > > > > > >> implementation > > > > > > > >> > and > > > > > > > >> > > > to > > > > > > > >> > > > the IPC format. For the purposes of IPC raw pointers > are > > > not > > > > > > used. > > > > > > > >> > Instead, > > > > > > > >> > > > each view contains a pair of 32 bit unsigned integers > > > which > > > > > > encode > > > > > > > >> the > > > > > > > >> > > > index of > > > > > > > >> > > > a character buffer (string view arrays may consist of > a > > > > > variable > > > > > > > >> > number of > > > > > > > >> > > > such buffers) and the offset of a view's data within > > that > > > > > buffer > > > > > > > >> > > > respectively. > > > > > > > >> > > > Benefits of this substitution include: > > > > > > > >> > > > - This makes explicit the guarantee that lifetime of > all > > > > > > character > > > > > > > >> > data is > > > > > > > >> > > > equal > > > > > > > >> > > > to that of the array which views it, which is > critical > > > for > > > > > > > >> confident > > > > > > > >> > > > consumption across an interface boundary. > > > > > > > >> > > > - As with other types in the arrow format, such arrays > > are > > > > > > > >> > serializable and > > > > > > > >> > > > venue agnostic; directly usable in shared memory > > without > > > > > > > >> > modification. > > > > > > > >> > > > - Indices and offsets are easily validated. > > > > > > > >> > > > > > > > > > > >> > > > Accessing the data requires some trivial pointer > > > arithmetic, > > > > > but > > > > > > > in > > > > > > > >> > > > benchmarking > > > > > > > >> > > > this had negligible impact on sequential access and > only > > > > minor > > > > > > > >> impact > > > > > > > >> > on > > > > > > > >> > > > random > > > > > > > >> > > > access. > > > > > > > >> > > > > > > > > > > >> > > > In the C++ implementation, raw pointer string views > are > > > > > > supported > > > > > > > >> as an > > > > > > > >> > > > extended > > > > > > > >> > > > case of the Utf8View type: > > > > > > `utf8_view(/*has_raw_pointers=*/true)`. > > > > > > > >> > > > Branching on > > > > > > > >> > > > this access pattern bit at the data type level has > > > > negligible > > > > > > > >> impact on > > > > > > > >> > > > performance since the branch resides outside any hot > > > loops. > > > > > > > Utility > > > > > > > >> > > > functions > > > > > > > >> > > > are provided for efficient (potentially in-place) > > > conversion > > > > > > > between > > > > > > > >> > raw > > > > > > > >> > > > pointer > > > > > > > >> > > > and index offset views. For example, the C++ > > > implementation > > > > > > could > > > > > > > >> zero > > > > > > > >> > copy > > > > > > > >> > > > a raw pointer array from Velox, filter it, then > convert > > to > > > > > > > >> > index/offset for > > > > > > > >> > > > serialization. Other implementations may choose to > > > > accommodate > > > > > > or > > > > > > > >> > eschew > > > > > > > >> > > > raw > > > > > > > >> > > > pointer views as their communities direct. > > > > > > > >> > > > > > > > > > > >> > > > Where desirous in a rigorously controlled context this > > > still > > > > > > > enables > > > > > > > >> > > > construction > > > > > > > >> > > > and safe consumption of string view arrays which > > reference > > > > > > memory > > > > > > > >> not > > > > > > > >> > > > directly bound to the lifetime of the array. I'm not > > sure > > > > when > > > > > > or > > > > > > > >> if we > > > > > > > >> > > > would > > > > > > > >> > > > find it useful to have arrays like this; I do not > > > introduce > > > > > any > > > > > > in > > > > > > > >> > [3]. I > > > > > > > >> > > > mention > > > > > > > >> > > > this possibility to highlight that if benchmarking > > > > > demonstrates > > > > > > > that > > > > > > > >> > such > > > > > > > >> > > > an > > > > > > > >> > > > approach brings a significant performance benefit to > > some > > > > > > > operation, > > > > > > > >> > the > > > > > > > >> > > > only > > > > > > > >> > > > barrier to its adoption would be code review. Likewise > > if > > > > more > > > > > > > >> > intensive > > > > > > > >> > > > benchmarking determines that raw pointer views > > critically > > > > > > > outperform > > > > > > > >> > > > index/offset > > > > > > > >> > > > views for real-world analytics tasks, prioritizing raw > > > > pointer > > > > > > > >> string > > > > > > > >> > views > > > > > > > >> > > > for usage within the C++ implementation will be > > > > > straightforward. > > > > > > > >> > > > > > > > > > > >> > > > See also the proposal to Velox that their string view > > > vector > > > > > be > > > > > > > >> > refactored > > > > > > > >> > > > in a similar vein [4]. > > > > > > > >> > > > > > > > > > > >> > > > Sincerely, > > > > > > > >> > > > Ben Kietzman > > > > > > > >> > > > > > > > > > > >> > > > [1] > > > > > > > >> > > > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > > > > > > > >> > > > [2] > > > > http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf > > > > > > > >> > > > [3] https://github.com/apache/arrow/pull/35628 > > > > > > > >> > > > [4] > > > > > https://github.com/facebookincubator/velox/discussions/4362 > > > > > > > >> > > > > > > > > > > >> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >