Hi Gang, I'm not sure what you mean, sorry if my answers are off base:
Parquet's ByteArray will be unaffected by the addition of the string view type; all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays during serialization to parquet [1]. If you mean that encoding of arrow::Type::STRING_VIEW will not be as fast as encoding of equivalent arrow::Type::STRING, that's something I haven't benchmarked so I can't answer definitively. I would expect it to be faster than first converting STRING_VIEW->STRING then encoding to parquet; direct encoding avoids allocating and populating temporary buffers. Of course this only applies to cases where you need to encode an array of STRING_VIEW to parquet- encoding of STRING to parquet will be unaffected. Sincerely, Ben [1] https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179 On Thu, Jun 15, 2023 at 10:34 PM Gang Wu <ust...@gmail.com> wrote: > Hi Ben, > > The posted benchmark [1] looks pretty good to me. However, I want to > raise a possible issue from the perspective of parquet-cpp. Parquet-cpp > uses a customized parquet::ByteArray type [2] for string/binary, I would > expect some regression of conversions between parquet reader/writer > and the proposed string view array, especially when some strings use > short form and others use long form. > > [1] > > https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575 > [2] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 > > Best, > Gang > > On Fri, Jun 16, 2023 at 3:58 AM Will Jones <will.jones...@gmail.com> > wrote: > > > Cool. Thanks for doing that! > > > > On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman <bengil...@gmail.com> > > wrote: > > > > > I've added https://github.com/apache/arrow/issues/36112 to track > > > deduplication of buffers on write. > > > I don't think it would require modification of the IPC format. > > > > > > Ben > > > > > > On Thu, Jun 15, 2023 at 1:30 PM Matt Topol <zotthewiz...@gmail.com> > > wrote: > > > > > > > Based on my understanding, in theory a buffer *could* be shared > within > > a > > > > batch since the flatbuffers message just uses an offset and length to > > > > identify the buffers. > > > > > > > > That said, I don't believe any current implementation actually does > > this > > > or > > > > takes advantage of this in any meaningful way. > > > > > > > > --Matt > > > > > > > > On Thu, Jun 15, 2023 at 1:00 PM Will Jones <will.jones...@gmail.com> > > > > wrote: > > > > > > > > > Hi Ben, > > > > > > > > > > It's exciting to see this move along. > > > > > > > > > > The buffers will be duplicated. If buffer duplication is becomes a > > > > concern, > > > > > > I'd prefer to handle > > > > > > that in the ipc writer. Then buffers which are duplicated could > be > > > > > detected > > > > > > by checking > > > > > > pointer identity and written only once. > > > > > > > > > > > > > > > Question: to be able to write buffer only once and reference in > > > multiple > > > > > arrays, does that require a change to the IPC format? Or is sharing > > > > buffers > > > > > within the same batch already allowed in the IPC format? > > > > > > > > > > Best, > > > > > > > > > > Will Jones > > > > > > > > > > On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman < > > bengil...@gmail.com > > > > > > > > > wrote: > > > > > > > > > > > Hello again all, > > > > > > > > > > > > The PR [1] to add string view to the format and the C++ > > > implementation > > > > is > > > > > > hovering around passing CI and has been undrafted. Furthermore, > > there > > > > is > > > > > > now also a PR [2] to add string view to the Go implementation. > Code > > > > > review > > > > > > is underway for each PR and I'd like to move toward a vote for > > > > > acceptance- > > > > > > are there any other preliminaries which I've neglected? > > > > > > > > > > > > To reiterate the answers to some past questions: > > > > > > - Benchmarks are added in the C++ PR [1] to demonstrate the > > > performance > > > > > of > > > > > > conversion between the various string formats. In addition, > there > > > are > > > > > > some benchmarks which demonstrate the performance gains > available > > > > with > > > > > > the new format [3]. > > > > > > - Adding string view to the C ABI is a natural follow up, but > > should > > > be > > > > > > handled independently. An issue has been added to track that > > > > > > enhancement [4]. > > > > > > > > > > > > Sincerely, > > > > > > Ben Kietzman > > > > > > > > > > > > [1] https://github.com/apache/arrow/pull/35628 > > > > > > [2] https://github.com/apache/arrow/pull/35769 > > > > > > [3] > > > https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 > > > > > > [4] https://github.com/apache/arrow/issues/36099 > > > > > > > > > > > > On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman < > > > > bengil...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > @Jacob > > > > > > > > You mention benchmarks multiple times, are these results > > > published > > > > > > > somewhere? > > > > > > > > > > > > > > I benchmarked the performance of raw pointer vs index offset > > views > > > in > > > > > my > > > > > > > PR to velox, > > > > > > > I do intend to port them to my arrow PR but I haven't gotten > > there > > > > yet. > > > > > > > Furthermore, it > > > > > > > seemed less urgent to me since coexistence of the two types in > > the > > > > c++ > > > > > > > implementation > > > > > > > defers the question of how aggressively one should be preferred > > > over > > > > > the > > > > > > > other. > > > > > > > > > > > > > > @Dewey > > > > > > > > I don't see the C Data interface in the PR > > > > > > > > > > > > > > I have not addressed the C ABI in this PR. As you mention, it > may > > > be > > > > > > > useful to transmit > > > > > > > arrays with raw pointer views between implementations which > allow > > > > > them. I > > > > > > > can address > > > > > > > this in a follow up PR. > > > > > > > > > > > > > > @Will > > > > > > > > If I understand correctly, multiple arrays can reference the > > same > > > > > > buffers > > > > > > > > in memory, but once they are written to IPC their data > buffers > > > will > > > > > be > > > > > > > > duplicated. Is that right? > > > > > > > The buffers will be duplicated. If buffer duplication is > becomes > > a > > > > > > > concern, I'd prefer to handle > > > > > > > that in the ipc writer. Then buffers which are duplicated could > > be > > > > > > > detected by checking > > > > > > > pointer identity and written only once. > > > > > > > > > > > > > > > > > > > > > On Wed, May 17, 2023 at 12:07 AM Will Jones < > > > will.jones...@gmail.com > > > > > > > > > > > > wrote: > > > > > > > > > > > > > >> Hello Ben, > > > > > > >> > > > > > > >> Thanks for your work on this. I think this will be an > excellent > > > > > addition > > > > > > >> to > > > > > > >> the format. > > > > > > >> > > > > > > >> If I understand correctly, multiple arrays can reference the > > same > > > > > > buffers > > > > > > >> in memory, but once they are written to IPC their data buffers > > > will > > > > be > > > > > > >> duplicated. Is that right? > > > > > > >> > > > > > > >> Dictionary types have a special message so they can be reused > > > across > > > > > > >> batches and even fields. Did we consider adding a similar > > message > > > > for > > > > > > >> string view buffers? > > > > > > >> > > > > > > >> One relevant use case I'm curious about is substring > extraction. > > > For > > > > > > >> example, if I have a column of URIs and I create columns where > > > I've > > > > > > >> extracted substrings like the hostname, path, and a list > column > > of > > > > > query > > > > > > >> parameters, I'd like for those latter columns to be views into > > the > > > > URI > > > > > > >> buffers, rather than full copies. > > > > > > >> > > > > > > >> However, I've never touched the IPC read code paths, so it's > > quite > > > > > > >> possible > > > > > > >> I'm overlooking something obvious. > > > > > > >> > > > > > > >> Best, > > > > > > >> > > > > > > >> Will Jones > > > > > > >> > > > > > > >> > > > > > > >> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington > > > > > > >> <de...@voltrondata.com.invalid> wrote: > > > > > > >> > > > > > > >> > Very cool! > > > > > > >> > > > > > > > >> > In addition to performance mentioned above, I could see this > > > being > > > > > > >> > useful for the R bindings - we already have a global string > > pool > > > > > and a > > > > > > >> > mechanism for keeping a vector of them alive. > > > > > > >> > > > > > > > >> > I don't see the C Data interface in the PR although I may > have > > > > > missed > > > > > > >> > it - is that a part of the proposal? It seems like it would > be > > > > > > >> > possible to use raw pointers as long as they can be > guaranteed > > > to > > > > be > > > > > > >> > valid until the release callback is called? > > > > > > >> > > > > > > > >> > On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak > > > > > > >> > <ja...@voltrondata.com.invalid> wrote: > > > > > > >> > > > > > > > > >> > > Hello Everyone, > > > > > > >> > > I think keeping interoperability with the large ecosystem > > is a > > > > > very > > > > > > >> > > important goal for arrow so I am overall in favor of this > > > > > proposal! > > > > > > >> > > > > > > > > >> > > You mention benchmarks multiple times, are these results > > > > published > > > > > > >> > > somewhere? > > > > > > >> > > > > > > > > >> > > Thanks > > > > > > >> > > > > > > > > >> > > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman < > > > > > > >> bengil...@gmail.com> > > > > > > >> > > wrote: > > > > > > >> > > > > > > > > >> > > > Hello all, > > > > > > >> > > > > > > > > > >> > > > As previously discussed on this list [1], an > > > > > UmbraDB/DuckDB/Velox > > > > > > >> > > > compatible > > > > > > >> > > > "string view" type could bring several performance > > benefits > > > to > > > > > > >> access > > > > > > >> > and > > > > > > >> > > > authoring of string data in the arrow format [2]. > > > Additionally > > > > > > >> better > > > > > > >> > > > interoperability with engines already using this format > > > could > > > > be > > > > > > >> > > > established. > > > > > > >> > > > > > > > > > >> > > > PR #0 [3] adds Utf8View and BinaryView types to the C++ > > > > > > >> implementation > > > > > > >> > and > > > > > > >> > > > to > > > > > > >> > > > the IPC format. For the purposes of IPC raw pointers are > > not > > > > > used. > > > > > > >> > Instead, > > > > > > >> > > > each view contains a pair of 32 bit unsigned integers > > which > > > > > encode > > > > > > >> the > > > > > > >> > > > index of > > > > > > >> > > > a character buffer (string view arrays may consist of a > > > > variable > > > > > > >> > number of > > > > > > >> > > > such buffers) and the offset of a view's data within > that > > > > buffer > > > > > > >> > > > respectively. > > > > > > >> > > > Benefits of this substitution include: > > > > > > >> > > > - This makes explicit the guarantee that lifetime of all > > > > > character > > > > > > >> > data is > > > > > > >> > > > equal > > > > > > >> > > > to that of the array which views it, which is critical > > for > > > > > > >> confident > > > > > > >> > > > consumption across an interface boundary. > > > > > > >> > > > - As with other types in the arrow format, such arrays > are > > > > > > >> > serializable and > > > > > > >> > > > venue agnostic; directly usable in shared memory > without > > > > > > >> > modification. > > > > > > >> > > > - Indices and offsets are easily validated. > > > > > > >> > > > > > > > > > >> > > > Accessing the data requires some trivial pointer > > arithmetic, > > > > but > > > > > > in > > > > > > >> > > > benchmarking > > > > > > >> > > > this had negligible impact on sequential access and only > > > minor > > > > > > >> impact > > > > > > >> > on > > > > > > >> > > > random > > > > > > >> > > > access. > > > > > > >> > > > > > > > > > >> > > > In the C++ implementation, raw pointer string views are > > > > > supported > > > > > > >> as an > > > > > > >> > > > extended > > > > > > >> > > > case of the Utf8View type: > > > > > `utf8_view(/*has_raw_pointers=*/true)`. > > > > > > >> > > > Branching on > > > > > > >> > > > this access pattern bit at the data type level has > > > negligible > > > > > > >> impact on > > > > > > >> > > > performance since the branch resides outside any hot > > loops. > > > > > > Utility > > > > > > >> > > > functions > > > > > > >> > > > are provided for efficient (potentially in-place) > > conversion > > > > > > between > > > > > > >> > raw > > > > > > >> > > > pointer > > > > > > >> > > > and index offset views. For example, the C++ > > implementation > > > > > could > > > > > > >> zero > > > > > > >> > copy > > > > > > >> > > > a raw pointer array from Velox, filter it, then convert > to > > > > > > >> > index/offset for > > > > > > >> > > > serialization. Other implementations may choose to > > > accommodate > > > > > or > > > > > > >> > eschew > > > > > > >> > > > raw > > > > > > >> > > > pointer views as their communities direct. > > > > > > >> > > > > > > > > > >> > > > Where desirous in a rigorously controlled context this > > still > > > > > > enables > > > > > > >> > > > construction > > > > > > >> > > > and safe consumption of string view arrays which > reference > > > > > memory > > > > > > >> not > > > > > > >> > > > directly bound to the lifetime of the array. I'm not > sure > > > when > > > > > or > > > > > > >> if we > > > > > > >> > > > would > > > > > > >> > > > find it useful to have arrays like this; I do not > > introduce > > > > any > > > > > in > > > > > > >> > [3]. I > > > > > > >> > > > mention > > > > > > >> > > > this possibility to highlight that if benchmarking > > > > demonstrates > > > > > > that > > > > > > >> > such > > > > > > >> > > > an > > > > > > >> > > > approach brings a significant performance benefit to > some > > > > > > operation, > > > > > > >> > the > > > > > > >> > > > only > > > > > > >> > > > barrier to its adoption would be code review. Likewise > if > > > more > > > > > > >> > intensive > > > > > > >> > > > benchmarking determines that raw pointer views > critically > > > > > > outperform > > > > > > >> > > > index/offset > > > > > > >> > > > views for real-world analytics tasks, prioritizing raw > > > pointer > > > > > > >> string > > > > > > >> > views > > > > > > >> > > > for usage within the C++ implementation will be > > > > straightforward. > > > > > > >> > > > > > > > > > >> > > > See also the proposal to Velox that their string view > > vector > > > > be > > > > > > >> > refactored > > > > > > >> > > > in a similar vein [4]. > > > > > > >> > > > > > > > > > >> > > > Sincerely, > > > > > > >> > > > Ben Kietzman > > > > > > >> > > > > > > > > > >> > > > [1] > > > > > > >> > > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > > > > > > >> > > > [2] > > > http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf > > > > > > >> > > > [3] https://github.com/apache/arrow/pull/35628 > > > > > > >> > > > [4] > > > > https://github.com/facebookincubator/velox/discussions/4362 > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > >