> * This is the first layout where the number of buffers depends on the data > and not the schema. I think this is the most architecturally significant > fact. I
I have spent some time reading the initial proposal -- thank you for that. I now understand what Weston was saying about the "variable numbers of buffers". I wonder if you considered restricting such arrays to a single buffer (so as to make them more similar to other arrow array types that have a fixed number of buffers)? On Tue, Jun 20, 2023 at 11:33 AM Weston Pace <weston.p...@gmail.com> wrote: > Before I say anything else I'll say that I am in favor of this new layout. > There is some existing literature on the idea (e.g. umbra) and your > benchmarks show some nice improvements. > > Compared to some of the other layouts we've discussed recently (REE, list > veiw) I do think this layout is more unique and fundamentally different. > Perhaps most fundamentally different: > > * This is the first layout where the number of buffers depends on the data > and not the schema. I think this is the most architecturally significant > fact. It does require a (backwards compatible) change to the IPC format > itself, beyond just adding new type codes. It also poses challenges in > places where we've assumed there will be at most 3 buffers (e.g. in > ArraySpan, though, as you have shown, we can work around this using a raw > pointers representation internally in those spots). > > I think you've done some great work to integrate this well with Arrow-C++ > and I'm convinced it can work. > > I would be interested in hearing some input from the Rust community. > > Ben, at one point there was some discussion that this might be a c-data > only type. However, I believe that was based on the raw pointers > representation. What you've proposed here, if I understand correctly, is > an index + offsets representation and it is suitable for IPC correct? > (e.g. I see that you have changes and examples in the IPC reader/writer) > > On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman <bengil...@gmail.com> > wrote: > > > Hi Gang, > > > > I'm not sure what you mean, sorry if my answers are off base: > > > > Parquet's ByteArray will be unaffected by the addition of the string view > > type; > > all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and > > with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays > > during serialization to parquet [1]. > > > > If you mean that encoding of arrow::Type::STRING_VIEW will not be as fast > > as encoding of equivalent arrow::Type::STRING, that's something I haven't > > benchmarked so I can't answer definitively. I would expect it to be > faster > > than > > first converting STRING_VIEW->STRING then encoding to parquet; direct > > encoding avoids allocating and populating temporary buffers. Of course > this > > only applies to cases where you need to encode an array of STRING_VIEW to > > parquet- encoding of STRING to parquet will be unaffected. > > > > Sincerely, > > Ben > > > > [1] > > > > > https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179 > > > > On Thu, Jun 15, 2023 at 10:34 PM Gang Wu <ust...@gmail.com> wrote: > > > > > Hi Ben, > > > > > > The posted benchmark [1] looks pretty good to me. However, I want to > > > raise a possible issue from the perspective of parquet-cpp. Parquet-cpp > > > uses a customized parquet::ByteArray type [2] for string/binary, I > would > > > expect some regression of conversions between parquet reader/writer > > > and the proposed string view array, especially when some strings use > > > short form and others use long form. > > > > > > [1] > > > > > > > > > https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575 > > > [2] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 > > > > > > Best, > > > Gang > > > > > > On Fri, Jun 16, 2023 at 3:58 AM Will Jones <will.jones...@gmail.com> > > > wrote: > > > > > > > Cool. Thanks for doing that! > > > > > > > > On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman <bengil...@gmail.com > > > > > > wrote: > > > > > > > > > I've added https://github.com/apache/arrow/issues/36112 to track > > > > > deduplication of buffers on write. > > > > > I don't think it would require modification of the IPC format. > > > > > > > > > > Ben > > > > > > > > > > On Thu, Jun 15, 2023 at 1:30 PM Matt Topol <zotthewiz...@gmail.com > > > > > > wrote: > > > > > > > > > > > Based on my understanding, in theory a buffer *could* be shared > > > within > > > > a > > > > > > batch since the flatbuffers message just uses an offset and > length > > to > > > > > > identify the buffers. > > > > > > > > > > > > That said, I don't believe any current implementation actually > does > > > > this > > > > > or > > > > > > takes advantage of this in any meaningful way. > > > > > > > > > > > > --Matt > > > > > > > > > > > > On Thu, Jun 15, 2023 at 1:00 PM Will Jones < > > will.jones...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > Hi Ben, > > > > > > > > > > > > > > It's exciting to see this move along. > > > > > > > > > > > > > > The buffers will be duplicated. If buffer duplication is > becomes > > a > > > > > > concern, > > > > > > > > I'd prefer to handle > > > > > > > > that in the ipc writer. Then buffers which are duplicated > could > > > be > > > > > > > detected > > > > > > > > by checking > > > > > > > > pointer identity and written only once. > > > > > > > > > > > > > > > > > > > > > Question: to be able to write buffer only once and reference in > > > > > multiple > > > > > > > arrays, does that require a change to the IPC format? Or is > > sharing > > > > > > buffers > > > > > > > within the same batch already allowed in the IPC format? > > > > > > > > > > > > > > Best, > > > > > > > > > > > > > > Will Jones > > > > > > > > > > > > > > On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman < > > > > bengil...@gmail.com > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Hello again all, > > > > > > > > > > > > > > > > The PR [1] to add string view to the format and the C++ > > > > > implementation > > > > > > is > > > > > > > > hovering around passing CI and has been undrafted. > Furthermore, > > > > there > > > > > > is > > > > > > > > now also a PR [2] to add string view to the Go > implementation. > > > Code > > > > > > > review > > > > > > > > is underway for each PR and I'd like to move toward a vote > for > > > > > > > acceptance- > > > > > > > > are there any other preliminaries which I've neglected? > > > > > > > > > > > > > > > > To reiterate the answers to some past questions: > > > > > > > > - Benchmarks are added in the C++ PR [1] to demonstrate the > > > > > performance > > > > > > > of > > > > > > > > conversion between the various string formats. In addition, > > > there > > > > > are > > > > > > > > some benchmarks which demonstrate the performance gains > > > available > > > > > > with > > > > > > > > the new format [3]. > > > > > > > > - Adding string view to the C ABI is a natural follow up, but > > > > should > > > > > be > > > > > > > > handled independently. An issue has been added to track > that > > > > > > > > enhancement [4]. > > > > > > > > > > > > > > > > Sincerely, > > > > > > > > Ben Kietzman > > > > > > > > > > > > > > > > [1] https://github.com/apache/arrow/pull/35628 > > > > > > > > [2] https://github.com/apache/arrow/pull/35769 > > > > > > > > [3] > > > > > https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 > > > > > > > > [4] https://github.com/apache/arrow/issues/36099 > > > > > > > > > > > > > > > > On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman < > > > > > > bengil...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > @Jacob > > > > > > > > > > You mention benchmarks multiple times, are these results > > > > > published > > > > > > > > > somewhere? > > > > > > > > > > > > > > > > > > I benchmarked the performance of raw pointer vs index > offset > > > > views > > > > > in > > > > > > > my > > > > > > > > > PR to velox, > > > > > > > > > I do intend to port them to my arrow PR but I haven't > gotten > > > > there > > > > > > yet. > > > > > > > > > Furthermore, it > > > > > > > > > seemed less urgent to me since coexistence of the two types > > in > > > > the > > > > > > c++ > > > > > > > > > implementation > > > > > > > > > defers the question of how aggressively one should be > > preferred > > > > > over > > > > > > > the > > > > > > > > > other. > > > > > > > > > > > > > > > > > > @Dewey > > > > > > > > > > I don't see the C Data interface in the PR > > > > > > > > > > > > > > > > > > I have not addressed the C ABI in this PR. As you mention, > it > > > may > > > > > be > > > > > > > > > useful to transmit > > > > > > > > > arrays with raw pointer views between implementations which > > > allow > > > > > > > them. I > > > > > > > > > can address > > > > > > > > > this in a follow up PR. > > > > > > > > > > > > > > > > > > @Will > > > > > > > > > > If I understand correctly, multiple arrays can reference > > the > > > > same > > > > > > > > buffers > > > > > > > > > > in memory, but once they are written to IPC their data > > > buffers > > > > > will > > > > > > > be > > > > > > > > > > duplicated. Is that right? > > > > > > > > > The buffers will be duplicated. If buffer duplication is > > > becomes > > > > a > > > > > > > > > concern, I'd prefer to handle > > > > > > > > > that in the ipc writer. Then buffers which are duplicated > > could > > > > be > > > > > > > > > detected by checking > > > > > > > > > pointer identity and written only once. > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 17, 2023 at 12:07 AM Will Jones < > > > > > will.jones...@gmail.com > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > >> Hello Ben, > > > > > > > > >> > > > > > > > > >> Thanks for your work on this. I think this will be an > > > excellent > > > > > > > addition > > > > > > > > >> to > > > > > > > > >> the format. > > > > > > > > >> > > > > > > > > >> If I understand correctly, multiple arrays can reference > the > > > > same > > > > > > > > buffers > > > > > > > > >> in memory, but once they are written to IPC their data > > buffers > > > > > will > > > > > > be > > > > > > > > >> duplicated. Is that right? > > > > > > > > >> > > > > > > > > >> Dictionary types have a special message so they can be > > reused > > > > > across > > > > > > > > >> batches and even fields. Did we consider adding a similar > > > > message > > > > > > for > > > > > > > > >> string view buffers? > > > > > > > > >> > > > > > > > > >> One relevant use case I'm curious about is substring > > > extraction. > > > > > For > > > > > > > > >> example, if I have a column of URIs and I create columns > > where > > > > > I've > > > > > > > > >> extracted substrings like the hostname, path, and a list > > > column > > > > of > > > > > > > query > > > > > > > > >> parameters, I'd like for those latter columns to be views > > into > > > > the > > > > > > URI > > > > > > > > >> buffers, rather than full copies. > > > > > > > > >> > > > > > > > > >> However, I've never touched the IPC read code paths, so > it's > > > > quite > > > > > > > > >> possible > > > > > > > > >> I'm overlooking something obvious. > > > > > > > > >> > > > > > > > > >> Best, > > > > > > > > >> > > > > > > > > >> Will Jones > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington > > > > > > > > >> <de...@voltrondata.com.invalid> wrote: > > > > > > > > >> > > > > > > > > >> > Very cool! > > > > > > > > >> > > > > > > > > > >> > In addition to performance mentioned above, I could see > > this > > > > > being > > > > > > > > >> > useful for the R bindings - we already have a global > > string > > > > pool > > > > > > > and a > > > > > > > > >> > mechanism for keeping a vector of them alive. > > > > > > > > >> > > > > > > > > > >> > I don't see the C Data interface in the PR although I > may > > > have > > > > > > > missed > > > > > > > > >> > it - is that a part of the proposal? It seems like it > > would > > > be > > > > > > > > >> > possible to use raw pointers as long as they can be > > > guaranteed > > > > > to > > > > > > be > > > > > > > > >> > valid until the release callback is called? > > > > > > > > >> > > > > > > > > > >> > On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak > > > > > > > > >> > <ja...@voltrondata.com.invalid> wrote: > > > > > > > > >> > > > > > > > > > > >> > > Hello Everyone, > > > > > > > > >> > > I think keeping interoperability with the large > > ecosystem > > > > is a > > > > > > > very > > > > > > > > >> > > important goal for arrow so I am overall in favor of > > this > > > > > > > proposal! > > > > > > > > >> > > > > > > > > > > >> > > You mention benchmarks multiple times, are these > results > > > > > > published > > > > > > > > >> > > somewhere? > > > > > > > > >> > > > > > > > > > > >> > > Thanks > > > > > > > > >> > > > > > > > > > > >> > > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman < > > > > > > > > >> bengil...@gmail.com> > > > > > > > > >> > > wrote: > > > > > > > > >> > > > > > > > > > > >> > > > Hello all, > > > > > > > > >> > > > > > > > > > > > >> > > > As previously discussed on this list [1], an > > > > > > > UmbraDB/DuckDB/Velox > > > > > > > > >> > > > compatible > > > > > > > > >> > > > "string view" type could bring several performance > > > > benefits > > > > > to > > > > > > > > >> access > > > > > > > > >> > and > > > > > > > > >> > > > authoring of string data in the arrow format [2]. > > > > > Additionally > > > > > > > > >> better > > > > > > > > >> > > > interoperability with engines already using this > > format > > > > > could > > > > > > be > > > > > > > > >> > > > established. > > > > > > > > >> > > > > > > > > > > > >> > > > PR #0 [3] adds Utf8View and BinaryView types to the > > C++ > > > > > > > > >> implementation > > > > > > > > >> > and > > > > > > > > >> > > > to > > > > > > > > >> > > > the IPC format. For the purposes of IPC raw pointers > > are > > > > not > > > > > > > used. > > > > > > > > >> > Instead, > > > > > > > > >> > > > each view contains a pair of 32 bit unsigned > integers > > > > which > > > > > > > encode > > > > > > > > >> the > > > > > > > > >> > > > index of > > > > > > > > >> > > > a character buffer (string view arrays may consist > of > > a > > > > > > variable > > > > > > > > >> > number of > > > > > > > > >> > > > such buffers) and the offset of a view's data within > > > that > > > > > > buffer > > > > > > > > >> > > > respectively. > > > > > > > > >> > > > Benefits of this substitution include: > > > > > > > > >> > > > - This makes explicit the guarantee that lifetime of > > all > > > > > > > character > > > > > > > > >> > data is > > > > > > > > >> > > > equal > > > > > > > > >> > > > to that of the array which views it, which is > > critical > > > > for > > > > > > > > >> confident > > > > > > > > >> > > > consumption across an interface boundary. > > > > > > > > >> > > > - As with other types in the arrow format, such > arrays > > > are > > > > > > > > >> > serializable and > > > > > > > > >> > > > venue agnostic; directly usable in shared memory > > > without > > > > > > > > >> > modification. > > > > > > > > >> > > > - Indices and offsets are easily validated. > > > > > > > > >> > > > > > > > > > > > >> > > > Accessing the data requires some trivial pointer > > > > arithmetic, > > > > > > but > > > > > > > > in > > > > > > > > >> > > > benchmarking > > > > > > > > >> > > > this had negligible impact on sequential access and > > only > > > > > minor > > > > > > > > >> impact > > > > > > > > >> > on > > > > > > > > >> > > > random > > > > > > > > >> > > > access. > > > > > > > > >> > > > > > > > > > > > >> > > > In the C++ implementation, raw pointer string views > > are > > > > > > > supported > > > > > > > > >> as an > > > > > > > > >> > > > extended > > > > > > > > >> > > > case of the Utf8View type: > > > > > > > `utf8_view(/*has_raw_pointers=*/true)`. > > > > > > > > >> > > > Branching on > > > > > > > > >> > > > this access pattern bit at the data type level has > > > > > negligible > > > > > > > > >> impact on > > > > > > > > >> > > > performance since the branch resides outside any hot > > > > loops. > > > > > > > > Utility > > > > > > > > >> > > > functions > > > > > > > > >> > > > are provided for efficient (potentially in-place) > > > > conversion > > > > > > > > between > > > > > > > > >> > raw > > > > > > > > >> > > > pointer > > > > > > > > >> > > > and index offset views. For example, the C++ > > > > implementation > > > > > > > could > > > > > > > > >> zero > > > > > > > > >> > copy > > > > > > > > >> > > > a raw pointer array from Velox, filter it, then > > convert > > > to > > > > > > > > >> > index/offset for > > > > > > > > >> > > > serialization. Other implementations may choose to > > > > > accommodate > > > > > > > or > > > > > > > > >> > eschew > > > > > > > > >> > > > raw > > > > > > > > >> > > > pointer views as their communities direct. > > > > > > > > >> > > > > > > > > > > > >> > > > Where desirous in a rigorously controlled context > this > > > > still > > > > > > > > enables > > > > > > > > >> > > > construction > > > > > > > > >> > > > and safe consumption of string view arrays which > > > reference > > > > > > > memory > > > > > > > > >> not > > > > > > > > >> > > > directly bound to the lifetime of the array. I'm not > > > sure > > > > > when > > > > > > > or > > > > > > > > >> if we > > > > > > > > >> > > > would > > > > > > > > >> > > > find it useful to have arrays like this; I do not > > > > introduce > > > > > > any > > > > > > > in > > > > > > > > >> > [3]. I > > > > > > > > >> > > > mention > > > > > > > > >> > > > this possibility to highlight that if benchmarking > > > > > > demonstrates > > > > > > > > that > > > > > > > > >> > such > > > > > > > > >> > > > an > > > > > > > > >> > > > approach brings a significant performance benefit to > > > some > > > > > > > > operation, > > > > > > > > >> > the > > > > > > > > >> > > > only > > > > > > > > >> > > > barrier to its adoption would be code review. > Likewise > > > if > > > > > more > > > > > > > > >> > intensive > > > > > > > > >> > > > benchmarking determines that raw pointer views > > > critically > > > > > > > > outperform > > > > > > > > >> > > > index/offset > > > > > > > > >> > > > views for real-world analytics tasks, prioritizing > raw > > > > > pointer > > > > > > > > >> string > > > > > > > > >> > views > > > > > > > > >> > > > for usage within the C++ implementation will be > > > > > > straightforward. > > > > > > > > >> > > > > > > > > > > > >> > > > See also the proposal to Velox that their string > view > > > > vector > > > > > > be > > > > > > > > >> > refactored > > > > > > > > >> > > > in a similar vein [4]. > > > > > > > > >> > > > > > > > > > > > >> > > > Sincerely, > > > > > > > > >> > > > Ben Kietzman > > > > > > > > >> > > > > > > > > > > > >> > > > [1] > > > > > > > > >> > > > > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > > > > > > > > >> > > > [2] > > > > > http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf > > > > > > > > >> > > > [3] https://github.com/apache/arrow/pull/35628 > > > > > > > > >> > > > [4] > > > > > > https://github.com/facebookincubator/velox/discussions/4362 > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >