Re: [DISCUSS][C++] Raw pointer string views

2023-10-07 Thread Andrew Lamb
Given the discussion on this thread, I think the best thing we could do is 1. Do not change the Arrow spec / C++ implementation (do not add raw pointers) 2. Abandon the goal of "truly zero copy" interchange with Velox and DuckDB as unobtainable 3. Focus our efforts as a community to drive the new a

Re: [DISCUSS][C++] Raw pointer string views

2023-10-06 Thread Neal Richardson
Agreed, it's unfortunately not just a simple tradeoff. We have discussed this a bit in [1] and in several other threads around this topic. If we say that Arrow is about interchange and not execution, so we shouldn't adopt the pointer version that DuckDB uses, that means we're also making interchang

Re: [DISCUSS][C++] Raw pointer string views

2023-10-06 Thread Weston Pace
> I feel the broader question here is what is Arrow's intended use case - interchange or execution
The line between interchange and execution is not always clear. For example, I think we would like Arrow to be considered as a standard for UDF libraries. On Fri, Oct 6, 2023 at 7:34 AM Mark Raasve

Re: [DISCUSS][C++] Raw pointer string views

2023-10-06 Thread Mark Raasveldt
For the index vs pointer question - DuckDB went with pointers as they are more flexible, and DuckDB was designed to consume data (and strings) from a wide variety of formats in a wide variety of languages. Pointers allow us to easily zero-copy from e.g. Python strings, R strings, Arrow strings,

Re: [DISCUSS][C++] Raw pointer string views

2023-10-06 Thread Andrew Lamb
Given that I don't see any input from the DuckDB / Velox development team (this discussion seems to involve primarily Arrow developers), I have filed a ticket in DuckDB requesting their consideration[1] and tried to bump the attention of the existing ticket in Velox[2]. Perhaps their input will provide a way forwar

Re: [DISCUSS][C++] Raw pointer string views

2023-10-03 Thread Antoine Pitrou
On 03/10/2023 at 01:36, Matt Topol wrote: The cost of conversion is actually significantly higher than the actual overhead of simply accessing the values in either representation, leading to a high potential for bottleneck. For systems like Velox and DuckDB where it's important to be able to

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Matt Topol
Given the benchmarks that Ben provided, I think I still have one concern if we only support the offset-based representation:
@Raphael:
> Conversion between the two view representations is relatively fast, especially for small strings
I think this is a bit of an oversimplification given Ben's ass

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Raphael Taylor-Davies
Thank you for this, my major takeaways are:
- The performance characteristics of the two view representations are broadly equivalent, with an extremely minor edge to the pointer representation
- Both view types represent a significant performance win over converting to the non-view representat

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Benjamin Kietzman
@Antoine
> By construction, a standard cannot continuously chase the performance state of the art, it has to weigh the benefits of performance improvements against the increased cost for the ecosystem.
> We have extension types which could reasonably be used for non-standard data types, especial

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Andrew Lamb
> I don't think "we have to adjust the Arrow format so that existing internal representations become Arrow-compliant without any (re-)implementation effort" is a reasonable design principle.
I agree with this statement from Antoine -- given the Arrow community has standardized an addition to t

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Raphael Taylor-Davies
Oh I'm with you on it being a precedent we want to be very careful about setting, but if there isn't a meaningful performance difference, we may be able to sidestep that discussion entirely. On 02/10/2023 14:11, Antoine Pitrou wrote: Even if performance were significantly better, I don't think

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Antoine Pitrou
Even if performance were significantly better, I don't think it's a good enough reason to add these representations to Arrow. By construction, a standard cannot continuously chase the performance state of the art, it has to weigh the benefits of performance improvements against the increased cost

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Raphael Taylor-Davies
I think what would really help would be some concrete numbers: do we have any numbers comparing the performance of the offset and pointer based representations? If there isn't a significant performance difference between them, would the systems that currently use a pointer-based approach be wil

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Antoine Pitrou
On 01/10/2023 at 16:21, Micah Kornfield wrote: I would also assert that another way to reduce this risk is to add some prose to the relevant sections of the columnar format specification doc to clearly explain that a raw pointers variant of the layout, while not part of the official spec, ma

Re: [DISCUSS][C++] Raw pointer string views

2023-10-01 Thread Micah Kornfield
> I would also assert that another way to reduce this risk is to add some prose to the relevant sections of the columnar format specification doc to clearly explain that a raw pointers variant of the layout, while not part of the official spec, may be implemented in some Arrow libraries.

Re: [DISCUSS][C++] Raw pointer string views

2023-09-29 Thread Ian Cook
I strongly agree with Ben's assertion that "the risk of a parallel ecosystem… is more likely to be provoked by excluding a user's vital use case [than by implementing support for an unofficial layout variant]" in the C++ library. But there seems to be a consensus here that there is a real risk of s

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Felipe Oliveira Carvalho
My take here is that Ben did an excellent job in hiding the fact that C++ has two variations of the format without leaking the pointer version via the interfaces through which Arrow arrays are communicated to other implementations. As things stand right now, there is no zero-copy transfer of point

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Benjamin Kietzman
@Wes
3. Implement the raw pointer variant as an extension type in C++ / C ABI.
@Andrew
1. Update the standard to allow raw pointers
If adding raw pointers to the C ABI is a satisfactory compromise, then I'd be happy to draft a PR adding it. To me this seems to cover the bases of accommodating ei

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Andrew Lamb
> What this PR is creating is an "unofficial" Arrow format, with data types exposed in Arrow C++ that are not part of the Arrow standard, but are exposed as if they were.
I agree with Antoine here. It seems a pretty clear-cut story: the C++ implementation doesn't follow the spec and thus we shou

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Raphael Taylor-Davies
FWIW Rust wouldn't have issues using raw pointers, I can't speak for other languages though. They would be more expensive to validate, but validation is not going to be cheap regardless. I could definitely see a world where view types use pointers and IPC coerces to/from the large non-view type

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Wes McKinney
hi all, I'm just catching up on this thread after having taken a look at the format PRs, the C++ implementation PR, and this e-mail thread. So only my $0.02 from having spent a great deal less time on this project than others. The original motivation I had for bringing up the idea of adding the S

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Antoine Pitrou
To make things clear, any of the factory functions listed below create a type that maps exactly onto an Arrow columnar layout: https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions For example, calling `arrow::dictionary` creates a dictionary type that exactly represents

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Antoine Pitrou
Hi Ben, On 27/09/2023 at 23:25, Benjamin Kietzman wrote: @Antoine What this PR is creating is an "unofficial" Arrow format, with data types exposed in Arrow C++ that are not part of the Arrow standard, but are exposed as if they were. We already do this in every implementation of the arr

Re: [DISCUSS][C++] Raw pointer string views

2023-09-27 Thread Aldrin
Out of curiosity, do any of the hash kernels using raw pointers address your question, Raphael? I haven't looked at the original PR so I am likely missing context. My impression is that there is pushback against kernels that use StringView with a raw pointer type? Since the hashing functions con

Re: [DISCUSS][C++] Raw pointer string views

2023-09-27 Thread Raphael Taylor-Davies
Do you have any benchmarks comparing kernels with native pointer array support, compared to those that must first convert to the offset representation? I think this would help ground this discussion empirically. On 27 September 2023 22:25:02 BST, Benjamin Kietzman wrote: >Hello all, > >@Gang >

Re: [DISCUSS][C++] Raw pointer string views

2023-09-27 Thread Benjamin Kietzman
Hello all,
@Gang
> Could you please simply describe the layout of DuckDB and Velox
Arrow represents long (>12 bytes) strings with a view which includes a buffer index (used to look up one of the variadic data buffers) and an offset (used to find the start of a string's bytes within the indicated
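
[Editor's sketch] To make the two layouts under discussion concrete, here is a minimal C++ sketch of a long-string view in both forms. All type and function names are hypothetical and are not Arrow C++ API. The 16-byte view with a 4-byte prefix follows the Arrow columnar format's view layout; the raw pointer struct is only an assumed approximation of the DuckDB / Velox style representation described in this thread. Inline (short) strings are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical illustration types; neither is part of the Arrow C++ API.

// Index/offset view for a long (>12 byte) string, as in the Arrow spec.
struct IndexOffsetView {
  std::int32_t size;          // length of the string in bytes
  char prefix[4];             // first four bytes of the string
  std::int32_t buffer_index;  // which variadic data buffer holds the bytes
  std::int32_t offset;        // where the string starts inside that buffer
};
static_assert(sizeof(IndexOffsetView) == 16, "views are 16 bytes wide");

// Raw pointer view (assumed shape): buffer index and offset are replaced
// by one address. This is 16 bytes only on typical 64-bit platforms.
struct RawPointerView {
  std::int32_t size;
  char prefix[4];
  const char* data;  // points directly at the string's bytes
};

// Build an index/offset view over bytes already stored in `buffers`.
// Assumes size >= 4, which holds for every long string.
inline IndexOffsetView MakeIndexOffsetView(
    const std::vector<std::string>& buffers, std::int32_t buffer_index,
    std::int32_t offset, std::int32_t size) {
  IndexOffsetView v{};
  v.size = size;
  std::memcpy(v.prefix, buffers[buffer_index].data() + offset,
              sizeof(v.prefix));
  v.buffer_index = buffer_index;
  v.offset = offset;
  return v;
}

// Resolving an index/offset view needs the array's data buffers.
inline std::string_view Resolve(const IndexOffsetView& v,
                                const std::vector<std::string>& buffers) {
  return std::string_view(buffers[v.buffer_index].data() + v.offset,
                          static_cast<std::size_t>(v.size));
}

// Resolving a raw pointer view needs nothing but the view itself.
inline std::string_view Resolve(const RawPointerView& v) {
  return std::string_view(v.data, static_cast<std::size_t>(v.size));
}

// Index/offset -> raw pointer: rewrites the view, copies no string bytes.
inline RawPointerView ToRawPointer(const IndexOffsetView& v,
                                   const std::vector<std::string>& buffers) {
  RawPointerView out{};
  out.size = v.size;
  std::memcpy(out.prefix, v.prefix, sizeof(out.prefix));
  out.data = buffers[v.buffer_index].data() + v.offset;
  return out;
}
```

The visible difference is that the index/offset form can only be resolved together with its buffers, while the pointer form stands alone but is meaningful only inside one process's address space.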

Re: [DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Antoine Pitrou
Hello, What this PR is creating is an "unofficial" Arrow format, with data types exposed in Arrow C++ that are not part of the Arrow standard, but are exposed as if they were. Most users will probably not read the official format spec, but will simply trust the official Arrow implementation

Re: [DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Gang Wu
Could you please simply describe the layout of DuckDB and Velox so we can know what kind of conversion is required from the raw pointer variant? If any engine simply represents string array in the form of something like std::vector, should we provide a similar variant in C++ to minimize the convers

Re: [DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Raphael Taylor-Davies
I'm confused why this would need to copy string data. Assuming the pointers are into defined memory regions, something necessary for the C data interface's ownership semantics regardless, why can't these memory regions just be used as buffers as is? This would therefore require just rewriting th
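
[Editor's sketch] The idea in this message, reusing the pointed-to memory regions as data buffers and rewriting only the views, could look roughly like the following. This is a hedged illustration, not code from the PR: `Region`, both view structs and `RewriteView` are hypothetical names, and the sketch assumes the producer can enumerate the regions its pointers refer to.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration only (not Arrow C++ API).
struct Region {
  const char* begin;  // start of a memory region owned by the producer
  std::size_t size;   // length of the region in bytes
};

struct RawPointerView {
  std::int32_t size;
  char prefix[4];
  const char* data;
};

struct IndexOffsetView {
  std::int32_t size;
  char prefix[4];
  std::int32_t buffer_index;
  std::int32_t offset;
};

// Rewrite one raw pointer view as an index/offset view by locating the
// region that contains the string. No string bytes are copied; each region
// is simply reused as the data buffer with the same index.
// Returns std::nullopt when the pointer lies in none of the known regions.
inline std::optional<IndexOffsetView> RewriteView(
    const RawPointerView& v, const std::vector<Region>& regions) {
  // Compare addresses as integers so unrelated allocations can be ordered.
  const auto p = reinterpret_cast<std::uintptr_t>(v.data);
  const auto len = static_cast<std::size_t>(v.size);
  for (std::size_t i = 0; i < regions.size(); ++i) {
    const auto b = reinterpret_cast<std::uintptr_t>(regions[i].begin);
    if (p < b) continue;
    const std::size_t off = p - b;
    // The whole string must fit inside the region.
    if (off > regions[i].size || len > regions[i].size - off) continue;
    IndexOffsetView out{};
    out.size = v.size;
    std::memcpy(out.prefix, v.prefix, sizeof(out.prefix));
    out.buffer_index = static_cast<std::int32_t>(i);
    out.offset = static_cast<std::int32_t>(off);
    return out;
  }
  return std::nullopt;
}
```

The linear scan keeps the sketch short; with many regions, a sorted region list and a binary search would be the more likely choice. The sketch also ignores regions larger than the 32-bit offset can address.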

Re: [DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Matt Topol
I believe the motivation is to avoid the cost of the data copy that would have to happen to convert from a pointer-based to an offset-based scenario. Allowing the pointer-based implementation will ensure that we can maintain zero-copy communication with both DuckDB and Velox in a common workflow scena

Re: [DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Raphael Taylor-Davies
Hi, Is the motivation here to avoid DuckDB and Velox having to duplicate the conversion logic from pointer-based to offset-based, or to allow arrow-cpp to operate directly on pointer-based arrays? If it is the former, I personally wouldn't have thought the conversion logic sufficiently compl

[DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Benjamin Kietzman
Hello all, In the PR to add support for Utf8View to the C++ implementation, I've taken the approach of allowing raw pointer views [1] alongside the index/offset views described in the spec [2]. This was done to ease communication with other engines such as DuckDB and Velox whose native string repr