Given the discussion on this thread, I think the best thing we could do is
1. Do not change the Arrow spec / C++ implementation (do not add raw
pointers)
2. Abandon the goal of "truly zero copy" interchange with Velox and DuckDB
as unobtainable
3. Focus our efforts as a community to drive the new a
Agreed, it's unfortunately not just a simple tradeoff. We have discussed
this a bit in [1] and in several other threads around this topic. If we say
that Arrow is about interchange and not execution, so we shouldn't adopt
the pointer version that DuckDB uses, that means we're also making
interchang
> I feel the broader question here is what is Arrow's intended use case -
> interchange or execution
The line between interchange and execution is not always clear. For
example, I think we would like Arrow to be considered as a standard for UDF
libraries.
On Fri, Oct 6, 2023 at 7:34 AM Mark Raasve
For the index vs pointer question - DuckDB went with pointers as they are more
flexible, and DuckDB was designed to consume data (and strings) from a wide
variety of formats in a wide variety of languages. Pointers allow us to easily
zero-copy from e.g. Python strings, R strings, Arrow strings,
Given I don't see any input from the DuckDB / Velox development team (this
discussion seems primarily Arrow developers) I have filed a ticket in
DuckDB requesting their consideration[1] and tried to bump the attention of
the existing ticket in Velox[2]. Perhaps their input will provide a way
forwar
On 03/10/2023 at 01:36, Matt Topol wrote:
The cost of conversion is actually significantly higher than the actual
overhead of simply accessing the values in either representation, leading
to a high potential for bottleneck. For systems like Velox and DuckDB where
it's important to be able to
Given the benchmarks that Ben provided, I think I still have one concern if
we only support the offset-based representation:
@Raphael:
> Conversion between the two view representations is relatively fast,
> especially for small strings
I think this is a bit of an oversimplification given Ben's ass
Thank you for this, my major takeaways are:
- The performance characteristics of the two view representations are
broadly equivalent, with an extremely minor edge to the pointer
representation
- Both view types represent a significant performance win over
converting to the non-view representat
@Antoine
> By construction, a standard cannot continuously chase the performance
> state of the art, it has to weigh the benefits of performance improvements
> against the increased cost for the ecosystem.
> We have extension types which could reasonably be used for non-standard
> data types, especial
> I don't think "we have to adjust the Arrow format so that existing
> internal representations become Arrow-compliant without any
> (re-)implementation effort" is a reasonable design principle.
I agree with this statement from Antoine -- given the Arrow community has
standardized an addition to t
Oh I'm with you on it being a precedent we want to be very careful about
setting, but if there isn't a meaningful performance difference, we may
be able to sidestep that discussion entirely.
On 02/10/2023 14:11, Antoine Pitrou wrote:
Even if performance were significantly better, I don't think it's a good
enough reason to add these representations to Arrow. By construction, a
standard cannot continuously chase the performance state of the art, it has
to weigh the benefits of performance improvements against the increased
cost
I think what would really help would be some concrete numbers, do we
have any numbers comparing the performance of the offset and pointer
based representations? If there isn't a significant performance
difference between them, would the systems that currently use a
pointer-based approach be wil
On 01/10/2023 at 16:21, Micah Kornfield wrote:
I would also assert that another way to reduce this risk is to add
some prose to the relevant sections of the columnar format
specification doc to clearly explain that a raw pointers variant of
the layout, while not part of the official spec, ma
>
> I would also assert that another way to reduce this risk is to add
> some prose to the relevant sections of the columnar format
> specification doc to clearly explain that a raw pointers variant of
> the layout, while not part of the official spec, may be implemented in
> some Arrow libraries.
I strongly agree with Ben's assertion that "the risk of a parallel
ecosystem… is more likely to be provoked by excluding a user's vital
use case [than by implementing support for an unofficial layout
variant]" in the C++ library. But there seems to be a consensus here
that there is a real risk of s
My take here is that Ben did an excellent job in hiding the fact that C++
has two variations of the format without leaking the pointer version via
the interfaces through which Arrow arrays are communicated to other
implementations.
As things stand right now, there is no zero-copy transfer of point
@Wes
3. Implement the raw pointer variant as an extension type in C++ / C ABI.
@Andrew
1. Update the standard to allow raw pointers
If adding raw pointers to the C ABI is a satisfactory
compromise, then I'd be happy to draft a PR adding it. To me this seems
to cover the bases of accommodating ei
> What this PR is creating is an "unofficial" Arrow format, with data
> types exposed in Arrow C++ that are not part of the Arrow standard, but
> are exposed as if they were.
I agree with Antoine here. It seems a pretty clear-cut case: the C++
implementation doesn't follow the spec and thus we shou
FWIW Rust wouldn't have issues using raw pointers, I can't speak for other
languages though. They would be more expensive to validate, but validation is
not going to be cheap regardless.
I could definitely see a world where view types use pointers and IPC coerces
to/from the large non-view type
hi all,
I'm just catching up on this thread after having taken a look at the format
PRs, the C++ implementation PR, and this e-mail thread. So only my $0.02
from having spent a great deal less time on this project than others.
The original motivation I had for bringing up the idea of adding the
S
To make things clear, any of the factory functions listed below create a
type that maps exactly onto an Arrow columnar layout:
https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions
For example, calling `arrow::dictionary` creates a dictionary type that
exactly represents
Hi Ben,
On 27/09/2023 at 23:25, Benjamin Kietzman wrote:
@Antoine
What this PR is creating is an "unofficial" Arrow format, with data
types exposed in Arrow C++ that are not part of the Arrow standard, but
are exposed as if they were.
We already do this in every implementation of the arr
Out of curiosity, do any of the hash kernels using raw pointers address your question, Raphael? I haven't looked at the original PR, so I am likely missing context. My impression is that there is pushback against kernels that use StringView with a raw pointer type? Since the hashing functions con
Do you have any benchmarks comparing kernels with native pointer array support,
compared to those that must first convert to the offset representation? I think
this would help ground this discussion empirically.
On 27 September 2023 22:25:02 BST, Benjamin Kietzman
wrote:
>Hello all,
>
>@Gang
>
Hello all,
@Gang
> Could you please simply describe the layout of DuckDB and Velox
Arrow represents long (>12 bytes) strings with a view which includes
a buffer index (used to look up one of the variadic data buffers)
and an offset (used to find the start of a string's bytes within the
indicated
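
As a rough illustration of the index/offset view described above, here is a minimal Python sketch that packs and reads 16-byte views. It assumes little-endian byte order and the field order used by the Utf8View layout (length, then either up to 12 inline bytes, or a 4-byte prefix followed by a buffer index and an offset). The helper names `make_view` and `read_view` are hypothetical and are not part of any Arrow library.

```python
import struct

INLINE_MAX = 12  # strings of up to 12 bytes live inside the view itself


def make_view(value, buffer_index, offset):
    """Pack one 16-byte index/offset view for `value` (hypothetical helper)."""
    if len(value) <= INLINE_MAX:
        # int32 length, then the bytes themselves, zero-padded to 12 bytes
        return struct.pack("<i12s", len(value), value)
    # int32 length, 4-byte prefix, int32 buffer index, int32 offset
    return struct.pack("<i4sii", len(value), value[:4], buffer_index, offset)


def read_view(view, buffers):
    """Resolve a view back to its bytes, given the variadic data buffers."""
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= INLINE_MAX:
        return view[4:4 + length]
    _prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    return buffers[buffer_index][offset:offset + length]


# Two data buffers; the long string sits at offset 0 of buffer 1.
buffers = [b"", b"a string longer than twelve bytes"]
short_view = make_view(b"hello", 0, 0)
long_view = make_view(buffers[1], 1, 0)
```

Both views are exactly 16 bytes, so a column of views is a fixed-width array regardless of string length; only the long case needs a lookup into a data buffer.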
Hello,
What this PR is creating is an "unofficial" Arrow format, with data
types exposed in Arrow C++ that are not part of the Arrow standard, but
are exposed as if they were. Most users will probably not read the
official format spec, but will simply trust the official Arrow
implementation
Could you please simply describe the layout of DuckDB and Velox
so we can know what kind of conversion is required from the raw
pointer variant? If any engine simply represents a string array in the
form of something like std::vector, should we
provide a similar variant in C++ to minimize the convers
I'm confused why this would need to copy string data, assuming the pointers are
into defined memory regions, something necessary for the C data interface's
ownership semantics regardless, why can't these memory regions just be used as
buffers as is? This would therefore require just rewriting th
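
The idea in the message above, rewriting only the views while reusing the existing memory regions as data buffers, might look roughly like the following Python sketch. It is an assumption-laden illustration, not the conversion used by Arrow, DuckDB, or Velox: addresses are simulated as plain integers, the pointer view is assumed to be length, 4-byte prefix, then a 64-bit pointer, and `pointer_views_to_index_views` and `regions` are hypothetical names.

```python
import bisect
import struct

INLINE_MAX = 12  # inline views have the same bytes in both representations


def pointer_views_to_index_views(views, regions):
    """Rewrite pointer views as index/offset views without touching string data.

    `regions` is a list of (base_address, size) pairs, sorted by base address
    and non-overlapping; each region becomes one data buffer (assumption).
    """
    bases = [base for base, _size in regions]
    out = []
    for view in views:
        (length,) = struct.unpack_from("<i", view, 0)
        if length <= INLINE_MAX:
            out.append(view)  # nothing to rewrite for inline strings
            continue
        prefix, pointer = struct.unpack_from("<4sQ", view, 4)
        # Find the last region whose base address is <= pointer.
        index = bisect.bisect_right(bases, pointer) - 1
        if index < 0:
            raise ValueError("pointer precedes every known region")
        base, size = regions[index]
        if pointer + length > base + size:
            raise ValueError("string is not contained in a known region")
        out.append(struct.pack("<i4sii", length, prefix, index, pointer - base))
    return out


# Simulated example: two 64-byte regions and one 20-byte string in the second.
regions = [(0x1000, 64), (0x2000, 64)]
pointer_view = struct.pack("<i4sQ", 20, b"abcd", 0x2010)
inline_view = struct.pack("<i12s", 3, b"abc")
converted = pointer_views_to_index_views([pointer_view, inline_view], regions)
```

The cost here is one 16-byte rewrite and one region lookup per long string; the string bytes themselves are never copied. Whether this is feasible in practice depends on the pointers actually falling inside a small, known set of regions, which is the assumption the message makes.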
I believe the motivation is to avoid the cost of the data copy that would
have to happen to convert from a pointer-based to an offset-based representation.
Allowing the pointer-based implementation will ensure that we can maintain
zero-copy communication with both DuckDB and Velox in a common workflow
scena
Hi,
Is the motivation here to avoid DuckDB and Velox having to duplicate the
conversion logic from pointer-based to offset-based, or to allow
arrow-cpp to operate directly on pointer-based arrays?
If it is the former, I personally wouldn't have thought the conversion
logic sufficiently compl
Hello all,
In the PR to add support for Utf8View to the C++ implementation,
I've taken the approach of allowing raw pointer views [1] alongside the
index/offset views described in the spec [2]. This was done to ease
communication with other engines such as DuckDB and Velox whose native
string repr