> I feel the broader question here is what is Arrow's intended use case - interchange or execution
The line between interchange and execution is not always clear. For
example, I think we would like Arrow to be considered as a standard for
UDF libraries.

On Fri, Oct 6, 2023 at 7:34 AM Mark Raasveldt <m...@duckdblabs.com> wrote:

> For the index vs pointer question - DuckDB went with pointers as they
> are more flexible, and DuckDB was designed to consume data (and
> strings) from a wide variety of formats in a wide variety of
> languages. Pointers allow us to easily zero-copy from e.g. Python
> strings, R strings, Arrow strings, etc. The flip side of pointers is
> that ownership has to be handled very carefully. Our vector format is
> an execution-only format, and never leaves the internals of the
> engine. This greatly simplifies ownership as we are in complete
> control of what happens inside the engine. For an interchange format
> that is intended for handing data between engines, I can see this
> being more complicated and having verification being more important.
>
> As for the actual change:
>
> From an interchange perspective from DuckDB's side - the proposed
> zero-copy integration would definitely speed up the conversion of
> DuckDB string vectors to Arrow string vectors. In a recent benchmark
> that we have performed we have found string conversion to Arrow
> vectors to be a bottleneck in certain workloads, although we have not
> sufficiently researched if this could be improved in other ways. It is
> possible this can be alleviated without requiring changes to Arrow.
>
> However - in general, a new string vector format is only useful if
> consumers also support the format. If the consumer immediately
> converts the strings back into the standard Arrow string
> representation then there is no benefit. The change will only move
> where the conversion happens (from inside DuckDB to inside the
> consumer). As such, this change is only useful if the broader Arrow
> ecosystem moves towards supporting the new string format.
>
> From an execution perspective from DuckDB's side - it is unlikely that
> we will switch to using Arrow as an internal format at this stage of
> the project. While this change increases Arrow's utility as an
> intermediate execution format, that is more relevant to projects that
> are currently using Arrow in this manner or are planning to use Arrow
> in this manner.
>
> I feel the broader question here is what is Arrow's intended use case
> - interchange or execution - as they are opposed in this discussion.
> This change improves Arrow's utility as an execution format at the
> expense of more stability in the interchange format. From my
> perspective Arrow is more useful as an interchange format. When
> different tools communicate with each other a standard is required. An
> execution format is generally not exposed outside of the internals of
> the execution engine. Engines can do whatever they want here - and a
> standard is perhaps not as useful.
>
> On 2023/10/02 13:21:59 Andrew Lamb wrote:
> > > I don't think "we have to adjust the Arrow format so that existing
> > > internal representations become Arrow-compliant without any
> > > (re-)implementation effort" is a reasonable design principle.
> >
> > I agree with this statement from Antoine -- given the Arrow
> > community has standardized an addition to the format with
> > StringView, I think it would help to get some input from those at
> > DuckDB and Velox on their perspective
> >
> > Andrew
> >
> > On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
> > <r....@googlemail.com.invalid> wrote:
> >
> > > Oh I'm with you on it being a precedent we want to be very careful
> > > about setting, but if there isn't a meaningful performance
> > > difference, we may be able to sidestep that discussion entirely.
> > >
> > > On 02/10/2023 14:11, Antoine Pitrou wrote:
> > > >
> > > > Even if performance were significantly better, I don't think
> > > > it's a good enough reason to add these representations to Arrow.
> > > > By construction, a standard cannot continuously chase the
> > > > performance state of the art, it has to weigh the benefits of
> > > > performance improvements against the increased cost for the
> > > > ecosystem (for example the cost of adapting to frequent standard
> > > > changes and a growing standard size).
> > > >
> > > > We have extension types which could reasonably be used for
> > > > non-standard data types, especially the kind that are motivated
> > > > by leading-edge performance research and innovation and come
> > > > with unusual constraints (such as requiring trusting and
> > > > dereferencing raw pointers embedded in data buffers). There
> > > > could even be an argument for making some of them canonical
> > > > extension types if there's enough anteriority in favor.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On 02/10/2023 at 15:00, Raphael Taylor-Davies wrote:
> > > >> I think what would really help would be some concrete numbers,
> > > >> do we have any numbers comparing the performance of the offset
> > > >> and pointer based representations? If there isn't a significant
> > > >> performance difference between them, would the systems that
> > > >> currently use a pointer-based approach be willing to meet us in
> > > >> the middle and switch to an offset based encoding? This to me
> > > >> feels like it would be the best outcome for the ecosystem as a
> > > >> whole.
> > > >>
> > > >> Kind Regards,
> > > >>
> > > >> Raphael
> > > >>
> > > >> On 02/10/2023 13:50, Antoine Pitrou wrote:
> > > >>>
> > > >>> On 01/10/2023 at 16:21, Micah Kornfield wrote:
> > > >>>>>
> > > >>>>> I would also assert that another way to reduce this risk is
> > > >>>>> to add some prose to the relevant sections of the columnar
> > > >>>>> format specification doc to clearly explain that a raw
> > > >>>>> pointers variant of the layout, while not part of the
> > > >>>>> official spec, may be implemented in some Arrow libraries.
> > > >>>>
> > > >>>> I've lost a little context but on all the concerns of adding
> > > >>>> raw pointers as an official option to the spec. But I see
> > > >>>> making raw-pointer variants the best path forward.
> > > >>>>
> > > >>>> Things captured from this thread or seem obvious at least to
> > > >>>> me:
> > > >>>> 1. Divergence of IPC spec from in-memory/C-ABI spec?
> > > >>>> 2. More parts of the spec to cover.
> > > >>>> 3. Incompatibility with some languages
> > > >>>> 4. Validation (in my mind different use-cases require
> > > >>>> different levels of validation, so this is a little bit less
> > > >>>> of a concern in my mind).
> > > >>>>
> > > >>>> I think the broader issue is how we think about compatibility
> > > >>>> with other systems. For instance, what happens if Velox and
> > > >>>> DuckDB start adding new divergent memory layouts? Are we
> > > >>>> expecting to add them to the spec?
> > > >>>
> > > >>> This is a slippery slope. The more Arrow has a policy of
> > > >>> integrating existing practices simply because they exist, the
> > > >>> more the Arrow format will become _à la carte_, with different
> > > >>> implementations choosing to implement whatever they want to
> > > >>> spend their engineering effort on (you can see this occur, in
> > > >>> part, on the Parquet format with its many different encodings,
> > > >>> compression algorithms and a 96-bit timestamp type).
> > > >>>
> > > >>> We _have_ to think carefully about the middle- and long-term
> > > >>> future of the format when adopting new features.
> > > >>>
> > > >>> In this instance, we are doing a large part of the effort by
> > > >>> adopting a string view format with variadic buffers, inlined
> > > >>> prefixes and offset-based views into those buffers. But some
> > > >>> implementations with historically different internal
> > > >>> representations will have to share part of the effort to align
> > > >>> with the newly standardized format.
> > > >>>
> > > >>> I don't think "we have to adjust the Arrow format so that
> > > >>> existing internal representations become Arrow-compliant
> > > >>> without any (re-)implementation effort" is a reasonable design
> > > >>> principle.
> > > >>>
> > > >>> Regards
> > > >>>
> > > >>> Antoine.