Hello List,

We at DuckDB are happy users of the Arrow C Data Interface: we use it both
to feed SQL queries and to return query results in Arrow format. It is
particularly appealing to us that the interface is merely a
(C) header file that we just ship with our source code [1]. Internally, our
implementation then constructs DuckDB internal vectors from the Arrow
format [2] or vice-versa [3].

As you can see from [2, 3], there is some complexity in getting the
conversion right, especially for more complex data types such as nested
and variable-length types (lists, strings). A lightweight, dependency-free
library to help construct those would certainly be appreciated. Validation
code would also help a lot: Arrow structures are very delicate, and one
wrong pointer can lead to disaster (which is then blamed on us), so a way
to verify the structures in said lightweight library would be very helpful.
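
To make the validation wish concrete, here is a minimal sketch of the kind
of check such a library might offer for a utf8 (format "u") array. The
struct layout follows the Arrow C Data Interface; the names
check_utf8_array and example_noop_release are hypothetical and do not
exist in DuckDB or Arrow. Because the interface carries no buffer sizes,
the sketch can only check internal consistency, not out-of-bounds
pointers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout as defined by the Arrow C Data Interface. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical no-op release callback, for illustration only.
 * A released array is marked by setting release to NULL. */
void example_noop_release(struct ArrowArray* array) { array->release = NULL; }

/* Hypothetical validator (an assumption, not an existing API).
 * Returns NULL if the checks pass, otherwise a static error message. */
const char* check_utf8_array(const struct ArrowArray* array, const char* format) {
  if (array == NULL || format == NULL) return "null argument";
  if (strcmp(format, "u") != 0) return "not a utf8 array";
  if (array->release == NULL) return "array already released";
  if (array->length < 0 || array->offset < 0) return "negative length or offset";
  /* null_count == -1 means "not yet computed". */
  if (array->null_count < -1 || array->null_count > array->length) return "bad null_count";
  /* utf8 layout: validity bitmap, int32 offsets, data bytes. */
  if (array->n_buffers != 3 || array->buffers == NULL) return "expected 3 buffers";
  if (array->n_children != 0) return "unexpected children";
  if (array->null_count > 0 && array->buffers[0] == NULL) return "missing validity bitmap";

  const int32_t* offsets = (const int32_t*)array->buffers[1];
  if (offsets == NULL) {
    /* Assumption: tolerate a missing offsets buffer for empty arrays. */
    return array->length == 0 ? NULL : "missing offsets buffer";
  }
  offsets += array->offset;
  if (offsets[0] < 0) return "negative first offset";
  for (int64_t i = 0; i < array->length; i++) {
    if (offsets[i + 1] < offsets[i]) return "offsets not monotonic";
  }
  if (offsets[array->length] > 0 && array->buffers[2] == NULL) return "missing data buffer";
  return NULL;
}
```

A consumer could call check_utf8_array(&array, schema.format) right after
import and refuse the input if the result is not NULL.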

Best from Amsterdam, and Quack

Hannes

[1]
https://github.com/duckdb/duckdb/blob/master/src/include/duckdb/common/arrow.hpp
[2]
https://github.com/duckdb/duckdb/blob/master/src/function/table/arrow.cpp
[3]
https://github.com/duckdb/duckdb/blob/master/src/common/types/data_chunk.cpp


On Fri, Jun 03, 2022 at 15:34:42, Jonathan Keane <jke...@gmail.com> wrote:

> cc Hannes Mühleisen from DuckDB Labs
>
> -Jon
>
>
> On Tue, May 31, 2022 at 5:03 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> I'm also supportive of having a small vendorable C/C++ "Arrow
> middleware" that provides:
>
> * Schemas and types
> * Columnar data structures and minimal APIs to build them and iterate over
> them
> * C data interface
> * Minimal validation (at the level of Validate but not ValidateFull)
>
> I don't think it's going to be practical to try to refactor parts of
> the existing Arrow C++ core to be vendorable since there are many
> features / requirements (e.g. an extensible buffer and device API)
> that these C++ classes include that aren't needed in this
> limited-feature middleware library.
>
> This also relates to the "Improving Arrow's database support" project
> that David Li raised some time ago [1]. If we want to encourage
> database driver libraries to add new APIs that emit the Arrow C
> interface, we need to make it easier to generate the C interface
> without requiring a new library dependency.
>
> [1]: https://lists.apache.org/thread/gnz1kz2rj3rb8rh8qz7l0mv8lvzq254w
>
> On Mon, May 30, 2022 at 11:31 AM Jonathan Keane <jke...@gmail.com> wrote:
> >
> > Thanks for working on this. I've heard people asking about something
> > like this from a number of different fronts on top of the obvious use
> > case in geoarrow | other geospatial libraries. I think a minimal piece
> > of Arrow that other packages could depend on without needing to bring
> > in all of arrow would be super valuable in building the bridges we
> > want across other systems.
> >
> > Do you have any (design) documentation that describes the scope of
> > what you're thinking? I know there have been others floating around
> > [1] [2] that were in a similar spirit.
> >
> > A few more questions I hope will spark more conversation: How do the
> > header files you linked in [3] overlap with these other efforts? Are
> > those headers something we could|should "just" PR into apache/arrow
> > and write up how to use them? If not, what work would be needed to
> > make that possible (the answer could of course be to design something
> > else entirely and PR that!)?
> >
> > [1] https://github.com/paleolimbot/narrow
> > [2] https://paleolimbot.github.io/narrow/articles/why-narrow.html
> > [3] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
> >
> > -Jon
> >
> >
> > On Wed, May 25, 2022 at 9:29 AM Dewey Dunnington <de...@voltrondata.com> wrote:
> > >
> > > I'm writing to gauge interest in a set of helpers in C and/or C++ for
> > > reading/exporting Arrow C Data interface structures. My use-case is
> > > building Arrow geospatial support in R [1], and while the set of
> > > helpers I've been using [2] has served the purpose of me writing about
> > > the opportunities for Arrow + geospatial [3], I would like to rewrite
> > > the prototype based on something developed by/with the Arrow community.
> > >
> > > Does a set of C/C++ helpers for Arrow C Data interface structures
> > > already exist? *Should* it exist?
> > >
> > > If it doesn't, what should the name/scope of that library be? The names
> > > 'nanoarrow', 'narrow', 'sparrow', and 'arrow-hpp' have all surfaced in
> > > my limited discussion of this so far. For the purpose of starting the
> > > discussion, I'll posit that the library should include helpers to
> > > allocate/destroy C Data interface structures, a schema metadata
> > > encoder/decoder, validation of a schema/array pair, and something like
> > > the ArrayBuilder C++ class.
> > >
> > > [1] https://lists.apache.org/thread/yb7p9wpg3k128njskhwj9j788opb67g7
> > > [2] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
> > > [3] https://docs.google.com/document/d/1A6e3XCerjhXVFHBDaoAlBBNFb2HG4RB9SVRpuBru7E4/edit?usp=sharing
>
>
