Re: [DISC] Improving Arrow's database support

David Li Thu, 25 Aug 2022 08:52:02 -0700

Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text 
that follows…)

These are the components:

- Core adbc.h header
- Driver manager for C/C++
- Flight SQL-based driver
- Postgres-based driver (WIP)
- SQLite-based driver (more of a testbed for me than an actual component - I 
don't think we'd actually distribute this)
- Java core interfaces
- Java driver manager
- Java JDBC-based driver
- Java Flight SQL-based driver
- Python driver manager

I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get 
moved to the main Arrow repo and distributed as part of the regular Arrow 
releases.

For the rest of the components: they could be packaged individually, but 
versioned and released together. Also, each C/C++ driver probably needs a 
corresponding Python package so Python users do not have to futz with shared 
library configurations. (See [1].) So for instance, installing PyArrow would 
also give you the Flight SQL driver, and `pip install adbc_postgres` would get 
you the Postgres-based driver.
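
To make the shared library point concrete, here is a minimal sketch (not actual packaging code) of what a per-driver Python package could do so that users never set a library path themselves. It assumes the package ships the driver's shared library next to its Python files; the helper names and the `adbc_postgres` library name are hypothetical and used only for illustration.

```python
import sys
from pathlib import Path


def bundled_driver_filename(driver_name: str, platform: str = sys.platform) -> str:
    """Return the conventional shared-library file name for a platform.

    Hypothetical helper: the naming follows the usual OS conventions, not
    anything specified by ADBC.
    """
    if platform.startswith("win"):
        return f"{driver_name}.dll"
    if platform == "darwin":
        return f"lib{driver_name}.dylib"
    return f"lib{driver_name}.so"


def bundled_driver_path(
    package_dir: Path, driver_name: str, platform: str = sys.platform
) -> Path:
    """Return where a package would look for the library it ships.

    Assumption: the shared library sits directly inside the package directory.
    """
    return package_dir / bundled_driver_filename(driver_name, platform)


if __name__ == "__main__":
    # A driver package could resolve this once at import time and hand the
    # result to the driver manager, instead of asking users to configure it.
    print(bundled_driver_path(Path(__file__).resolve().parent, "adbc_postgres"))
```

The idea is only that the package, rather than the user, knows where its library lives; how that path is actually passed to the driver manager is left open here.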

That would mean setting up separate CI, release, etc. (and eventually linking 
Crossbow & Conbench as well?). That does mean duplication of effort, but the 
trade-off is that we avoid bloating the main release process even further. However, 
I'd like to hear from those closer to the release process on this subject - if 
it would make people's lives easier, we could merge everything into one 
repo/process.

Integrations would be distributed as part of their respective packages (e.g. 
Arrow Dataset would optionally link to the driver manager). So the "part of 
Arrow 10.0.0" aspect means having a stable interface for adbc.h, and getting 
the Flight SQL drivers into the main repo.

[1]: https://github.com/apache/arrow-adbc/issues/53

On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
> On Fri, 19 Aug 2022 14:09:44 -0400
> "David Li" <lidav...@apache.org> wrote:
>> Since it's been a while, I'd like to give an update. There are also a few 
>> questions I have around distribution.
>> 
>> Currently:
>> - Supported in C, Java, and Python.
>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with 
>> a draft of a libpq (Postgres) driver (using nanoarrow).
>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>> - For Python, there are low-level bindings to the C API, and the DBAPI 
>> interface on top of that (+a few extension methods resembling 
>> DuckDB/Turbodbc).
>>  
>> There are drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd like 
>> to thank Hannes and Kirill for their comments, as well as Antoine, Dewey, 
>> and Matt here.)
>> 
>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm not 
>> sure how we would like to handle packaging and distribution. In particular, 
>> there are several sub-components for each language (the driver manager + the 
>> drivers), increasing the work. Any thoughts here?
>
> Sorry, forgot to answer here. But I think your question is too broadly
> formulated. It probably deserves a case-by-case discussion, IMHO.
>
>> I'm also wondering how we want to handle this in terms of specification - I 
>> assume we'd consider the core header file/Java interfaces a spec like the C 
>> Data Interface/Flight RPC, and vote on them/mirror them into the format/ 
>> directory?
>
> That sounds like the right way to me indeed.
>
> Regards
>
> Antoine.
