Re: [DISC] Improving Arrow's database support

David Li Sat, 27 Aug 2022 12:30:33 -0700

I would be very happy to see GLib/Ruby bindings! I'm curious if you have a 
particular use case in mind.


There's a little bit more API cleanup to do [1]. If you have comments on that 
or anything else, I'd appreciate them. Otherwise, pull requests would also be 
appreciated.

[1]: https://github.com/apache/arrow-adbc/issues/79

On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
> Hi,
>
> Thanks for sharing the current status!
> I understand.
>
> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
> before we release the first version? (I want to use ADBC
> from Ruby.) Or should I wait for the first release? If I can
> work on it now, I'll open pull requests for it.
>
> Thanks,
> -- 
> kou
>
> In <[email protected]>
>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022 11:03:26 -0400,
>   "David Li" <[email protected]> wrote:
>
>> Thank you Kou!
>> 
>> At least initially, I don't think I'll be able to complete the Dataset 
>> integration in time. So 10.0.0 probably won't ship with a hard dependency 
>> on apache/arrow-adbc. That said, I am hoping to have PyArrow take an 
>> optional dependency (so Flight SQL can finally be available from Python).
>> 
>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>>> Hi,
>>>
>>> As a maintainer of Linux packages, I want apache/arrow-adbc
>>> to be released before apache/arrow is released so that
>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>>> .deb/.rpm.
>>>
>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>>> apache/arrow's .deb/.rpm needs to depend on
>>> apache/arrow-adbc's .deb/.rpm.)
>>>
>>> We can add .deb/.rpm related files
>>> (dev/tasks/linux-packages/ in apache/arrow) to
>>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
>>>
>>> FYI: I did it for datafusion-contrib/datafusion-c:
>>>
>>> * https://github.com/datafusion-contrib/datafusion-c/tree/main/package
>>> * https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>>>
>>> I can work on it in apache/arrow-adbc.
>>>
>>>
>>> Thanks,
>>> -- 
>>> kou
>>>
>>> In <[email protected]>
>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 11:51:08 -0400,
>>>   "David Li" <[email protected]> wrote:
>>>
>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of 
>>>> text that follows…)
>>>> 
>>>> These are the components:
>>>> 
>>>> - Core adbc.h header
>>>> - Driver manager for C/C++
>>>> - Flight SQL-based driver
>>>> - Postgres-based driver (WIP)
>>>> - SQLite-based driver (more of a testbed for me than an actual component - 
>>>> I don't think we'd actually distribute this)
>>>> - Java core interfaces
>>>> - Java driver manager
>>>> - Java JDBC-based driver
>>>> - Java Flight SQL-based driver
>>>> - Python driver manager
>>>> 
>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers 
>>>> get moved to the main Arrow repo and distributed as part of the regular 
>>>> Arrow releases.
>>>> 
>>>> For the rest of the components: they could be packaged individually, but 
>>>> versioned and released together. Also, each C/C++ driver probably needs a 
>>>> corresponding Python package so Python users do not have to futz with 
>>>> shared library configurations. (See [1].) So for instance, installing 
>>>> PyArrow would also give you the Flight SQL driver, and `pip install 
>>>> adbc_postgres` would get you the Postgres-based driver.
>>>> 
>>>> That would mean setting up separate CI, release, etc. (and eventually 
>>>> linking Crossbow & Conbench as well?). That does mean duplication of 
>>>> effort, but the trade-off is that it avoids bloating the main release process 
>>>> even further. However, I'd like to hear from those closer to the release 
>>>> process on this subject - if it would make people's lives easier, we could 
>>>> merge everything into one repo/process.
>>>> 
>>>> Integrations would be distributed as part of their respective packages 
>>>> (e.g. Arrow Dataset would optionally link to the driver manager). So the 
>>>> "part of Arrow 10.0.0" aspect means having a stable interface for adbc.h, 
>>>> and getting the Flight SQL drivers into the main repo.
>>>> 
>>>> [1]: https://github.com/apache/arrow-adbc/issues/53
>>>> 
>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>>>> "David Li" <[email protected]> wrote:
>>>>>> Since it's been a while, I'd like to give an update. There are also a 
>>>>>> few questions I have around distribution.
>>>>>> 
>>>>>> Currently:
>>>>>> - Supported in C, Java, and Python.
>>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, 
>>>>>> with a draft of a libpq (Postgres) driver (using nanoarrow).
>>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>>>>> - For Python, there are low-level bindings to the C API, and the DBAPI 
>>>>>> interface on top of that (plus a few extension methods resembling 
>>>>>> DuckDB/Turbodbc).
>>>>>>  
>>>>>> There are drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd 
>>>>>> like to thank Hannes and Kirill for their comments, as well as Antoine, 
>>>>>> Dewey, and Matt here.)
>>>>>> 
>>>>>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm 
>>>>>> not sure how we would like to handle packaging and distribution. In 
>>>>>> particular, there are several sub-components for each language (the 
>>>>>> driver manager + the drivers), increasing the work. Any thoughts here?
>>>>>
>>>>> Sorry, forgot to answer here. But I think your question is too broadly
>>>>> formulated. It probably deserves a case-by-case discussion, IMHO.
>>>>>
>>>>>> I'm also wondering how we want to handle this in terms of specification 
>>>>>> - I assume we'd consider the core header file/Java interfaces a spec 
>>>>>> like the C Data Interface/Flight RPC, and vote on them/mirror them into 
>>>>>> the format/ directory?
>>>>>
>>>>> That sounds like the right way to me indeed.
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
