Re: [DISC] Improving Arrow's database support

Sutou Kouhei Sat, 03 Sep 2022 15:24:47 -0700

Hi,

> Do we have a preference for versioning strategy? Should we
> proceed in lockstep with the Arrow C++ library et al. and
> release "ADBC 1.0.0" (the API standard) with "drivers
> version 10.0.0", or use an independent versioning scheme?
> (For example, release API standard and components at
> "1.0.0". Then further releases of components that do not
> change the spec would be "1.1", "1.2", ...; if/when we
> change the spec, start over with "2.0", "2.1", ...)


I like an independent versioning scheme. I assume that ADBC
doesn't need backward incompatible changes frequently. How
about incrementing the major version only when ADBC needs
a backward incompatible change?

e.g.:

  1.  Release ADBC (the API standard) 1.0.0
  2.  Release adbc_driver_manager 1.0.0
  3.  Release adbc_driver_postgres 1.0.0
  4.  Add a new feature to adbc_driver_postgres without
      any backward incompatible changes
  5.  Release adbc_driver_postgres 1.1.0
  6.  Fix a bug in adbc_driver_manager without
      any backward incompatible changes
  7.  Release adbc_driver_manager 1.0.1
  8.  Add a backward incompatible change to adbc_driver_manager
  9.  Release adbc_driver_manager 2.0.0
  10. Add a new feature to ADBC without any
      backward incompatible changes
  11. Release ADBC (the API standard) 1.1.0
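
The steps above can be sketched in Python. This is only an
illustration, assuming semantic-versioning-style bump rules; the
bump() helper and the "breaking"/"feature"/"fix" labels are
hypothetical names, not part of ADBC or any release tooling.

```python
from typing import Dict, Tuple

Version = Tuple[int, int, int]  # (major, minor, patch)


def bump(version: Version, change: str) -> Version:
    """Return the next version for a given kind of change (hypothetical helper)."""
    major, minor, patch = version
    if change == "breaking":  # backward incompatible change
        return (major + 1, 0, 0)
    if change == "feature":   # new feature, backward compatible
        return (major, minor + 1, 0)
    if change == "fix":       # bug fix, backward compatible
        return (major, minor, patch + 1)
    raise ValueError(f"unknown change type: {change}")


# Steps 1-3: everything starts at 1.0.0, but each component
# is versioned independently afterwards.
versions: Dict[str, Version] = {
    "adbc": (1, 0, 0),  # the API standard
    "adbc_driver_manager": (1, 0, 0),
    "adbc_driver_postgres": (1, 0, 0),
}

# Steps 4-5: new feature in the Postgres driver.
versions["adbc_driver_postgres"] = bump(versions["adbc_driver_postgres"], "feature")
# Steps 6-7: bug fix in the driver manager.
versions["adbc_driver_manager"] = bump(versions["adbc_driver_manager"], "fix")
# Steps 8-9: backward incompatible change in the driver manager.
versions["adbc_driver_manager"] = bump(versions["adbc_driver_manager"], "breaking")
# Steps 10-11: new feature in the API standard itself.
versions["adbc"] = bump(versions["adbc"], "feature")

for name, version in versions.items():
    print(name, ".".join(str(part) for part in version))
```

Note that a major bump in one component (here the driver manager)
does not force a bump in the API standard or in the other drivers.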


Thanks,
-- 
kou

In <[email protected]>
  "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep 2022 16:36:43 -0400,
  "David Li" <[email protected]> wrote:

> Following up here with some specific questions:
> 
> Matt Topol added some Go definitions [1] (thanks!) I'd assume we want to vote 
> on those as well?
> 
> How should the process work for Java/Go? For C/C++, I assume we'd treat it 
> like the C Data Interface and copy adbc.h to format/ after a vote, and then 
> vote on releases of components. Or do we really only consider the C header as 
> the 'format', with the others being language-specific affordances?
> 
> What about for Java and for Go? We could vote on and tag a release for Go, 
> and add a documentation page that links to the Java/Go definitions at a 
> specific revision (as the equivalent 'format' definition for Java/Go)? Or 
> would we vendor the entire Java module/Go package as the 'format'?
> 
> Do we have a preference for versioning strategy? Should we proceed in 
> lockstep with the Arrow C++ library et al. and release "ADBC 1.0.0" (the API 
> standard) with "drivers version 10.0.0", or use an independent versioning 
> scheme? (For example, release API standard and components at "1.0.0". Then 
> further releases of components that do not change the spec would be "1.1", 
> "1.2", ...; if/when we change the spec, start over with "2.0", "2.1", ...)
> 
> [1]: https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go
> 
> -David
> 
> On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
>> Hi,
>>
>> OK. I'll send pull requests for GLib and Ruby soon.
>>
>>> I'm curious if you have a particular use case in mind.
>>
>> I don't have a production-ready use case yet, but I want to
>> implement an Active Record adapter for ADBC. Active Record
>> is the O/R mapper for Ruby on Rails. Building web
>> applications with Ruby on Rails is one of the major Ruby use
>> cases, so providing an Active Record interface for ADBC will
>> increase the number of Apache Arrow users in the Ruby community.
>>
>> NOTE: Generally, Ruby on Rails users don't process large
>> data, but they sometimes need to process large (medium?) data
>> in a batch process. An Active Record adapter for ADBC may be
>> useful for such use cases.
>>
>>> There's a little bit more API cleanup to do [1]. If you
>>> have comments on that or anything else, I'd appreciate
>>> them. Otherwise, pull requests would also be appreciated.
>>
>> OK. I'll open issues/pull requests when I find
>> something. For now, I think that a "MODULE" library
>> rather than a "SHARED" library (in CMake terminology
>> [cmake]) is better for driver modules. (I'll open an issue
>> for this later.)
>>
>> [cmake]: https://cmake.org/cmake/help/latest/command/add_library.html
>>
>>
>> Thanks,
>> -- 
>> kou
>>
>> In <[email protected]>
>>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 Aug 2022 15:28:56 -0400,
>>   "David Li" <[email protected]> wrote:
>>
>>> I would be very happy to see GLib/Ruby bindings! I'm curious if you have a 
>>> particular use case in mind. 
>>> 
>>> There's a little bit more API cleanup to do [1]. If you have comments on 
>>> that or anything else, I'd appreciate them. Otherwise, pull requests would 
>>> also be appreciated.
>>> 
>>> [1]: https://github.com/apache/arrow-adbc/issues/79
>>> 
>>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>>>> Hi,
>>>>
>>>> Thanks for sharing the current status!
>>>> I understand.
>>>>
>>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>>>> before we release the first version? (I want to use ADBC
>>>> from Ruby.) Or should I wait for the first release? If I can
>>>> work on it now, I'll open pull requests for it.
>>>>
>>>> Thanks,
>>>> -- 
>>>> kou
>>>>
>>>> In <[email protected]>
>>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022 11:03:26 -0400,
>>>>   "David Li" <[email protected]> wrote:
>>>>
>>>>> Thank you Kou!
>>>>> 
>>>>> At least initially, I don't think I'll be able to complete the Dataset 
>>>>> integration in time. So 10.0.0 probably won't ship with a hard 
>>>>> dependency. That said I am hoping to have PyArrow take an optional 
>>>>> dependency (so Flight SQL can finally be available from Python).
>>>>> 
>>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>>>>>> Hi,
>>>>>>
>>>>>> As a maintainer of Linux packages, I want apache/arrow-adbc
>>>>>> to be released before apache/arrow is released so that
>>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>>>>>> .deb/.rpm.
>>>>>>
>>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>>>>>> apache/arrow's .deb/.rpm needs to depend on
>>>>>> apache/arrow-adbc's .deb/.rpm.)
>>>>>>
>>>>>> We can add .deb/.rpm related files
>>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>>>>>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
>>>>>>
>>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>>>>>>
>>>>>> * https://github.com/datafusion-contrib/datafusion-c/tree/main/package
>>>>>> * https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>>>>>>
>>>>>> I can work on it in apache/arrow-adbc.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> -- 
>>>>>> kou
>>>>>>
>>>>>> In <[email protected]>
>>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 11:51:08 -0400,
>>>>>>   "David Li" <[email protected]> wrote:
>>>>>>
>>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall 
>>>>>>> of text that follows…)
>>>>>>> 
>>>>>>> These are the components:
>>>>>>> 
>>>>>>> - Core adbc.h header
>>>>>>> - Driver manager for C/C++
>>>>>>> - Flight SQL-based driver
>>>>>>> - Postgres-based driver (WIP)
>>>>>>> - SQLite-based driver (more of a testbed for me than an actual 
>>>>>>> component - I don't think we'd actually distribute this)
>>>>>>> - Java core interfaces
>>>>>>> - Java driver manager
>>>>>>> - Java JDBC-based driver
>>>>>>> - Java Flight SQL-based driver
>>>>>>> - Python driver manager
>>>>>>> 
>>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL 
>>>>>>> drivers get moved to the main Arrow repo and distributed as part of the 
>>>>>>> regular Arrow releases.
>>>>>>> 
>>>>>>> For the rest of the components: they could be packaged individually, 
>>>>>>> but versioned and released together. Also, each C/C++ driver probably 
>>>>>>> needs a corresponding Python package so Python users do not have to 
>>>>>>> futz with shared library configurations. (See [1].) So for instance, 
>>>>>>> installing PyArrow would also give you the Flight SQL driver, and `pip 
>>>>>>> install adbc_postgres` would get you the Postgres-based driver.
>>>>>>> 
>>>>>>> That would mean setting up separate CI, release, etc. (and eventually 
>>>>>>> linking Crossbow & Conbench as well?). That does mean duplication of 
>>>>>>> effort, but the trade off is avoiding bloating the main release process 
>>>>>>> even further. However, I'd like to hear from those closer to the 
>>>>>>> release process on this subject - if it would make people's lives 
>>>>>>> easier, we could merge everything into one repo/process.
>>>>>>> 
>>>>>>> Integrations would be distributed as part of their respective packages 
>>>>>>> (e.g. Arrow Dataset would optionally link to the driver manager). So 
>>>>>>> the "part of Arrow 10.0.0" aspect means having a stable interface for 
>>>>>>> adbc.h, and getting the Flight SQL drivers into the main repo.
>>>>>>> 
>>>>>>> [1]: https://github.com/apache/arrow-adbc/issues/53
>>>>>>> 
>>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>>>>>>> "David Li" <[email protected]> wrote:
>>>>>>>>> Since it's been a while, I'd like to give an update. There are also a 
>>>>>>>>> few questions I have around distribution.
>>>>>>>>> 
>>>>>>>>> Currently:
>>>>>>>>> - Supported in C, Java, and Python.
>>>>>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and 
>>>>>>>>> SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>>>>>>>> - For Python, there's low-level bindings to the C API, and the DBAPI 
>>>>>>>>> interface on top of that (+a few extension methods resembling 
>>>>>>>>> DuckDB/Turbodbc).
>>>>>>>>>  
>>>>>>>>> There's drafts of integration with Ibis [1], DBI (R), and DuckDB. 
>>>>>>>>> (I'd like to thank Hannes and Kirill for their comments, as well as 
>>>>>>>>> Antoine, Dewey, and Matt here.)
>>>>>>>>> 
>>>>>>>>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm 
>>>>>>>>> not sure how we would like to handle packaging and distribution. In 
>>>>>>>>> particular, there are several sub-components for each language (the 
>>>>>>>>> driver manager + the drivers), increasing the work. Any thoughts here?
>>>>>>>>
>>>>>>>> Sorry, forgot to answer here. But I think your question is too broadly
>>>>>>>> formulated. It probably deserves a case-by-case discussion, IMHO.
>>>>>>>>
>>>>>>>>> I'm also wondering how we want to handle this in terms of 
>>>>>>>>> specification - I assume we'd consider the core header file/Java 
>>>>>>>>> interfaces a spec like the C Data Interface/Flight RPC, and vote on 
>>>>>>>>> them/mirror them into the format/ directory?
>>>>>>>>
>>>>>>>> That sounds like the right way to me indeed.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Antoine.
