That's a fair point. Part of the work I've done so far is a local Go 
implementation that can at least consume the C data interface, and it will 
eventually need to be able to produce it too. What I'm specifically asking 
for, though, is opinions on using the C data interface to build a C 
*programming* interface to the C++ Dataset API, in the same vein as the JNI 
interface, so that Go could use the Dataset API without having to reimplement 
all of it. 

Given the difference between a *programming* interface and a *data* interface, 
I suppose the recommendation would be that a C programming interface for the 
Dataset API (one that uses the C data interface to produce and consume the 
actual Arrow data) should be a separate component, like libarrow_dataset_jni, 
rather than being integrated directly into the dataset component. Right?

If there don't need to be any Go-specific things in the interface, then it 
could just be called *libarrow_dataset_c* or something equivalent, but it 
would still be a separate component that relies on the Dataset API rather 
than being integrated into it. Does that make sense?
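
To make the distinction concrete, here is a minimal sketch in C of what I 
have in mind. The struct ArrowArray layout is the one from the C data 
interface spec. Everything named DemoScanner or demo_* is hypothetical and 
not part of Arrow, and the stub just emits int32 batches from memory instead 
of calling the C++ Dataset API, so treat it as an illustration of the shape 
of the interface rather than a proposed design.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Data interface: struct layout as defined by the Arrow C data interface. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Programming interface (hypothetical): a handle plus plain C functions.
 * A real version would wrap the C++ Dataset API behind these calls. */
typedef struct DemoScanner {
  int32_t next_value;
  int32_t remaining_batches;
} DemoScanner;

/* Release callback: frees what the producer allocated, then marks the
 * struct as released by clearing the callback, as the spec requires. */
static void demo_release(struct ArrowArray* arr) {
  free((void*)arr->buffers[1]); /* values buffer */
  free((void*)arr->buffers);
  arr->release = NULL;
}

DemoScanner* demo_scanner_open(int32_t n_batches) {
  DemoScanner* s = malloc(sizeof *s);
  if (s == NULL) return NULL;
  s->next_value = 1;
  s->remaining_batches = n_batches;
  return s;
}

/* Returns 1 and fills *out with a batch, 0 at end of stream, -1 on error. */
int demo_scanner_next(DemoScanner* s, struct ArrowArray* out,
                      int64_t batch_len) {
  assert(s != NULL && out != NULL && batch_len > 0);
  if (s->remaining_batches == 0) return 0;
  int32_t* values = malloc(sizeof(int32_t) * (size_t)batch_len);
  const void** buffers = malloc(2 * sizeof(void*));
  if (values == NULL || buffers == NULL) {
    free(values);
    free(buffers);
    return -1;
  }
  for (int64_t i = 0; i < batch_len; i++) values[i] = s->next_value++;
  buffers[0] = NULL;   /* no validity bitmap, since null_count is 0 */
  buffers[1] = values; /* int32 values */
  *out = (struct ArrowArray){.length = batch_len,
                             .n_buffers = 2,
                             .buffers = buffers,
                             .release = demo_release};
  s->remaining_batches--;
  return 1;
}

void demo_scanner_close(DemoScanner* s) { free(s); }

/* Consumer side: roughly what a CGO caller would do for each batch. */
int64_t demo_sum_all(DemoScanner* s, int64_t batch_len) {
  int64_t total = 0;
  struct ArrowArray batch;
  while (demo_scanner_next(s, &batch, batch_len) == 1) {
    const int32_t* values = (const int32_t*)batch.buffers[1];
    for (int64_t i = 0; i < batch.length; i++) {
      total += values[batch.offset + i];
    }
    batch.release(&batch); /* the consumer must release every batch */
  }
  return total;
}
```

The functions are the programming interface and would live in the separate 
library; the struct is the only thing that crosses the boundary as data, so 
the Go side would only need a C data interface consumer plus thin CGO calls.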

Alternatively, I could write a Go implementation of the Dataset API and use 
CGO to call into the compute/Gandiva APIs at that level, instead of at the 
Dataset API level. I'm trying to find the right balance between 
maintainability and complexity. Reimplementing the entire compute library in 
Go is certainly not viable long-term, since it would then need to be 
maintained separately from the C++ implementation rather than just hooking 
into the C++ implementation directly (which I presume is the motivation for 
using JNI to do the same, aside from performance). 

--Matt

-----Original Message-----
From: Antoine Pitrou <anto...@python.org> 
Sent: Monday, August 23, 2021 12:00 PM
To: dev@arrow.apache.org
Subject: Re: [C++][Go] CGO For Dataset API Integration


Hi Matt,

As the name suggests, the C data interface is not a *programming* interface.  
It is a data sharing convention which relies on the existence of dedicated 
endpoints to produce or consume the C data structures.

For example in Arrow C++, there is this set of APIs:
https://arrow.apache.org/docs/cpp/api/c_abi.html#c-data-interface

In PyArrow:
https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L1267-L1305

In Rust:
https://docs.rs/arrow/5.2.0/arrow/array/trait.Array.html#method.to_raw
https://docs.rs/arrow/5.2.0/arrow/array/fn.make_array_from_raw.html

The first thing to do would be for the Go implementation to implement the C 
data interface.

Regards

Antoine.



On 23/08/2021 at 16:07, Matthew Topol wrote:
> Hey All,
> 
> So I've been working on a use case where I needed to be able to use the 
> Dataset API from Golang and instead of trying to port all of it to Golang 
> (which would require porting the Compute side too) I decided to create a 
> proof of concept using CGO to just call into the existing C++ code in a 
> similar fashion to how the Java solution is using JNI for the same thing. 
> After proving to myself it works I came up with a question that I figured 
> would be best sent to this mailing list.
> 
> When building it out, CGO just needs a C-API exposed for it to work and while 
> there is a C Data interface designed for using Arrow, there is not currently 
> a C Data Interface designed for the Dataset API. As a result, the big 
> question is that if I wanted to contribute the work to the Arrow Repo, should 
> a C Interface for the Dataset API be put as a separate directory and separate 
> build artifact like the JNI interface, or should it just be directly added to 
> and exported from the Dataset library? It's an organizational question 
> because either way it would need to exist anywhere the Go code that calls 
> it is built, so it's the difference between just needing 
> libarrow_dataset.so (and its dependencies) or needing that *and* 
> libarrow_dataset_cgo.so/.a, etc.
> 
> I'm curious what everyone's opinions might be on this so I can get an idea of 
> which direction I should go before trying to put a PR together.
> 
> Thanks everyone!
> 
> --Matt Topol
> 
