The only thing I don't like about it being a private module in the Go
implementation is distribution. For native Go code, consumers can just run
`go get` and have it work. But this interface would require both consumers of
the module and any consumers of those consumers to have a locally built copy
of this library available when building their Go code. It's easy to statically
link for distributing binaries, but not for library builders.

Currently, the Arrow C++ source tree already has everything set up and
configured to distribute the build artifacts for the various platforms, which
I assume is also why the C++ code for the JNI dataset library is in the C++
source tree (please correct me if I'm wrong). The Go build and deploy scripts
don't have such a deployment because there is typically no need for one with
Go. So even if it's a separate private module, I'd still prefer for it to at
least be in the cpp source tree (perhaps a cpp/src/cgo directory?) in order to
benefit from the existing build and CI tooling for deployment and
distribution. That way, as long as the necessary dependency is installed (e.g.
via "apt install libarrow_dataset_cgo"), `go get
github.com/apache/arrow/go/dataset` would work without issue, rather than
requiring additional steps from developers.

Unless there's an easy way to grab the C++ code from the Go source tree in this
case and add it to the libraries being deployed from the C++ build? I'm not 
familiar enough with that deployment configuration to know if it's actually 
easy to hook into for compiling and deploying a library that isn't in the C++ 
source tree.
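
For concreteness, the handle-passing pattern mentioned in my earlier message
quoted below (a uintptr_t holding the address of a heap-allocated shared_ptr)
looks roughly like the following. This is only a minimal sketch, not the POC
code itself: FakeDataset is a stand-in for the real Dataset class, and the
fake_dataset_* function names are hypothetical, not an existing Arrow API.

```cpp
// Sketch only: FakeDataset stands in for the real C++ Dataset class, and the
// fake_dataset_* names are hypothetical, not part of any existing Arrow API.
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

struct FakeDataset {
  std::string uri;
  int64_t num_rows;
};

extern "C" {

// Creates a dataset and returns an opaque handle: the address of a
// heap-allocated shared_ptr, which keeps the object alive while the handle
// is held on the other side of the C boundary.
uintptr_t fake_dataset_open(const char* uri) {
  auto ds = std::make_shared<FakeDataset>(FakeDataset{uri, 42});
  auto* holder = new std::shared_ptr<FakeDataset>(std::move(ds));
  return reinterpret_cast<uintptr_t>(holder);
}

// Converts the handle back into the shared_ptr and reads through it.
int64_t fake_dataset_num_rows(uintptr_t handle) {
  assert(handle != 0);
  auto* holder = reinterpret_cast<std::shared_ptr<FakeDataset>*>(handle);
  return (*holder)->num_rows;
}

// Deletes the heap-allocated shared_ptr, dropping its reference. The caller
// must not use the handle after this call.
void fake_dataset_release(uintptr_t handle) {
  delete reinterpret_cast<std::shared_ptr<FakeDataset>*>(handle);
}

}  // extern "C"
```

On the Go side, the cgo wrapper would hold the returned value as an opaque
integer and call the release function when it is done with the object. A
shared library exposing functions like these is what would need to be built
and distributed for `go get` to work for consumers.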

-----Original Message-----
From: Antoine Pitrou <anto...@python.org> 
Sent: Monday, August 23, 2021 1:24 PM
To: dev@arrow.apache.org
Subject: Re: [C++][Go] CGO For Dataset API Integration


On 23/08/2021 at 19:16, Matthew Topol wrote:
> Unfortunately, Go currently can only integrate with C++ libraries through a C 
> interface. There does exist SWIG which is a generator for creating interface 
> code between Go and C++, but ultimately it's just automating the creation of 
> a C interface and Go glue code. Personally I'm not a fan of the code that 
> SWIG generates and haven't had too much luck with it.
> 
> I have a working POC of using the datasets API via CGO through a C interface 
> (basically just passing around a uintptr_t which is the address of a heap 
> allocated shared_ptr to a DatasetFactory/Dataset/Scanner and using the C Data 
> interface for passing the resulting record batches through without copying), 
> but couldn't decide on the best way to go about integrating the idea and 
> cleaning it up into a real PR, hence this email thread. I initially was 
> porting the Dataset API to Go, but ran into the fact that it uses the compute 
> expression classes to define things and perform the filtering and realized 
> that it wouldn't be a good idea to try porting the entire compute library.
> 
> So it just becomes a question as to what level I do the implementation and at 
> what level do I make the calls to a C interface to call into the C++, and 
> then whether or not the interface is a separate component from the existing 
> dataset/compute libraries which can get linked into the Go, optionally as a 
> separate module so that it's not creating a dependency on the C++ libraries 
> for the current arrow Go implementation, only for using the Dataset API stuff 
> (and potentially the compute library).

I think the dataset C interface can start as a private module in the Go 
implementation.  If it may be useful to other people then we can consider 
transferring it into the Arrow C++ source tree.

Regards

Antoine.
