That's a fair point. Part of the work I've done so far is a local Go implementation that at least consumes the C data interface, and it will eventually need to produce C data interface structures as well. What I'm specifically asking for, though, is opinions on using the C data interface to build a C *programming* interface to the C++ Dataset API, in the same vein as the JNI interface, so that Go could use the Dataset API without having to reimplement the entirety of it.
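To make the idea concrete, here is a rough sketch of how a C programming interface could hand batches to a consumer through the C data interface. Only the `struct ArrowArray` layout comes from the Arrow C data interface specification. Everything else (`DatasetScanner`, `dataset_scanner_open`, `dataset_scanner_next`, `dataset_scanner_close`, `consume_and_release`) is a hypothetical name invented for illustration, and the "scanner" just emits consecutive int32 values instead of wrapping the C++ Dataset API. A real interface would also export an `ArrowSchema`; here the consumer is simply assumed to know the data is int32 with no nulls.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Layout defined by the Arrow C data interface specification. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical handle. A real wrapper would hide a C++ scanner behind it. */
typedef struct DatasetScanner {
  int32_t next_value; /* stand-in state: next value to emit */
  int32_t remaining;  /* number of batches left */
} DatasetScanner;

/* Release callback: frees producer-owned memory and marks the array released. */
static void release_int32_batch(struct ArrowArray* array) {
  free((void*)array->buffers[1]); /* data buffer */
  free(array->buffers);           /* buffer pointer table */
  array->release = NULL;          /* the spec requires this marker */
}

/* Hypothetical programming-interface call: open a scanner. */
DatasetScanner* dataset_scanner_open(int32_t n_batches) {
  DatasetScanner* s = malloc(sizeof *s);
  if (s == NULL) return NULL;
  s->next_value = 0;
  s->remaining = n_batches;
  return s;
}

/* Returns 1 and fills *out, 0 at end of stream, -1 on error. */
int dataset_scanner_next(DatasetScanner* s, int64_t batch_len,
                         struct ArrowArray* out) {
  if (batch_len <= 0) return -1;
  if (s->remaining == 0) return 0;
  int32_t* values = malloc((size_t)batch_len * sizeof *values);
  const void** buffers = malloc(2 * sizeof *buffers);
  if (values == NULL || buffers == NULL) {
    free(values);
    free(buffers);
    return -1;
  }
  for (int64_t i = 0; i < batch_len; i++) values[i] = s->next_value++;
  buffers[0] = NULL;   /* no validity bitmap because null_count is 0 */
  buffers[1] = values; /* int32 data buffer */
  *out = (struct ArrowArray){
      .length = batch_len, .null_count = 0, .offset = 0,
      .n_buffers = 2, .n_children = 0, .buffers = buffers,
      .children = NULL, .dictionary = NULL,
      .release = release_int32_batch, .private_data = NULL};
  s->remaining--;
  return 1;
}

/* Hypothetical programming-interface call: close the scanner. */
void dataset_scanner_close(DatasetScanner* s) { free(s); }

/* Consumer side, roughly what a Go binding would do through cgo:
   read the int32 data buffer, then hand the memory back via release. */
int64_t consume_and_release(struct ArrowArray* array) {
  assert(array->release != NULL); /* a released array must not be read */
  const int32_t* values = (const int32_t*)array->buffers[1];
  int64_t sum = 0;
  for (int64_t i = 0; i < array->length; i++) sum += values[array->offset + i];
  array->release(array);
  return sum;
}
```

The open/next/close calls are the *programming* interface, while the `ArrowArray` filled in by `dataset_scanner_next` is the *data* interface. The Go side would only need cgo declarations for those three calls plus code to read the struct and invoke `release`.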
Given the difference between a *programming* interface and a *data* interface, I suppose the recommendation would be that a C programming interface for the Dataset API (using the C data interface to produce and consume the actual Arrow data) should be a separate component, like libarrow_dataset_jni, rather than being integrated directly into the dataset component. Right? If the interface doesn't need anything Go-specific, it could just be called *libarrow_dataset_c* or something equivalent, but it would still be a separate component that relies on the Dataset API rather than being integrated into it. Does that make sense?

Alternatively, I could create a Go implementation of the Dataset API and use CGO to make the necessary calls to the compute/gandiva APIs at that level, instead of at the Dataset API level. I'm trying to find the right balance between maintainability and complexity. Reimplementing the entire compute library in Go is certainly not viable long term, since it would then need to be maintained separately from the C++ implementation rather than hooking into it directly (which I presume is the motivation for using JNI to do the same, aside from performance).

--Matt

-----Original Message-----
From: Antoine Pitrou <anto...@python.org>
Sent: Monday, August 23, 2021 12:00 PM
To: dev@arrow.apache.org
Subject: Re: [C++][Go] CGO For Dataset API Integration

Hi Matt,

As the name suggests, the C data interface is not a *programming* interface. It is a data sharing convention which relies on the existence of dedicated endpoints to produce or consume the C data structures.
For example in Arrow C++, there is this set of APIs:
https://arrow.apache.org/docs/cpp/api/c_abi.html#c-data-interface

In PyArrow:
https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L1267-L1305

In Rust:
https://docs.rs/arrow/5.2.0/arrow/array/trait.Array.html#method.to_raw
https://docs.rs/arrow/5.2.0/arrow/array/fn.make_array_from_raw.html

The first thing to do would be for the Go implementation to implement the C data interface.

Regards

Antoine.

On 23/08/2021 at 16:07, Matthew Topol wrote:
> Hey All,
>
> So I've been working on a use case where I needed to be able to use the
> Dataset API from Golang and instead of trying to port all of it to Golang
> (which would require porting the Compute side too) I decided to create a
> proof of concept using CGO to just call into the existing C++ code in a
> similar fashion to how the Java solution is using JNI for the same thing.
> After proving to myself it works I came up with a question that I figured
> would be best sent to this mailing list.
>
> When building it out, CGO just needs a C-API exposed for it to work and while
> there is a C Data interface designed for using Arrow, there is not currently
> a C Data Interface designed for the Dataset API. As a result, the big
> question is that if I wanted to contribute the work to the Arrow Repo, should
> a C Interface for the Dataset API be put as a separate directory and separate
> build artifact like the JNI interface, or should it just be directly added to
> and exported from the Dataset library? It's an organizational question
> because either way it would need to exist anywhere that the Go code that
> uses it is built, so it's the difference between just needing
> libarrow_dataset.so (and its dependencies) or needing that *and*
> libarrow_dataset_cgo.so/.a, etc.
>
> I'm curious what everyone's opinions might be on this so I can get an idea of
> which direction I should go before trying to put a PR together.
>
> Thanks everyone!
>
> --Matt Topol
>