It's not so much about whether or not we *can*, but about whether or not we *should* generate and expose these files.
Fundamentally, Substrait aims to be a means to connect different systems together by standardizing some interchange format. Currently that happens to be based on protobuf, so *one* (obvious) way to generate and interpret Substrait plans is to use Google's own protobuf implementation, as generated by protoc for various languages. It's true that there's nothing fundamentally blocking Arrow from exposing those. However, it's by no means the *only* way to interact with protobuf serializations, and Google's own implementation is a hot mess when it comes to dependencies; there are lots of good reasons why you might not want to depend on it and opt for a third-party implementation instead. For one thing, the Python wheel depends on system libraries beyond manylinux, and if they happen to be incompatible (which is likely) it just explodes unless you set some environment variable.

Therefore, assuming pyarrow doesn't already depend on protobuf, I feel like we should keep it that way, and thus that we should not include the generated Python files in the release. Note that we don't even expose/leak the protoc-generated C++ classes, for similar reasons.

Also, as Weston already pointed out, it's not really our job; Substrait aims to publish bindings for various languages by itself. It just doesn't expose them for Python *yet*. In the interim I suppose you could use the substrait-validator package from PyPI. It does expose them, as well as some convenient conversion functions, but I'm having trouble finding people to help me keep the validator maintained.

I suppose versioning would be difficult either way, since Substrait semi-regularly pushes breaking changes and Arrow currently lags behind by several months (though I have a PR open for Substrait 0.6). I guess from that point of view distributing the right version along with pyarrow seems nice, but the issues with Google's protobuf implementation remain. This being an issue at all is also very much a Substrait problem, not an Arrow problem; at best we can try to mitigate it.

Jeroen

On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:

> I looked into the Arrow build system some more. It is possible to get the Python classes generated by adding a "--python_out" flag (set to a directory created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under `macro(build_substrait)` in `cpp/cmake_modules/ThirdpartyToolchain.cmake`. However, this makes them available only in the Arrow C++ build, whereas for the current purpose they need to be available in the PyArrow build. The PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS has access to `cpp/cmake_modules`. So, one solution could be to pull `macro(build_substrait)` into `python/CMakeLists.txt` and call it to generate the Python protobuf classes under `python/`, making them available for import by PyArrow code. This would probably be cleaner with some macro parameters to distinguish between C++ and Python generation.
>
> Does this sound like a reasonable approach?
>
>
> Yaron.
>
> ________________________________
> From: Yaron Gvili <rt...@hotmail.com>
> Sent: Saturday, July 2, 2022 8:55 AM
> To: dev@arrow.apache.org <dev@arrow.apache.org>; Phillip Cloud <cpcl...@gmail.com>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> I'm somewhat confused by this answer because I think resolving the issue I raised does not require any change outside PyArrow. I'll try to explain the issue differently.
> First, let me describe the current situation with Substrait protobuf in Arrow C++. The Arrow C++ build system handles external projects in `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these external projects is "substrait". By default, the build system takes the source code for "substrait" from `https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in `cpp/thirdparty/versions.txt`. The source code is check-summed and unpacked in `substrait_ep-prefix` under the build directory, and from this the protobuf C++ classes are generated in `*.pb.{h,cc}` files in `substrait_ep-generated` under the build directory. The build system makes a library using the `*.cc` files and makes the `*.h` files available for other C++ modules to use.
>
> Setting up the above mechanism did not require any change in the `substrait-io/substrait` repo, nor any coordination with its authors. What I'm looking for is a similar build mechanism for PyArrow that builds Substrait protobuf Python classes and makes them available for use by other PyArrow modules. I believe this PyArrow build mechanism does not exist currently and that setting up one would not require any changes outside PyArrow. I'm asking (1) whether that's indeed the case, (2) whether others agree this mechanism is needed at least due to the problem I ran into that I previously described, and (3) for any thoughts about how to set up this mechanism assuming it is needed.
>
> Weston, perhaps your thinking was that the Substrait protobuf Python classes need to be built by a repo in the substrait-io space and made available as a binary+headers package? This can be done but will require involving Substrait people and appears to be inconsistent with current patterns in the Arrow build system. Note that for my purposes here, the Substrait protobuf Python classes will be used for composing or interpreting a Substrait plan, not for transforming it by an optimizer, though a Python-based optimizer is a valid use case for them.
>
>
> Yaron.
> ________________________________
> From: Weston Pace <weston.p...@gmail.com>
> Sent: Friday, July 1, 2022 12:42 PM
> To: dev@arrow.apache.org <dev@arrow.apache.org>; Phillip Cloud <cpcl...@gmail.com>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> Given that Acero does not do any planner / optimizer type tasks I'm not sure you will find anything like this in arrow-cpp or pyarrow. What you are describing I sometimes refer to as "plan slicing and dicing". I have wondered if we will someday need this in Acero but I fear it is a slippery slope between "a little bit of plan manipulation" and "a full blown planner" so I've shied away from it.
>
> My first spot to look would be a substrait-python repository which would belong here: https://github.com/substrait-io
>
> However, it does not appear that such a repository exists. If you're willing to create one then a quick ask on the Substrait Slack instance should be enough to get the repository created. Perhaps there is some genesis of this library in Ibis although I think Ibis would use its own representation for slicing and dicing and only use Substrait for serialization.
> Once that repository is created pyarrow could probably import it, but unless this plan manipulation makes sense purely from a pyarrow perspective I would rather prefer that the user application import both pyarrow and substrait-python independently.
>
> Perhaps @Phillip Cloud or someone from the Ibis space might have some ideas on where this might be found.
>
> -Weston
>
> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > Hi,
> >
> > Is there support for accessing Substrait protobuf Python classes (such as Plan) from PyArrow? If not, how should such support be added? For example, should the PyArrow build system pull in the Substrait repo as an external project and build its protobuf Python classes, in a manner similar to how Arrow C++ does it?
> >
> > I'm pondering these questions after running into an issue with code I'm writing under PyArrow that parses a Substrait plan represented as a dictionary. The current (and kind of shaky) parsing operation in this code uses json.dumps() on the dictionary, which results in a string that is passed to a Cython API that handles it using Arrow C++ code that has access to Substrait protobuf C++ classes. But when the Substrait plan contains a bytes-typed value, json.dumps() no longer works and fails with "TypeError: Object of type bytes is not JSON serializable". A fix for this, and a better way to parse, is to use google.protobuf.json_format.ParseDict() [1] on the dictionary. However, this invocation requires a second argument, namely a protobuf message instance to merge with. The class of this message (such as Plan) is a Substrait protobuf Python class, hence the need to access such classes from PyArrow.
> >
> > [1] https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
> >
> >
> > Yaron.
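
For reference, a rough sketch of the generation step Yaron describes earlier in the thread (running protoc with --python_out, outside of any CMake integration) could look like the following. The repo checkout location, output directory, and proto layout are assumptions, not part of the Arrow build:

    # Sketch only: generate Substrait protobuf Python classes with protoc.
    # Assumes a local checkout of https://github.com/substrait-io/substrait and
    # a protoc binary on the PATH; both directory names below are placeholders.
    import pathlib
    import subprocess

    proto_dir = pathlib.Path("substrait/proto")        # assumed repo layout
    out_dir = pathlib.Path("substrait_py_generated")   # hypothetical output dir
    out_dir.mkdir(exist_ok=True)

    proto_files = [str(p) for p in proto_dir.rglob("*.proto")]
    subprocess.run(
        ["protoc", f"--proto_path={proto_dir}", f"--python_out={out_dir}", *proto_files],
        check=True,
    )
    # out_dir now holds *_pb2.py modules (e.g. substrait/plan_pb2.py) that can be
    # imported once out_dir is added to sys.path.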
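
And a minimal sketch of the ParseDict() route from the original question, assuming the classes generated above are importable as a `substrait` package (the exact module path depends on how the protos were compiled), might look like:

    # Sketch only: build a Plan message from a dict via json_format.ParseDict(),
    # the fix suggested above for the json.dumps() bytes problem.
    # The substrait.plan_pb2 import path is an assumption based on the step above.
    import sys

    sys.path.insert(0, "substrait_py_generated")  # hypothetical output dir from above

    from google.protobuf import json_format
    from substrait import plan_pb2

    plan_dict = {
        "extensionUris": [],  # JSON field names use lowerCamelCase
        "relations": [],      # plan contents would normally go here
    }

    # Merge the dict into a fresh Plan message; raises ParseError on unknown fields.
    plan = json_format.ParseDict(plan_dict, plan_pb2.Plan())

    # The binary serialization can then be handed to the existing Cython API.
    plan_bytes = plan.SerializeToString()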