I agree that giving direct access to protobuf classes is not Arrow's job. You can probably take the upstream (i.e. Substrait's) protobuf definitions and compile them yourself, using whatever settings required by your project.

Regards

Antoine.


On 03/07/2022 at 21:16, Jeroen van Straten wrote:
It's not so much about whether or not we *can*, but about whether or not we
*should* generate and expose these files.

Fundamentally, Substrait aims to be a means to connect different systems
together by standardizing some interchange format. Currently that happens
to be based on protobuf, so *one* (obvious) way to generate and interpret
Substrait plans is to use Google's own protobuf implementation, as
generated by protoc for various languages. It's true that there's nothing
fundamentally blocking Arrow from exposing those.

However, it's by no means the *only* way to interact with protobuf
serializations, and Google's own implementation is a hot mess when it comes
to dependencies; there are lots of good reasons why you might not want to
depend on it and opt for a third-party implementation instead. For one
thing, the Python wheel depends on system libraries beyond manylinux, and
if they happen to be incompatible (which is likely) it just explodes unless
you set some environment variable. Therefore, assuming pyarrow doesn't
already depend on protobuf, I feel like we should keep it that way, and
thus that we should not include the generated Python files in the release.
Note that we don't even expose/leak the protoc-generated C++ classes for
similar reasons.

Also, as Weston already pointed out, it's not really our job; Substrait
aims to publish bindings for various languages by itself. It just doesn't
expose them for Python *yet*. In the interim I suppose you could use the
substrait-validator package from PyPI. It does expose them, as well as some
convenient conversion functions, but I'm having trouble finding people to
help me keep the validator maintained.

I suppose versioning would be difficult either way, since Substrait
semi-regularly pushes breaking changes and Arrow currently lags behind by
several months (though I have a PR open for Substrait 0.6). I guess from
that point of view distributing the right version along with pyarrow seems
nice, but the issues of Google's protobuf implementation remain. This being
an issue at all is also very much a Substrait problem, not an Arrow
problem; at best we can try to mitigate it.

Jeroen

On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:

I looked into the Arrow build system some more. It is possible to get the
Python classes generated by adding a "--python_out" flag (set to a directory
created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
`macro(build_substrait)` in `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
However, this makes them available only in the Arrow C++ build whereas for
the current purpose they need to be available in the PyArrow build. The
PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS has
access to `cpp/cmake_modules`. So, one solution could be to pull
`macro(build_substrait)` into `python/CMakeLists.txt` and call it to
generate the Python protobuf classes under `python/`, making them available
for import by PyArrow code. This would probably be cleaner with some macro
parameters to distinguish between C++ and Python generation.
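A minimal sketch of what this could look like (illustrative only: apart from
`ARROW_PROTOBUF_PROTOC`, the variable names below are placeholders, not the
toolchain's actual names, and the real `macro(build_substrait)` wires the
protoc step up through an external project):

```cmake
# Hypothetical addition inside macro(build_substrait): emit Python modules
# alongside the existing C++ output. SUBSTRAIT_PYTHON_DIR and the other
# SUBSTRAIT_* variables are assumptions for this sketch.
set(SUBSTRAIT_PYTHON_DIR "${CMAKE_CURRENT_BINARY_DIR}/substrait_ep-python")
file(MAKE_DIRECTORY "${SUBSTRAIT_PYTHON_DIR}")

add_custom_command(
  OUTPUT "${SUBSTRAIT_PYTHON_DIR}/plan_pb2.py"
  COMMAND ${ARROW_PROTOBUF_PROTOC}
          --proto_path "${SUBSTRAIT_PROTO_DIR}"
          --cpp_out "${SUBSTRAIT_CPP_GEN_DIR}"
          --python_out "${SUBSTRAIT_PYTHON_DIR}"
          ${SUBSTRAIT_PROTO_FILES}
  DEPENDS ${SUBSTRAIT_PROTO_FILES}
  COMMENT "Generating Substrait protobuf sources (C++ and Python)")
```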

Does this sound like a reasonable approach?


Yaron.

________________________________
From: Yaron Gvili <rt...@hotmail.com>
Sent: Saturday, July 2, 2022 8:55 AM
To: dev@arrow.apache.org <dev@arrow.apache.org>; Phillip Cloud <
cpcl...@gmail.com>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

I'm somewhat confused by this answer because I think resolving the issue I
raised does not require any change outside PyArrow. I'll try to explain the
issue differently.

First, let me describe the current situation with Substrait protobuf in
Arrow C++. The Arrow C++ build system handles external projects in
`cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these external
projects is "substrait". By default, the build system takes the source code
for "substrait" from `
https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
`cpp/thirdparty/versions.txt`. The source code is check-summed and unpacked
in `substrait_ep-prefix` under the build directory and from this the
protobuf C++ classes are generated in `*.pb.{h,cc}` files in
`substrait_ep-generated` under the build directory. The build system makes
a library using the `*.cc` files and makes the `*.h` files available for
other C++ modules to use.

Setting up the above mechanism did not require any change in the
`substrait-io/substrait` repo, nor any coordination with its authors. What
I'm looking for is a similar build mechanism for PyArrow that builds
Substrait protobuf Python classes and makes them available for use by other
PyArrow modules. I believe this PyArrow build mechanism does not exist
currently and that setting up one would not require any changes outside
PyArrow. I'm asking (1) whether that's indeed the case, (2) whether others
agree this mechanism is needed at least due to the problem I ran into that
I previously described, and (3) for any thoughts about how to set up this
mechanism assuming it is needed.

Weston, perhaps your thinking was that the Substrait protobuf Python
classes need to be built by a repo in the substrait-io space and made
available as a binary+headers package? This can be done but will require
involving Substrait people and appears to be inconsistent with current
patterns in the Arrow build system. Note that for my purposes here, the
Substrait protobuf Python classes will be used for composing or
interpreting a Substrait plan, not for transforming it by an optimizer,
though a Python-based optimizer is a valid use case for them.


Yaron.
________________________________
From: Weston Pace <weston.p...@gmail.com>
Sent: Friday, July 1, 2022 12:42 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>; Phillip Cloud <
cpcl...@gmail.com>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

Given that Acero does not do any planner / optimizer type tasks I'm
not sure you will find anything like this in arrow-cpp or pyarrow.
What you are describing I sometimes refer to as "plan slicing and
dicing".  I have wondered if we will someday need this in Acero but I
fear it is a slippery slope between "a little bit of plan
manipulation" and "a full blown planner" so I've shied away from it.
My first spot to look would be a substrait-python repository which
would belong here: https://github.com/substrait-io

However, it does not appear that such a repository exists.  If you're
willing to create one then a quick ask on the Substrait Slack instance
should be enough to get the repository created.  Perhaps there is some
genesis of this library in Ibis although I think Ibis would use its
own representation for slicing and dicing and only use Substrait for
serialization.

Once that repository is created pyarrow could probably import it but
unless this plan manipulation makes sense purely from a pyarrow
perspective I would prefer that the user application import
both pyarrow and substrait-python independently.

Perhaps @Phillip Cloud or someone from the Ibis space might have some
ideas on where this might be found.

-Weston

On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:

Hi,

Is there support for accessing Substrait protobuf Python classes (such
as Plan) from PyArrow? If not, how should such support be added? For
example, should the PyArrow build system pull in the Substrait repo as an
external project and build its protobuf Python classes, in a manner similar
to how Arrow C++ does it?

I'm pondering these questions after running into an issue with code I'm
writing under PyArrow that parses a Substrait plan represented as a
dictionary. The current (and kind of shaky) parsing operation in this code
uses json.dumps() on the dictionary, which results in a string that is
passed to a Cython API that handles it using Arrow C++ code that has access
to Substrait protobuf C++ classes. But when the Substrait plan contains a
bytes value, json.dumps() fails with "TypeError: Object
of type bytes is not JSON serializable". A fix for this, and a better way
to parse, is using google.protobuf.json_format.ParseDict() [1] on the
dictionary. However, this invocation requires a second argument, namely a
protobuf message instance to merge with. The class of this message (such as
Plan) is a Substrait protobuf Python class, hence the need to access such
classes from PyArrow.

[1]
https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
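For what it's worth, the failure mode and the proposed fix can be sketched as
follows. The `substrait.plan_pb2` import path at the end is an assumption; it
stands for exactly the generated module that PyArrow does not currently ship.

```python
import json

# A dict-shaped Substrait plan holding only JSON-native values serializes fine:
plan_dict = {"relations": [{"root": {"names": ["out"]}}]}
json.dumps(plan_dict)

# But any bytes value (e.g. an embedded advanced_extension payload) breaks it:
try:
    json.dumps({"advanced_extension": b"\x08\x01"})
except TypeError as err:
    print(err)  # Object of type bytes is not JSON serializable

# The more robust route described above (sketch only; the generated module
# is hypothetical here, which is precisely the missing piece):
#
#   from google.protobuf.json_format import ParseDict
#   from substrait.plan_pb2 import Plan
#   plan_msg = ParseDict(plan_dict, Plan())
```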


Yaron.

