Yaron, do we need to parse the Substrait protobuf in Python so that we can get the UDFs and register them with PyArrow?
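(For concreteness, the flow being asked about might look roughly like the sketch below. It is only an illustration: it assumes the generated classes end up importable as `substrait.plan_pb2` -- the exact module path is one of the open questions in the quoted thread -- and it uses PyArrow's scalar-UDF registration API, `pyarrow.compute.register_scalar_function`, as a stand-in for whatever registration hook is actually appropriate here. The function names are made up.)

    import pyarrow as pa
    import pyarrow.compute as pc

    # Hypothetical import path; where (and whether) PyArrow exposes the
    # generated Substrait classes is exactly what this thread discusses.
    from substrait.plan_pb2 import Plan


    def declared_extension_functions(plan_bytes):
        """Yield (extension_uri, function_name) for each UDF a plan declares."""
        plan = Plan()
        plan.ParseFromString(plan_bytes)
        uris = {u.extension_uri_anchor: u.uri for u in plan.extension_uris}
        for decl in plan.extensions:
            if decl.HasField("extension_function"):
                func = decl.extension_function
                yield uris.get(func.extension_uri_reference, ""), func.name


    # A Python implementation could then be registered under the declared
    # name so plan execution can resolve it.  (Scalar-UDF registration as in
    # pyarrow.compute; whether this is the right hook for Substrait
    # extension functions is an assumption of this sketch.)
    def add_one(ctx, x):
        return pc.add(x, 1, memory_pool=ctx.memory_pool)


    pc.register_scalar_function(
        add_one,
        "add_one",  # must match the function name declared in the plan
        {"summary": "add one", "description": "example UDF"},
        {"x": pa.int64()},
        pa.int64(),
    )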
On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili <rt...@hotmail.com> wrote:
> This rewriting of the package is basically what I had in mind; the `_ep` was just to signal a private package, which cannot be enforced, of course. Assuming this rewriting would indeed avoid conflict with any standard protobuf package a user might want to use, it would be nicer to do this using a robust refactoring tool. We could try Rope [1] (in particular, its folder refactoring library method [2]) for this; it should add Rope only as a build dependency. Does this sound worth trying?
>
> [1] https://github.com/python-rope/rope
> [2] https://rope.readthedocs.io/en/latest/library.html#list-of-refactorings
>
> Yaron.
>
> ________________________________
> From: Jeroen van Straten <jeroen.van.stra...@gmail.com>
> Sent: Monday, July 4, 2022 12:38 PM
> To: dev@arrow.apache.org <dev@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> Ah, I see. I guess that complicates things a little.
>
> About the exposure part in C++, the idea is that if we statically link libprotobuf and use private linkage for the generated classes, *hopefully* users won't run into linking issues when they link to both Arrow and to some libprotobuf of their own, especially when they or another one of their dependencies also includes Substrait protobuf. The classes would absolutely conflict otherwise, and the libprotobuf version may or may not conflict. The latter is kind of an unsolved problem; AFAIK there have been talks to replace libprotobuf with a third-party alternative to avoid all this nonsense, but nothing concrete yet.
>
> For Python, the problems are not quite the same:
>
> - There is no such thing as private linkage in Python, so we *have* to re-namespace the generated Python files. Note that the output of protoc assumes that the files are in exactly the same namespace/module path/whatever as the proto files specify, so the toplevel module would be "substrait", not "arrow". See https://github.com/substrait-io/substrait/pull/207 and https://github.com/substrait-io/substrait-validator/blob/941b81ddf65d8ee6266d6eea0d151a3f9d8a512b/py/build.rs#L105-L142 for my ugly workarounds for this thus far.
> - There is also no such thing as static linkage in Python, so if you'd want to use Google's protobuf implementation in Python pyarrow *will* need to depend on it, and it'd become the user's problem to make sure that the installed Python protobuf version matches the version of their system library. It looks like pyarrow currently only depends on numpy, which is pretty awesome... so I feel like we should keep it that way.
>
> Not sure what the best course of action is.
>
> Jeroen
>
> On Sun, 3 Jul 2022 at 22:55, Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > Thanks, the Google protobuf exposure concerns are clear. Another concern would be consistent maintenance of the same version of Substrait protobuf used in Arrow C++ and in PyArrow.
> >
> > I didn't mean access to users but internally. That is, I didn't mean to expose Substrait protobuf Python classes to PyArrow users, just to use them internally in PyArrow code I'll be developing, per the use case I described at the start of this thread. IIUC, this should address the exposure concerns, and the `arrow/engine/substrait` module in Arrow C++ does this too. If the technical approach I just described would actually expose the classes, what would be a proper way to avoid exposing them? Perhaps the classes should be generated into a private package, e.g., under `python/_ep`? (ep stands for external project)
> >
> > Yaron.
> >
> > ________________________________
> > From: Antoine Pitrou <anto...@python.org>
> > Sent: Sunday, July 3, 2022 3:20 PM
> > To: dev@arrow.apache.org <dev@arrow.apache.org>
> > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> >
> > I agree that giving direct access to protobuf classes is not Arrow's job. You can probably take the upstream (i.e. Substrait's) protobuf definitions and compile them yourself, using whatever settings required by your project.
> >
> > Regards
> >
> > Antoine.
> >
> > On 03/07/2022 at 21:16, Jeroen van Straten wrote:
> > > It's not so much about whether or not we *can*, but about whether or not we *should* generate and expose these files.
> > >
> > > Fundamentally, Substrait aims to be a means to connect different systems together by standardizing some interchange format. Currently that happens to be based on protobuf, so *one* (obvious) way to generate and interpret Substrait plans is to use Google's own protobuf implementation, as generated by protoc for various languages. It's true that there's nothing fundamentally blocking Arrow from exposing those.
> > >
> > > However, it's by no means the *only* way to interact with protobuf serializations, and Google's own implementation is a hot mess when it comes to dependencies; there are lots of good reasons why you might not want to depend on it and opt for a third-party implementation instead. For one thing, the Python wheel depends on system libraries beyond manylinux, and if they happen to be incompatible (which is likely) it just explodes unless you set some environment variable. Therefore, assuming pyarrow doesn't already depend on protobuf, I feel like we should keep it that way, and thus that we should not include the generated Python files in the release. Note that we don't even expose/leak the protoc-generated C++ classes for similar reasons.
> > >
> > > Also, as Weston already pointed out, it's not really our job; Substrait aims to publish bindings for various languages by itself. It just doesn't expose them for Python *yet*. In the interim I suppose you could use the substrait-validator package from PyPI. It does expose them, as well as some convenient conversion functions, but I'm having trouble finding people to help me keep the validator maintained.
> > >
> > > I suppose versioning would be difficult either way, since Substrait semi-regularly pushes breaking changes and Arrow currently lags behind by several months (though I have a PR open for Substrait 0.6). I guess from that point of view distributing the right version along with pyarrow seems nice, but the issues of Google's protobuf implementation remain. This being an issue at all is also very much a Substrait problem, not an Arrow problem; at best we can try to mitigate it.
> > >
> > > Jeroen
> > >
> > > On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
> > >
> > >> I looked into the Arrow build system some more. It is possible to get the Python classes generated by adding "--python-out" flag (set to a directory created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under `macro(build_substrait)` in `cpp/cmake_modules/ThirdpartyToolchain.cmake`. However, this makes them available only in the Arrow C++ build whereas for the current purpose they need to be available in the PyArrow build. The PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS has access to `cpp/cmake_modules`. So, one solution could be to pull `macro(build_substrait)` into `python/CMakeLists.txt` and call it to generate the Python protobuf classes under `python/`, making them available for import by PyArrow code. This would probably be cleaner with some macro parameters to distinguish between C++ and Python generation.
> > >>
> > >> Does this sound like a reasonable approach?
> > >>
> > >> Yaron.
> > >>
> > >> ________________________________
> > >> From: Yaron Gvili <rt...@hotmail.com>
> > >> Sent: Saturday, July 2, 2022 8:55 AM
> > >> To: dev@arrow.apache.org <dev@arrow.apache.org>; Phillip Cloud <cpcl...@gmail.com>
> > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > >>
> > >> I'm somewhat confused by this answer because I think resolving the issue I raised does not require any change outside PyArrow. I'll try to explain the issue differently.
> > >>
> > >> First, let me describe the current situation with Substrait protobuf in Arrow C++. The Arrow C++ build system handles external projects in `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these external projects is "substrait". By default, the build system takes the source code for "substrait" from `https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in `cpp/thirdparty/versions.txt`. The source code is check-summed and unpacked in `substrait_ep-prefix` under the build directory and from this the protobuf C++ classes are generated in `*.pb.{h,cc}` files in `substrait_ep-generated` under the build directory. The build system makes a library using the `*.cc` files and makes the `*.h` files available for other C++ modules to use.
> > >>
> > >> Setting up the above mechanism did not require any change in the `substrait-io/substrait` repo, nor any coordination with its authors. What I'm looking for is a similar build mechanism for PyArrow that builds Substrait protobuf Python classes and makes them available for use by other PyArrow modules. I believe this PyArrow build mechanism does not exist currently and that setting up one would not require any changes outside PyArrow. I'm asking (1) whether that's indeed the case, (2) whether others agree this mechanism is needed at least due to the problem I ran into that I previously described, and (3) for any thoughts about how to set up this mechanism assuming it is needed.
> > >>
> > >> Weston, perhaps your thinking was that the Substrait protobuf Python classes need to be built by a repo in the substrait-io space and made available as a binary+headers package? This can be done but will require involving Substrait people and appears to be inconsistent with current patterns in the Arrow build system. Note that for my purposes here, the Substrait protobuf Python classes will be used for composing or interpreting a Substrait plan, not for transforming it by an optimizer, though a Python-based optimizer is a valid use case for them.
> > >>
> > >> Yaron.
> > >>
> > >> ________________________________
> > >> From: Weston Pace <weston.p...@gmail.com>
> > >> Sent: Friday, July 1, 2022 12:42 PM
> > >> To: dev@arrow.apache.org <dev@arrow.apache.org>; Phillip Cloud <cpcl...@gmail.com>
> > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > >>
> > >> Given that Acero does not do any planner / optimizer type tasks I'm not sure you will find anything like this in arrow-cpp or pyarrow. What you are describing I sometimes refer to as "plan slicing and dicing". I have wondered if we will someday need this in Acero but I fear it is a slippery slope between "a little bit of plan manipulation" and "a full blown planner" so I've shied away from it. My first spot to look would be a substrait-python repository which would belong here: https://github.com/substrait-io
> > >>
> > >> However, it does not appear that such a repository exists. If you're willing to create one then a quick ask on the Substrait Slack instance should be enough to get the repository created. Perhaps there is some genesis of this library in Ibis although I think Ibis would use its own representation for slicing and dicing and only use Substrait for serialization.
> > >>
> > >> Once that repository is created pyarrow could probably import it but unless this plan manipulation makes sense purely from a pyarrow perspective I would rather prefer that the user application import both pyarrow and substrait-python independently.
> > >>
> > >> Perhaps @Phillip Cloud or someone from the Ibis space might have some ideas on where this might be found.
> > >>
> > >> -Weston
> > >>
> > >> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> Is there support for accessing Substrait protobuf Python classes (such as Plan) from PyArrow? If not, how should such support be added? For example, should the PyArrow build system pull in the Substrait repo as an external project and build its protobuf Python classes, in a manner similar to how Arrow C++ does it?
> > >>>
> > >>> I'm pondering these questions after running into an issue with code I'm writing under PyArrow that parses a Substrait plan represented as a dictionary. The current (and kind of shaky) parsing operation in this code uses json.dumps() on the dictionary, which results in a string that is passed to a Cython API that handles it using Arrow C++ code that has access to Substrait protobuf C++ classes. But when the Substrait plan contains a bytes-type, json.dumps() no longer works and fails with "TypeError: Object of type bytes is not JSON serializable". A fix for this, and a better way to parse, is using google.protobuf.json_format.ParseDict() [1] on the dictionary. However, this invocation requires a second argument, namely a protobuf message instance to merge with. The class of this message (such as Plan) is a Substrait protobuf Python class, hence the need to access such classes from PyArrow.
> > >>>
> > >>> [1] https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
> > >>>
> > >>> Yaron.
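(To make that last point concrete: the ParseDict-based flow described in the original email would look roughly like the sketch below. The `substrait.plan_pb2` import path is a placeholder for wherever the generated classes end up living, which is the open question of this thread.)

    from google.protobuf import json_format

    # Placeholder import path for the generated Substrait classes.
    from substrait.plan_pb2 import Plan


    def plan_dict_to_bytes(plan_dict):
        """Turn a dict-form Substrait plan into serialized protobuf bytes.

        Unlike json.dumps(), ParseDict() needs an instance of the target
        message class (here Plan) to merge the dictionary into, so bytes
        fields no longer trip up the conversion path.
        """
        plan = json_format.ParseDict(plan_dict, Plan())
        return plan.SerializeToString()


    # The resulting bytes could then be handed to the Cython API backed by
    # the Arrow C++ Substrait consumer mentioned above.
    serialized = plan_dict_to_bytes({"relations": []})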
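(And on the re-namespacing problem raised earlier in the thread: instead of the Rope-based refactoring Yaron suggests, one rough alternative is a textual rewrite of the imports in the protoc output, similar in spirit to the substrait-validator workaround Jeroen links. The target package `pyarrow._substrait_proto` is invented for this sketch, and a regex rewrite like this is inherently fragile.)

    import re
    from pathlib import Path

    # Invented private package holding the protoc --python_out output.
    GENERATED_DIR = Path("python/pyarrow/_substrait_proto")

    # protoc-generated modules import each other assuming the toplevel
    # package is "substrait" (e.g. "from substrait.extensions import
    # extensions_pb2"), so point those imports at the private package.
    IMPORT_RE = re.compile(r"^(\s*(?:from|import)\s+)substrait\b", re.MULTILINE)

    for module in GENERATED_DIR.rglob("*_pb2.py"):
        text = module.read_text()
        module.write_text(IMPORT_RE.sub(r"\1pyarrow._substrait_proto", text))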