If this is to be used within pyarrow then I think it makes sense to do. Can you expand a bit more on what you are trying to do? Why do you need to modify the Substrait plan in pyarrow?
> This came up in a data-source UDF scenario where the implementation is a
> Python stream factory, i.e., a Python callable returning a callable. I think
> it would be easier to get the factory and then the stream using Python code
> than using Cython or C++.

I'm not quite certain why this requires a modification to the plan.

On Tue, Jul 5, 2022 at 7:45 AM Yaron Gvili <rt...@hotmail.com> wrote:

@Li, yes though in a new way. This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.

Yaron.

________________________________
From: Li Jin <ice.xell...@gmail.com>
Sent: Tuesday, July 5, 2022, 4:22 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

Yaron, do we need to parse the Substrait protobuf in Python so that we can get the UDFs and register them with PyArrow?

On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili <rt...@hotmail.com> wrote:

This rewriting of the package is basically what I had in mind; the `_ep` was just to signal a private package, which cannot be enforced, of course. Assuming this rewriting would indeed avoid conflict with any standard protobuf package a user might want to use, it would be nicer to do this using a robust refactoring tool. We could try Rope [1] (in particular, its folder refactoring library method [2]) for this; it should add Rope only as a build dependency. Does this sound worth trying?

[1] https://github.com/python-rope/rope
[2] https://rope.readthedocs.io/en/latest/library.html#list-of-refactorings

Yaron.

________________________________
From: Jeroen van Straten <jeroen.van.stra...@gmail.com>
Sent: Monday, July 4, 2022 12:38 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

Ah, I see. I guess that complicates things a little.

About the exposure part in C++, the idea is that if we statically link libprotobuf and use private linkage for the generated classes, *hopefully* users won't run into linking issues when they link to both Arrow and to some libprotobuf of their own, especially when they or another one of their dependencies also includes Substrait protobuf. The classes would absolutely conflict otherwise, and the libprotobuf version may or may not conflict. The latter is kind of an unsolved problem; AFAIK there have been talks to replace libprotobuf with a third-party alternative to avoid all this nonsense, but nothing concrete yet.

For Python, the problems are not quite the same:

- There is no such thing as private linkage in Python, so we *have* to re-namespace the generated Python files. Note that the output of protoc assumes that the files are in exactly the same namespace/module path/whatever as the proto files specify, so the toplevel module would be "substrait", not "arrow". See https://github.com/substrait-io/substrait/pull/207 and https://github.com/substrait-io/substrait-validator/blob/941b81ddf65d8ee6266d6eea0d151a3f9d8a512b/py/build.rs#L105-L142 for my ugly workarounds for this thus far.

- There is also no such thing as static linkage in Python, so if you'd want to use Google's protobuf implementation in Python, pyarrow *will* need to depend on it, and it'd become the user's problem to make sure that the installed Python protobuf version matches the version of their system library. It looks like pyarrow currently only depends on numpy, which is pretty awesome... so I feel like we should keep it that way.

Not sure what the best course of action is.

Jeroen
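For illustration, a minimal sketch of the kind of import re-namespacing discussed above, assuming the protoc-generated files were vendored under a hypothetical `pyarrow._substrait_ep` package (the actual substrait-validator workaround is linked above, and Rope would do this more robustly):

    import pathlib
    import re

    # Hypothetical location of the protoc-generated *_pb2.py files after
    # copying them under a private pyarrow package (name is an assumption).
    generated_root = pathlib.Path("python/pyarrow/_substrait_ep")

    for path in generated_root.rglob("*_pb2.py"):
        text = path.read_text()
        # Rewrite absolute "substrait.*" imports so they resolve against the
        # private package instead of a top-level "substrait" module that could
        # clash with other packages installed by the user.
        text = re.sub(r"^(from|import) substrait\b",
                      r"\1 pyarrow._substrait_ep.substrait",
                      text, flags=re.MULTILINE)
        path.write_text(text)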
On Sun, 3 Jul 2022 at 22:55, Yaron Gvili <rt...@hotmail.com> wrote:

Thanks, the Google protobuf exposure concerns are clear. Another concern would be consistent maintenance of the same version of Substrait protobuf used in Arrow C++ and in PyArrow.

I didn't mean access for users but internal access. That is, I didn't mean to expose Substrait protobuf Python classes to PyArrow users, just to use them internally in PyArrow code I'll be developing, per the use case I described at the start of this thread. IIUC, this should address the exposure concerns, and the `arrow/engine/substrait` module in Arrow C++ does this too. If the technical approach I just described would actually expose the classes, what would be a proper way to avoid exposing them? Perhaps the classes should be generated into a private package, e.g., under `python/_ep`? (ep stands for external project)

Yaron.

________________________________
From: Antoine Pitrou <anto...@python.org>
Sent: Sunday, July 3, 2022 3:20 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

I agree that giving direct access to protobuf classes is not Arrow's job. You can probably take the upstream (i.e. Substrait's) protobuf definitions and compile them yourself, using whatever settings are required by your project.

Regards

Antoine.

On 03/07/2022 at 21:16, Jeroen van Straten wrote:

It's not so much about whether or not we *can*, but about whether or not we *should* generate and expose these files.

Fundamentally, Substrait aims to be a means to connect different systems together by standardizing some interchange format. Currently that happens to be based on protobuf, so *one* (obvious) way to generate and interpret Substrait plans is to use Google's own protobuf implementation, as generated by protoc for various languages. It's true that there's nothing fundamentally blocking Arrow from exposing those.

However, it's by no means the *only* way to interact with protobuf serializations, and Google's own implementation is a hot mess when it comes to dependencies; there are lots of good reasons why you might not want to depend on it and opt for a third-party implementation instead. For one thing, the Python wheel depends on system libraries beyond manylinux, and if they happen to be incompatible (which is likely) it just explodes unless you set some environment variable. Therefore, assuming pyarrow doesn't already depend on protobuf, I feel like we should keep it that way, and thus that we should not include the generated Python files in the release.
Note that we don't even expose/leak the protoc-generated C++ classes for similar reasons.

Also, as Weston already pointed out, it's not really our job; Substrait aims to publish bindings for various languages by itself. It just doesn't expose them for Python *yet*. In the interim I suppose you could use the substrait-validator package from PyPI. It does expose them, as well as some convenient conversion functions, but I'm having trouble finding people to help me keep the validator maintained.

I suppose versioning would be difficult either way, since Substrait semi-regularly pushes breaking changes and Arrow currently lags behind by several months (though I have a PR open for Substrait 0.6). I guess from that point of view distributing the right version along with pyarrow seems nice, but the issues of Google's protobuf implementation remain. This being an issue at all is also very much a Substrait problem, not an Arrow problem; at best we can try to mitigate it.

Jeroen

On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:

I looked into the Arrow build system some more. It is possible to get the Python classes generated by adding a "--python_out" flag (set to a directory created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under `macro(build_substrait)` in `cpp/cmake_modules/ThirdpartyToolchain.cmake`. However, this makes them available only in the Arrow C++ build, whereas for the current purpose they need to be available in the PyArrow build. The PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS has access to `cpp/cmake_modules`. So, one solution could be to pull `macro(build_substrait)` into `python/CMakeLists.txt` and call it to generate the Python protobuf classes under `python/`, making them available for import by PyArrow code. This would probably be cleaner with some macro parameters to distinguish between C++ and Python generation.

Does this sound like a reasonable approach?

Yaron.
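As a rough illustration of the generation step being proposed here (independent of how it would be wired into CMake), something along these lines would run protoc over the unpacked Substrait sources with --python_out; the paths and the output package below are assumptions, not the actual Arrow build layout:

    import pathlib
    import subprocess

    # Assumed locations: where the Substrait sources were unpacked and where
    # the generated Python modules should land so PyArrow code can import them.
    substrait_proto_dir = pathlib.Path("substrait_ep-prefix/src/substrait_ep/proto")
    out_dir = pathlib.Path("python/pyarrow/_substrait_ep")
    out_dir.mkdir(parents=True, exist_ok=True)

    # Generate <name>_pb2.py modules for every Substrait .proto file.
    proto_files = [str(p) for p in substrait_proto_dir.rglob("*.proto")]
    subprocess.run(
        ["protoc",
         f"--proto_path={substrait_proto_dir}",
         f"--python_out={out_dir}",
         *proto_files],
        check=True,
    )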
________________________________
From: Yaron Gvili <rt...@hotmail.com>
Sent: Saturday, July 2, 2022 8:55 AM
To: dev@arrow.apache.org <dev@arrow.apache.org>; Phillip Cloud <cpcl...@gmail.com>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

I'm somewhat confused by this answer because I think resolving the issue I raised does not require any change outside PyArrow. I'll try to explain the issue differently.

First, let me describe the current situation with Substrait protobuf in Arrow C++. The Arrow C++ build system handles external projects in `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these external projects is "substrait". By default, the build system takes the source code for "substrait" from `https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in `cpp/thirdparty/versions.txt`. The source code is check-summed and unpacked in `substrait_ep-prefix` under the build directory and from this the protobuf C++ classes are generated in `*.pb.{h,cc}` files in `substrait_ep-generated` under the build directory. The build system makes a library using the `*.cc` files and makes the `*.h` files available for other C++ modules to use.

Setting up the above mechanism did not require any change in the `substrait-io/substrait` repo, nor any coordination with its authors. What I'm looking for is a similar build mechanism for PyArrow that builds Substrait protobuf Python classes and makes them available for use by other PyArrow modules. I believe this PyArrow build mechanism does not exist currently and that setting up one would not require any changes outside PyArrow. I'm asking (1) whether that's indeed the case, (2) whether others agree this mechanism is needed at least due to the problem I ran into that I previously described, and (3) for any thoughts about how to set up this mechanism assuming it is needed.

Weston, perhaps your thinking was that the Substrait protobuf Python classes need to be built by a repo in the substrait-io space and made available as a binary+headers package? This can be done but will require involving Substrait people and appears to be inconsistent with current patterns in the Arrow build system. Note that for my purposes here, the Substrait protobuf Python classes will be used for composing or interpreting a Substrait plan, not for transforming it by an optimizer, though a Python-based optimizer is a valid use case for them.

Yaron.

________________________________
From: Weston Pace <weston.p...@gmail.com>
Sent: Friday, July 1, 2022 12:42 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>; Phillip Cloud <cpcl...@gmail.com>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

Given that Acero does not do any planner / optimizer type tasks I'm not sure you will find anything like this in arrow-cpp or pyarrow. What you are describing I sometimes refer to as "plan slicing and dicing". I have wondered if we will someday need this in Acero but I fear it is a slippery slope between "a little bit of plan manipulation" and "a full blown planner" so I've shied away from it. My first spot to look would be a substrait-python repository which would belong here: https://github.com/substrait-io

However, it does not appear that such a repository exists. If you're willing to create one then a quick ask on the Substrait Slack instance should be enough to get the repository created. Perhaps there is some genesis of this library in Ibis although I think Ibis would use its own representation for slicing and dicing and only use Substrait for serialization.

Once that repository is created pyarrow could probably import it but unless this plan manipulation makes sense purely from a pyarrow perspective I would rather prefer that the user application import both pyarrow and substrait-python independently.
Perhaps @Phillip Cloud or someone from the Ibis space might have some ideas on where this might be found.

-Weston

On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:

Hi,

Is there support for accessing Substrait protobuf Python classes (such as Plan) from PyArrow? If not, how should such support be added? For example, should the PyArrow build system pull in the Substrait repo as an external project and build its protobuf Python classes, in a manner similar to how Arrow C++ does it?

I'm pondering these questions after running into an issue with code I'm writing under PyArrow that parses a Substrait plan represented as a dictionary. The current (and kind of shaky) parsing operation in this code uses json.dumps() on the dictionary, which results in a string that is passed to a Cython API that handles it using Arrow C++ code that has access to Substrait protobuf C++ classes. But when the Substrait plan contains a bytes-type value, json.dumps() no longer works and fails with "TypeError: Object of type bytes is not JSON serializable". A fix for this, and a better way to parse, is using google.protobuf.json_format.ParseDict() [1] on the dictionary. However, this invocation requires a second argument, namely a protobuf message instance to merge with. The class of this message (such as Plan) is a Substrait protobuf Python class, hence the need to access such classes from PyArrow.

[1] https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html

Yaron.
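For reference, a minimal sketch of the ParseDict() usage described above; the import path for the generated Plan class is an assumption and would depend on where PyArrow (or another package) ends up generating the classes:

    from google.protobuf import json_format
    from substrait.plan_pb2 import Plan  # assumed module path for the generated class

    # A Substrait plan represented as a dictionary (stand-in for a real plan).
    plan_dict = {"relations": []}

    # Merge the dictionary into a Plan message instance, then serialize it to
    # bytes that can be handed to the existing Cython/C++ API.
    plan = json_format.ParseDict(plan_dict, Plan())
    serialized = plan.SerializeToString()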