If this is to be used within pyarrow then I think it makes sense as
something to do.  Can you expand a bit more on what you are trying to
do?  Why do you need to modify the Substrait plan in pyarrow?

> This came up in a data-source UDF scenario where the implementation is a 
> Python stream factory, i.e., a Python callable returning a callable. I think 
> it would be easier to get the factory and then the stream using Python code 
> than using Cython or C++.

I'm not quite certain why this requires a modification to the plan.


On Tue, Jul 5, 2022 at 7:45 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> @Li, yes though in a new way. This came up in a data-source UDF scenario 
> where the implementation is a Python stream factory, i.e., a Python callable 
> returning a callable. I think it would be easier to get the factory and then 
> the stream using Python code than using Cython or C++.
>
>
> Yaron.
>
> ________________________________
> From: Li Jin <ice.xell...@gmail.com>
> Sent: Tuesday, July 5, 2022, 4:22 PM
> To: dev@arrow.apache.org <dev@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> Yaron, do we need to parse the subtrait protobuf in Python so that we can
> get the UDFs and register them with Pyarrow?
>
> On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili <rt...@hotmail.com> wrote:
>
> > This rewriting of the package is basically what I had in mind; the `_ep`
> > was just to signal a private package, which cannot be enforced, of course.
> > Assuming this rewriting would indeed avoid conflict with any standard
> > protobuf package a user might want to use, it would be nicer to do this
> > using a robust refactoring
> > tool. We could try Rope [1] (in particular, its folder refactoring library
> > method [2]) for this; it should add Rope only as a build dependency. Does
> > this sound worth trying?
> >
> > [1] https://github.com/python-rope/rope
> > [2]
> > https://rope.readthedocs.io/en/latest/library.html#list-of-refactorings
> >
> >
> > Yaron.
> > ________________________________
> > From: Jeroen van Straten <jeroen.van.stra...@gmail.com>
> > Sent: Monday, July 4, 2022 12:38 PM
> > To: dev@arrow.apache.org <dev@arrow.apache.org>
> > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> >
> > Ah, I see. I guess that complicates things a little.
> >
> > About the exposure part in C++, the idea is that if we statically link
> > libprotobuf and use private linkage for the generated classes, *hopefully*
> > users won't run into linking issues when they link to both Arrow and to
> > some libprotobuf of their own, especially when they or another one of their
> > dependencies also includes Substrait protobuf. The classes would absolutely
> > conflict otherwise, and the libprotobuf version may or may not conflict.
> > The latter is kind of an unsolved problem; AFAIK there have been talks to
> > replace libprotobuf with a third-party alternative to avoid all this
> > nonsense, but nothing concrete yet.
> >
> > For Python, the problems are not quite the same:
> >
> >  - There is no such thing as private linkage in Python, so we *have* to
> > re-namespace the generated Python files. Note that the output of protoc
> > assumes that the files are in exactly the same namespace/module
> > path/whatever as the proto files specify, so the toplevel module would be
> > "substrait", not "arrow". See
> > https://github.com/substrait-io/substrait/pull/207 and
> >
> > https://github.com/substrait-io/substrait-validator/blob/941b81ddf65d8ee6266d6eea0d151a3f9d8a512b/py/build.rs#L105-L142
> > for my ugly workarounds for this thus far.
> >  - There is also no such thing as static linkage in Python, so if you'd
> > want to use Google's protobuf implementation in Python pyarrow *will* need
> > to depend on it, and it'd become the user's problem to make sure that the
> > installed Python protobuf version matches the version of their system
> > library. It looks like pyarrow currently only depends on numpy, which is
> > pretty awesome... so I feel like we should keep it that way.
> >
> > Not sure what the best course of action is.
> >
> > Jeroen
> >
> > On Sun, 3 Jul 2022 at 22:55, Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > > Thanks, the Google protobuf exposure concerns are clear. Another concern
> > > would be consistent maintenance of the same version of Substrait protobuf
> > > used in Arrow C++ and in PyArrow.
> > >
> > > I didn't mean access to users but internally. That is, I didn't mean to
> > > expose Substrait protobuf Python classes to PyArrow users, just to use
> > them
> > > internally in PyArrow code I'll be developing, per the use case I
> > described
> > > at the start of this thread. IIUC, this should address the exposure
> > > concerns, and the `arrow/engine/substrait` module in Arrow C++ does this
> > > too. If the technical approach I just described would actually expose the
> > > classes, what would be a proper way to avoid exposing them? Perhaps the
> > > classes should be generated into a private package, e.g., under
> > > `python/_ep`? (ep stands for external project)
> > >
> > >
> > > Yaron.
> > > ________________________________
> > > From: Antoine Pitrou <anto...@python.org>
> > > Sent: Sunday, July 3, 2022 3:20 PM
> > > To: dev@arrow.apache.org <dev@arrow.apache.org>
> > > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > >
> > >
> > > I agree that giving direct access to protobuf classes is not Arrow's
> > > job. You can probably take the upstream (i.e. Substrait's) protobuf
> > > definitions and compile them yourself, using whatever settings required
> > > by your project.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 03/07/2022 à 21:16, Jeroen van Straten a écrit :
> > > > It's not so much about whether or not we *can*, but about whether or
> > not
> > > we
> > > > *should* generate and expose these files.
> > > >
> > > > Fundamentally, Substrait aims to be a means to connect different
> > systems
> > > > together by standardizing some interchange format. Currently that
> > happens
> > > > to be based on protobuf, so *one* (obvious) way to generate and
> > interpret
> > > > Substrait plans is to use Google's own protobuf implementation, as
> > > > generated by protoc for various languages. It's true that there's
> > nothing
> > > > fundamentally blocking Arrow from exposing those.
> > > >
> > > > However, it's by no means the *only* way to interact with protobuf
> > > > serializations, and Google's own implementation is a hot mess when it
> > > comes
> > > > to dependencies; there are lots of good reasons why you might not want
> > to
> > > > depend on it and opt for a third-party implementation instead. For one
> > > > thing, the Python wheel depends on system libraries beyond manylinux,
> > and
> > > > if they happen to be incompatible (which is likely) it just explodes
> > > unless
> > > > you set some environment variable. Therefore, assuming pyarrow doesn't
> > > > already depend on protobuf, I feel like we should keep it that way, and
> > > > thus that we should not include the generated Python files in the
> > > release.
> > > > Note that we don't even expose/leak the protoc-generated C++ classes
> > for
> > > > similar reasons.
> > > >
> > > > Also, as Weston already pointed out, it's not really our job; Substrait
> > > > aims to publish bindings for various languages by itself. It just
> > doesn't
> > > > expose them for Python *yet*. In the interim I suppose you could use
> > the
> > > > substrait-validator package from PyPI. It does expose them, as well as
> > > some
> > > > convenient conversion functions, but I'm having trouble finding people
> > to
> > > > help me keep the validator maintained.
> > > >
> > > > I suppose versioning would be difficult either way, since Substrait
> > > > semi-regularly pushes breaking changes and Arrow currently lags behind
> > by
> > > > several months (though I have a PR open for Substrait 0.6). I guess
> > from
> > > > that point of view distributing the right version along with pyarrow
> > > seems
> > > > nice, but the issues of Google's protobuf implementation remain. This
> > > being
> > > > an issue at all is also very much a Substrait problem, not an Arrow
> > > > problem; at best we can try to mitigate it.
> > > >
> > > > Jeroen
> > > >
> > > > On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
> > > >
> > > >> I looked into the Arrow build system some more. It is possible to get
> > > the
> > > >> Python classes generated by adding "--python-out" flag (set to a
> > > directory
> > > >> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
> > > >> `macro(build_substrait)` in
> > > `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
> > > >> However, this makes them available only in the Arrow C++ build whereas
> > > for
> > > >> the current purpose they need to be available in the PyArrow build.
> > The
> > > >> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS
> > has
> > > >> access to `cpp/cmake_modules`. So, one solution could be to pull
> > > >> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
> > > >> generate the Python protobuf classes under `python/`, making them
> > > available
> > > >> for import by PyArrow code. This would probably be cleaner with some
> > > macro
> > > >> parameters to distinguish between C++ and Python generation.
> > > >>
> > > >> Does this sound like a reasonable approach?
> > > >>
> > > >>
> > > >> Yaron.
> > > >>
> > > >> ________________________________
> > > >> From: Yaron Gvili <rt...@hotmail.com>
> > > >> Sent: Saturday, July 2, 2022 8:55 AM
> > > >> To: dev@arrow.apache.org <dev@arrow.apache.org>; Phillip Cloud <
> > > >> cpcl...@gmail.com>
> > > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > > >>
> > > >> I'm somewhat confused by this answer because I think resolving the
> > > issue I
> > > >> raised does not require any change outside PyArrow. I'll try to
> > explain
> > > the
> > > >> issue differently.
> > > >>
> > > >> First, let me describe the current situation with Substrait protobuf
> > in
> > > >> Arrow C++. The Arrow C++ build system handles external projects in
> > > >> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these
> > external
> > > >> projects is "substrait". By default, the build system takes the source
> > > code
> > > >> for "substrait" from `
> > > >>
> > >
> > https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
> > > >> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
> > > >> `cpp/thirdparty/versions.txt`. The source code is check-summed and
> > > unpacked
> > > >> in `substrait_ep-prefix` under the build directory and from this the
> > > >> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
> > > >> `substrait_ep-generated` under the build directory. The build system
> > > makes
> > > >> a library using the `*.cc` files and makes the `*.h` files available
> > for
> > > >> other C++ modules to use.
> > > >>
> > > >> Setting up the above mechanism did not require any change in the
> > > >> `substrait-io/substrait` repo, nor any coordination with its authors.
> > > What
> > > >> I'm looking for is a similar build mechanism for PyArrow that builds
> > > >> Substrait protobuf Python classes and makes them available for use by
> > > other
> > > >> PyArrow modules. I believe this PyArrow build mechanism does not exist
> > > >> currently and that setting up one would not require any changes
> > outside
> > > >> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether
> > > others
> > > >> agree this mechanism is needed at least due to the problem I ran into
> > > that
> > > >> I previously described, and (3) for any thoughts about how to set up
> > > this
> > > >> mechanism assuming it is needed.
> > > >>
> > > >> Weston, perhaps your thinking was that the Substrait protobuf Python
> > > >> classes need to be built by a repo in the substrait-io space and made
> > > >> available as a binary+headers package? This can be done but will
> > require
> > > >> involving Substrait people and appears to be inconsistent with current
> > > >> patterns in the Arrow build system. Note that for my purposes here,
> > the
> > > >> Substrait protobuf Python classes will be used for composing or
> > > >> interpreting a Substrait plan, not for transforming it by an
> > optimizer,
> > > >> though a Python-based optimizer is a valid use case for them.
> > > >>
> > > >>
> > > >> Yaron.
> > > >> ________________________________
> > > >> From: Weston Pace <weston.p...@gmail.com>
> > > >> Sent: Friday, July 1, 2022 12:42 PM
> > > >> To: dev@arrow.apache.org <dev@arrow.apache.org>; Phillip Cloud <
> > > >> cpcl...@gmail.com>
> > > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > > >>
> > > >> Given that Acero does not do any planner / optimizer type tasks I'm
> > > >> not sure you will find anything like this in arrow-cpp or pyarrow.
> > > >> What you are describing I sometimes refer to as "plan slicing and
> > > >> dicing".  I have wondered if we will someday need this in Acero but I
> > > >> fear it is a slippery slope between "a little bit of plan
> > > >> manipulation" and "a full blown planner" so I've shied away from it.
> > > >> My first spot to look would be a substrait-python repository which
> > > >> would belong here: https://github.com/substrait-io
> > > >>
> > > >> However, it does not appear that such a repository exists.  If you're
> > > >> willing to create one then a quick ask on the Substrait Slack instance
> > > >> should be enough to get the repository created.  Perhaps there is some
> > > >> genesis of this library in Ibis although I think Ibis would use its
> > > >> own representation for slicing and dicing and only use Substrait for
> > > >> serialization.
> > > >>
> > > >> Once that repository is created pyarrow could probably import it but
> > > >> unless this plan manipulation makes sense purely from a pyarrow
> > > >> perspective I would rather prefer that the user application import
> > > >> both pyarrow and substrait-python independently.
> > > >>
> > > >> Perhaps @Phillip Cloud or someone from the Ibis space might have some
> > > >> ideas on where this might be found.
> > > >>
> > > >> -Weston
> > > >>
> > > >> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com>
> > wrote:
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> Is there support for accessing Substrait protobuf Python classes
> > (such
> > > >> as Plan) from PyArrow? If not, how should such support be added? For
> > > >> example, should the PyArrow build system pull in the Substrait repo as
> > > an
> > > >> external project and build its protobuf Python classes, in a manner
> > > similar
> > > >> to how Arrow C++ does it?
> > > >>>
> > > >>> I'm pondering these questions after running into an issue with code
> > I'm
> > > >> writing under PyArrow that parses a Substrait plan represented as a
> > > >> dictionary. The current (and kind of shaky) parsing operation in this
> > > code
> > > >> uses json.dumps() on the dictionary, which results in a string that is
> > > >> passed to a Cython API that handles it using Arrow C++ code that has
> > > access
> > > >> to Substrait protobuf C++ classes. But when the Substrait plan
> > contains
> > > a
> > > >> bytes-type, json.dump() no longer works and fails with "TypeError:
> > > Object
> > > >> of type bytes is not JSON serializable". A fix for this, and a better
> > > way
> > > >> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
> > > >> dictionary. However, this invocation requires a second argument,
> > namely
> > > a
> > > >> protobuf message instance to merge with. The class of this message
> > > (such as
> > > >> Plan) is a Substrait protobuf Python class, hence the need to access
> > > such
> > > >> classes from PyArrow.
> > > >>>
> > > >>> [1]
> > > >>
> > >
> > https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
> > > >>>
> > > >>>
> > > >>> Yaron.
> > > >>
> > > >
> > >
> >
>

Reply via email to