Given that Acero does not do any planner / optimizer type tasks I'm not sure you will find anything like this in arrow-cpp or pyarrow. What you are describing I sometimes refer to as "plan slicing and dicing". I have wondered if we will someday need this in Acero but I fear it is a slippery slope between "a little bit of plan manipulation" and "a full blown planner" so I've shied away from it. My first spot to look would be a substrait-python repository which would belong here: https://github.com/substrait-io
However, it does not appear that such a repository exists. If you're willing to create one then a quick ask on the Substrait Slack instance should be enough to get the repository created. Perhaps there is some genesis of this library in Ibis although I think Ibis would use its own representation for slicing and dicing and only use Substrait for serialization. Once that repository is created pyarrow could probably import it but unless this plan manipulation makes sense purely from a pyarrow perspective I would rather prefer that the user application import both pyarrow and substrait-python independently. Perhaps @Phillip Cloud or someone from the Ibis space might have some ideas on where this might be found. -Weston On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote: > > Hi, > > Is there support for accessing Substrait protobuf Python classes (such as > Plan) from PyArrow? If not, how should such support be added? For example, > should the PyArrow build system pull in the Substrait repo as an external > project and build its protobuf Python classes, in a manner similar to how > Arrow C++ does it? > > I'm pondering these questions after running into an issue with code I'm > writing under PyArrow that parses a Substrait plan represented as a > dictionary. The current (and kind of shaky) parsing operation in this code > uses json.dumps() on the dictionary, which results in a string that is passed > to a Cython API that handles it using Arrow C++ code that has access to > Substrait protobuf C++ classes. But when the Substrait plan contains a > bytes-type, json.dump() no longer works and fails with "TypeError: Object of > type bytes is not JSON serializable". A fix for this, and a better way to > parse, is using google.protobuf.json_format.ParseDict() [1] on the > dictionary. However, this invocation requires a second argument, namely a > protobuf message instance to merge with. The class of this message (such as > Plan) is a Substrait protobuf Python class, hence the need to access such > classes from PyArrow. > > [1] > https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html > > > Yaron.