Hi Yue,

From my limited experience with the execution engine, my understanding is
that the API allows streaming only an ExecBatch from one node to another. A
possible solution is to derive your own class from ExecBatch, say
RichExecBatch, that carries any extra metadata you want. If, in your
execution plan, each node that expects to receive a RichExecBatch gets it
directly from a sending node that makes one (both of which you could
implement), then I think this could work and may be enough for your use
case. However, note that when there are intermediate nodes between such
sending and receiving nodes, this may well break, because an intermediate
node could output a fresh ExecBatch even when receiving a RichExecBatch as
input, as filter_node does [1], for example.

[1] 
https://github.com/apache/arrow/blob/35119f29b0e0de68b1ccc5f2066e0cc7d27fddd0/cpp/src/arrow/compute/exec/filter_node.cc#L98
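
For illustration, here is a minimal sketch of the idea (RichExecBatch and
its metadata member are names I am making up; the only Arrow types it
assumes are arrow::compute::ExecBatch and arrow::KeyValueMetadata):

#include <memory>

#include <arrow/compute/exec.h>            // arrow::compute::ExecBatch
#include <arrow/util/key_value_metadata.h> // arrow::KeyValueMetadata

// Hypothetical subclass that flows through the plan in place of a plain
// ExecBatch, carrying extra per-batch metadata alongside the usual values.
struct RichExecBatch : public arrow::compute::ExecBatch {
  using arrow::compute::ExecBatch::ExecBatch;  // inherit the constructors

  // Arbitrary per-batch key/value metadata, e.g.
  // "additional_filtering_required" -> "true", set by the sending node and
  // read back by a receiving node that knows its input produces these.
  std::shared_ptr<const arrow::KeyValueMetadata> metadata;
};

One caveat: as far as I know, ExecBatch is a plain struct with no virtual
functions, so a receiving node cannot use dynamic_cast to detect the
subclass; it has to know from the plan's structure that its input produces
RichExecBatch instances, which is also why intermediate nodes that rebuild
their output batches break the scheme.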


Yaron.

________________________________
From: Yue Ni <niyue....@gmail.com>
Sent: Monday, May 9, 2022 10:28 AM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: ExecBatch in arrow execution engine

Hi there,

I would like to use the Apache Arrow execution engine for some computation.
I found that `ExecBatch`, instead of `RecordBatch`, is used by the execution
engine's nodes, and I wonder how I can attach some additional information,
such as schema/metadata, to the `ExecBatch` during execution so that it can
be used by a custom ExecNode.

In my first use case, the computation flow looks like this:

scanner <===> custom filter node <===> query client

1) The scanner is a custom scanner that loads data from disk. It accepts a
pushed-down custom filter expression (not an arrow filter expression but a
homebrewed one) and uses it to avoid loading data from disk as much as
possible. However, it may return a superset of the matching data to the
successor nodes because of the limited capability of the pushed-down filter.

2) Its successor node is a filter node, which does some additional filtering
if needed. The scanner knows whether a retrieved result batch needs
additional filtering, and I would like the scanner to pass some
batch-specific metadata like "additional_filtering_required: true/false"
along with the batch to the filter node, but I cannot figure out how this
could be done with the `ExecBatch` (see the sketch below for what I have in
mind).
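
To make the intent concrete, here is a hypothetical sketch of what I would
like the filter node to be able to do (the TaggedBatch wrapper, the flag
name, and OnBatchReceived are made up for illustration; only
arrow::compute::ExecBatch is a real Arrow type):

#include <arrow/compute/exec.h>  // arrow::compute::ExecBatch

// Hypothetical pairing of a batch with scanner-produced metadata.
struct TaggedBatch {
  arrow::compute::ExecBatch batch;
  bool additional_filtering_required = false;  // set by the scanner
};

// Sketch of the custom filter node's per-batch handling: skip the extra
// pass when the scanner has already applied the filter exactly.
void OnBatchReceived(const TaggedBatch& tagged) {
  if (!tagged.additional_filtering_required) {
    // The pushed-down filter was exact; forward tagged.batch downstream
    // unchanged.
    return;
  }
  // Otherwise, apply the residual filter expression to tagged.batch here.
}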

In my other use case, I would like to attach a batch-specific schema to each
batch returned by certain nodes.

Basically, I wonder whether, within the current framework, there is any way
I could attach some additional execution metadata/schema to the `ExecBatch`
so that it can be used by a custom exec node. Could you please help?
Thanks.
