Any kind of "batch-level" information is a little tricky in the
execution engine because nodes are free to chop up and recombine
batches as they see fit.  For example, the output of a join node is
going to contain data from at least two different input batches.  Even
nodes with a single input and a single output may split batches into
smaller work items or accumulate them into larger ones.  A few
thoughts come to mind:

Does the existing filter "guarantee" mechanism work for you?  Each
batch can carry an attached expression that is guaranteed to be true.
The filter node uses this expression to simplify the filter it needs
to apply.  For example, if your custom scanner determines that `x >
50` is always true then that can be attached as a guarantee.  Later,
if the filter `x < 30` needs to be applied then the filter node knows
it can exclude the entire batch based on the guarantee.  However, the
guarantee suffers from the batch-level problems described above
(e.g. a join node will not include guarantees in its output).
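
A minimal sketch of that flow (assuming Arrow C++ around 8.x; header
paths have moved between releases, and the schema here is
illustrative):

    #include "arrow/api.h"
    #include "arrow/compute/exec.h"
    #include "arrow/compute/exec/expression.h"

    namespace cp = arrow::compute;

    arrow::Status MaybeSkipBatch(cp::ExecBatch batch) {
      auto schema = arrow::schema({arrow::field("x", arrow::int64())});

      // The scanner attaches what it knows to be true for this batch:
      batch.guarantee = cp::greater(cp::field_ref("x"), cp::literal(50));

      // The filter node binds its filter against the input schema, then
      // simplifies it using the batch's guarantee.  `x < 30` contradicts
      // `x > 50`, so the simplified filter is unsatisfiable and the
      // entire batch can be skipped.
      ARROW_ASSIGN_OR_RAISE(
          cp::Expression filter,
          cp::less(cp::field_ref("x"), cp::literal(30)).Bind(*schema));
      ARROW_ASSIGN_OR_RAISE(
          cp::Expression simplified,
          cp::SimplifyWithGuarantee(filter, batch.guarantee));
      if (!simplified.IsSatisfiable()) {
        // drop the batch without evaluating the filter row by row
      }
      return arrow::Status::OK();
    }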

Can you attach your metadata as an actual column using a scalar?  This
is what we do with the __filename column today.
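
A minimal sketch of tagging a batch this way (the helper and the extra
field name are illustrative):

    #include "arrow/api.h"
    #include "arrow/compute/exec.h"

    namespace cp = arrow::compute;

    // Appends a constant column to the batch.  A scalar Datum acts as a
    // column whose value repeats for every row, so no per-row array has
    // to be materialized.
    void TagBatch(cp::ExecBatch* batch, bool needs_filtering) {
      batch->values.emplace_back(arrow::Datum(needs_filtering));
    }

The node's declared output schema would also need to include the extra
field (e.g. "__additional_filtering_required") so that downstream nodes
can address it by name.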

On Mon, May 9, 2022 at 5:24 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> Hi Yue,
>
> From my limited experience with the execution engine, my understanding is
> that the API allows streaming only an ExecBatch from one node to another. A
> possible solution is to derive your own class, say RichExecBatch, from
> ExecBatch, carrying any extra metadata you want. If, in your execution
> plan, each node that expects a RichExecBatch receives it directly from a
> sending node that creates it (both of which you could implement), then I
> think this could work and may be enough for your use case. However, note
> that when there are intermediate nodes between such sending and receiving
> nodes, this may well break, because an intermediate node could output a
> fresh ExecBatch even when receiving a RichExecBatch as input, as
> filter_node does [1], for example.
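>
> A minimal sketch of this idea (the type and member names are
> illustrative):
>
>     #include "arrow/compute/exec.h"
>     #include "arrow/util/key_value_metadata.h"
>
>     namespace cp = arrow::compute;
>
>     // Carries batch-specific metadata alongside the usual payload.
>     struct RichExecBatch : public cp::ExecBatch {
>       std::shared_ptr<const arrow::KeyValueMetadata> app_metadata;
>     };
>
> Note, too, that ExecNode::InputReceived takes its ExecBatch by value,
> so the extra members survive only between nodes that agree to exchange
> the derived type by pointer or reference instead.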
>
> [1] 
> https://github.com/apache/arrow/blob/35119f29b0e0de68b1ccc5f2066e0cc7d27fddd0/cpp/src/arrow/compute/exec/filter_node.cc#L98
>
>
> Yaron.
>
> ________________________________
> From: Yue Ni <niyue....@gmail.com>
> Sent: Monday, May 9, 2022 10:28 AM
> To: dev@arrow.apache.org <dev@arrow.apache.org>
> Subject: ExecBatch in arrow execution engine
>
> Hi there,
>
> I would like to use the Apache Arrow execution engine for some computation.
> I found that `ExecBatch`, rather than `RecordBatch`, is used by the
> execution engine's nodes, and I wonder how I can attach additional
> information such as a schema or metadata to an `ExecBatch` during
> execution so that it can be used by a custom ExecNode.
>
> In my first use case, the computation flow looks like this:
>
> scanner <===> custom filter node <===> query client
>
> 1) The scanner is a custom scanner that loads data from disk. It accepts a
> pushed-down custom filter expression (not an Arrow filter expression but a
> homebrewed one) and uses it to avoid loading data from disk as much as
> possible. However, because of the limited capability of the pushed-down
> filter, it may return a superset of the matching data to its successor
> nodes.
>
> 2) Its successor node is a filter node, which does some additional
> filtering if needed. The scanner knows whether or not a retrieved result
> batch needs additional filtering, and I would like the scanner to pass
> some batch-specific metadata such as "additional_filtering_required:
> true/false" along with the batch to the filter node, but I cannot figure
> out how this could be done with `ExecBatch`.
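>
> Roughly, this is what I wish I could write (a purely hypothetical
> sketch; as far as I can tell no such metadata field exists on
> `ExecBatch` today, which is exactly my question):
>
>     // scanner side (hypothetical API)
>     batch.metadata["additional_filtering_required"] = "true";
>
>     // filter node side (hypothetical API)
>     if (batch.metadata["additional_filtering_required"] == "true") {
>       // run the extra filtering pass
>     }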
>
> In my other use case, I would like to attach a batch-specific schema to
> each batch returned by some nodes.
>
> Basically, I wonder whether, within the current framework, there is any
> way to attach additional execution metadata or a schema to an `ExecBatch`
> so that it can be used by a custom exec node. Could you please help?
> Thanks.
