Thanks all for the suggestions.

> A possible solution is to derive from ExecBatch your own class
I haven't tried it yet, but that was my initial thought as well. I am not sure
whether there is a more idiomatic solution for this in the query engine.
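
If I go down that route, I imagine something roughly like the sketch below
(untested; `RichExecBatch` and its metadata field are hypothetical names of
mine, not anything in Arrow), with the caveat Yaron mentioned that an
intermediate node may hand downstream a plain ExecBatch again:

// Hypothetical sketch only -- RichExecBatch is not part of Arrow.
#include <string>
#include <unordered_map>

#include <arrow/compute/exec.h>  // arrow::compute::ExecBatch

struct RichExecBatch : public arrow::compute::ExecBatch {
  // Reuse ExecBatch's constructors.
  using arrow::compute::ExecBatch::ExecBatch;

  // Arbitrary per-batch metadata carried alongside the values,
  // e.g. {"additional_filtering_required", "true"}.
  std::unordered_map<std::string, std::string> metadata;
};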

> Does the existing filter "guarantee" mechanism work for you?
I saw this field in ExecBatch, but I am not sure whether my usage would count
as abuse. So far:
1) I think this requires comparing the `guarantee` expression against the
actual filter expression, and I don't know yet how to do that;
2) my custom expression has custom predicates like `contains(keyword)`, and I
am not sure whether they can be represented as Arrow filter expressions.
Still, this is an option and I can investigate it further.
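
For my own notes, here is how I currently understand the mechanism, based on
Weston's `x > 50` / `x < 30` example quoted below. This is only a sketch; the
header paths, the `guarantee` member and the `SimplifyWithGuarantee` call are
my assumptions about the C++ API, which I still need to verify:

// Sketch only -- my understanding of the guarantee mechanism.
#include <arrow/api.h>
#include <arrow/compute/exec.h>             // arrow::compute::ExecBatch
#include <arrow/compute/exec/expression.h>  // Expression, SimplifyWithGuarantee

namespace cp = arrow::compute;

arrow::Status SkipWholeBatch(cp::ExecBatch& batch, const arrow::Schema& schema) {
  // The scanner knows every row in this batch satisfies x > 50, so it
  // attaches that predicate as the batch's guarantee.
  batch.guarantee = cp::greater(cp::field_ref("x"), cp::literal(50));

  // A downstream filter node simplifies its own filter against the guarantee.
  ARROW_ASSIGN_OR_RAISE(auto filter,
                        cp::less(cp::field_ref("x"), cp::literal(30)).Bind(schema));
  ARROW_ASSIGN_OR_RAISE(auto simplified,
                        cp::SimplifyWithGuarantee(filter, batch.guarantee));
  // `simplified` should now reduce to literal(false), meaning the whole
  // batch can be skipped without evaluating the filter row by row.
  return arrow::Status::OK();
}

The part I cannot express this way is the custom `contains(keyword)` predicate
mentioned in 2) above.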

> Can you attach your metadata as an actual column using a scalar?  This is
what we do with the __filename column today.
Thanks for the pointer. I was not aware of this and will look into it.
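
If I read the suggestion right, it would look roughly like the snippet below
(again only a sketch on my side; how the extra column shows up in the plan's
output schema is something I still need to figure out):

// Sketch only -- attaching per-batch metadata as a scalar "column",
// following my understanding of the __filename approach.
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/exec.h>  // arrow::compute::ExecBatch

namespace cp = arrow::compute;

void TagBatch(cp::ExecBatch* batch, bool additional_filtering_required) {
  // A scalar Datum broadcasts a single value across all rows of the batch,
  // so downstream nodes can read it like any other column.
  batch->values.push_back(arrow::Datum(
      std::make_shared<arrow::BooleanScalar>(additional_filtering_required)));
}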

> https://issues.apache.org/jira/browse/ARROW-12873
This seems to be exactly what I am looking for, since it would allow tagging a
batch with arbitrary information. I will keep an eye on it.

Thanks again for all the options. I will look into them and see which one
fits my case best.

Regards,
Yue


On Tue, May 10, 2022 at 3:50 AM David Li <lidav...@apache.org> wrote:

> Also see this related discussion, which petered out:
> https://issues.apache.org/jira/browse/ARROW-12873
>
> On Mon, May 9, 2022, at 15:40, Weston Pace wrote:
> > Any kind of "batch-level" information is a little tricky in the
> > execution engine because nodes are free to chop up and recombine
> > batches as they see fit.  For example, the output of a join node is
> > going to contain data from at least two different input batches.  Even
> > nodes with a single input and single output could be splitting batches
> > into smaller work items or accumulating batches into larger work
> > items.  A few thoughts come to mind:
> >
> > Does the existing filter "guarantee" mechanism work for you?  Each
> > batch can be attached an expression which is guaranteed to be true.
> > The filter node uses this expression to simplify the filter it needs
> > to apply.  For example, if your custom scanner determines that `x >
> > 50` is always true then that can be attached as a guarantee.  Later,
> > if you need to apply the filter `x < 30` then the filter node knows it
> > can exclude the entire batch based on the guarantee.  However, the
> > guarantee suffers from the above described "batch-level" problems
> > (e.g. a join node will not include guarantees in the output).
> >
> > Can you attach your metadata as an actual column using a scalar?  This
> > is what we do with the __filename column today.
> >
> > On Mon, May 9, 2022 at 5:24 AM Yaron Gvili <rt...@hotmail.com> wrote:
> >>
> >> Hi Yue,
> >>
> >> From my limited experience with the execution engine, my understanding
> >> is that the API allows streaming only an ExecBatch from one node to
> >> another. A possible solution is to derive from ExecBatch your own class,
> >> (say) RichExecBatch, that carries any extra metadata you want. If, in
> >> your execution plan, each node that expects to receive a RichExecBatch
> >> gets it directly from a sending node that makes it (both of which you
> >> could implement), then I think this could work and may be enough for your
> >> use case. However, note that when there are intermediate nodes in between
> >> such sending and receiving nodes, this may well break, because an
> >> intermediate node could output a fresh ExecBatch even when receiving a
> >> RichExecBatch as input, like filter_node does [1], for example.
> >>
> >> [1]
> >> https://github.com/apache/arrow/blob/35119f29b0e0de68b1ccc5f2066e0cc7d27fddd0/cpp/src/arrow/compute/exec/filter_node.cc#L98
> >>
> >>
> >> Yaron.
> >>
> >> ________________________________
> >> From: Yue Ni <niyue....@gmail.com>
> >> Sent: Monday, May 9, 2022 10:28 AM
> >> To: dev@arrow.apache.org <dev@arrow.apache.org>
> >> Subject: ExecBatch in arrow execution engine
> >>
> >> Hi there,
> >>
> >> I would like to use the Apache Arrow execution engine for some
> >> computation. I found that `ExecBatch`, rather than `RecordBatch`, is used
> >> by the execution engine's nodes, and I wonder how I can attach additional
> >> information such as schema/metadata to an `ExecBatch` during execution so
> >> that it can be used by a custom ExecNode.
> >>
> >> In my first use case, the computation flow looks like this:
> >>
> >> scanner <===> custom filter node <===> query client
> >>
> >> 1) The scanner is a custom scanner that loads data from disk. It accepts
> >> a pushed-down custom filter expression (not an Arrow filter expression
> >> but a homebrewed one) and uses it to avoid loading data from disk as much
> >> as possible, but it may return a superset of the matching data to the
> >> successor nodes because of the limited capability of the pushed-down
> >> filter.
> >>
> >> 2) Its successor node is a filter node, which does some additional
> >> filtering if needed. The scanner knows whether a retrieved result batch
> >> needs additional filtering, and I would like the scanner to pass some
> >> batch-specific metadata like "additional_filtering_required: true/false"
> >> along with the batch to the filter node, but I cannot figure out how this
> >> could be done with `ExecBatch`.
> >>
> >> In my other use case, I would like to attach a batch-specific schema to
> >> each batch returned by some nodes.
> >>
> >> Basically, I wonder whether, within the current framework, there is any
> >> way to attach additional execution metadata/schema to an `ExecBatch` so
> >> that it can be used by a custom exec node. Could you please help?
> >> Thanks.
>
