Thanks all for the suggestions.

> A possible solution is to derive from ExecBatch your own class

I didn't give it a try yet, but that was my initial thought as well, and I am not sure whether there is a more idiomatic solution for this in the query engine.
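For illustration, a very rough sketch of what I have in mind is below. The `RichExecBatch` name and the metadata member are just placeholders (not existing Arrow API), and this assumes Arrow C++ around 8.0, where `ExecBatch` lives in `arrow/compute/exec.h`:

```cpp
#include <memory>

#include <arrow/compute/exec.h>             // arrow::compute::ExecBatch
#include <arrow/util/key_value_metadata.h>  // arrow::KeyValueMetadata

// Hypothetical subclass that carries extra per-batch information alongside
// the usual ExecBatch values.
struct RichExecBatch : public arrow::compute::ExecBatch {
  using arrow::compute::ExecBatch::ExecBatch;

  // Arbitrary per-batch metadata, e.g. {"additional_filtering_required": "true"}.
  std::shared_ptr<const arrow::KeyValueMetadata> custom_metadata;
};
```

As noted in Yaron's reply below, any intermediate node that emits a fresh ExecBatch would silently drop the extra member, so this only works when the receiving node gets the batch directly from the node that created it.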
> Does the existing filter "guarantee" mechanism work for you?

I saw this in ExecBatch and I am not sure whether my usage would be considered an abuse of it. So far: 1) I think this requires some additional comparison between the `guarantee` filter expression and the actual filter expression, which I have no idea how to do yet; 2) my custom expression has custom predicates such as `contains(keyword)`, which I am not sure can be represented as an Arrow filter expression. But this is an option and I can investigate it further.

> Can you attach your metadata as an actual column using a scalar? This is what we do with the __filename column today.

Thanks for the pointer. I was not aware of this and will look into it.

> https://issues.apache.org/jira/browse/ARROW-12873

This seems to be exactly what I am looking for, since it allows tagging a batch with arbitrary information. I will keep an eye on it.

Thanks again for all the options. I will look into them and see which one fits my case best.

Regards,
Yue

On Tue, May 10, 2022 at 3:50 AM David Li <lidav...@apache.org> wrote:

> Also see this related discussion, which petered out:
> https://issues.apache.org/jira/browse/ARROW-12873
>
> On Mon, May 9, 2022, at 15:40, Weston Pace wrote:
> > Any kind of "batch-level" information is a little tricky in the execution engine because nodes are free to chop up and recombine batches as they see fit. For example, the output of a join node is going to contain data from at least two different input batches. Even nodes with a single input and single output could be splitting batches into smaller work items or accumulating batches into larger work items. A few thoughts come to mind:
> >
> > Does the existing filter "guarantee" mechanism work for you? Each batch can have an expression attached to it which is guaranteed to be true. The filter node uses this expression to simplify the filter it needs to apply. For example, if your custom scanner determines that `x > 50` is always true, then that can be attached as a guarantee. Later, if you need to apply the filter `x < 30`, the filter node knows it can exclude the entire batch based on the guarantee. However, the guarantee suffers from the "batch-level" problems described above (e.g. a join node will not include guarantees in its output).
> >
> > Can you attach your metadata as an actual column using a scalar? This is what we do with the __filename column today.
> >
> > On Mon, May 9, 2022 at 5:24 AM Yaron Gvili <rt...@hotmail.com> wrote:
> >>
> >> Hi Yue,
> >>
> >> From my limited experience with the execution engine, my understanding is that the API only allows streaming an ExecBatch from one node to another. A possible solution is to derive from ExecBatch your own class, say RichExecBatch, that carries any extra metadata you want. If, in your execution plan, each node that expects to receive a RichExecBatch gets it directly from a sending node that makes it (both of which you could implement), then I think this could work and may be enough for your use case. However, note that when there are intermediate nodes between such sending and receiving nodes, this may well break, because an intermediate node could output a fresh ExecBatch even when receiving a RichExecBatch as input, as filter_node does [1], for example.
> >>
> >> [1] https://github.com/apache/arrow/blob/35119f29b0e0de68b1ccc5f2066e0cc7d27fddd0/cpp/src/arrow/compute/exec/filter_node.cc#L98
> >>
> >> Yaron.
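For reference, a minimal sketch of the guarantee mechanism Weston describes above, assuming Arrow C++ around 8.0, where `ExecBatch` carries a `guarantee` Expression and the expression helpers live in `arrow/compute/exec/expression.h`:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

#include <arrow/compute/exec.h>             // arrow::compute::ExecBatch
#include <arrow/compute/exec/expression.h>  // field_ref, literal, greater
#include <arrow/datum.h>

namespace cp = arrow::compute;

// The custom scanner already knows that every row in this batch satisfies
// x > 50, so it records that fact as a guarantee. A downstream filter such
// as x < 30 can then be simplified against the guarantee, and in this case
// the whole batch can be skipped without evaluating the filter row by row.
cp::ExecBatch MakeBatchWithGuarantee(std::vector<arrow::Datum> values,
                                     int64_t length) {
  cp::ExecBatch batch(std::move(values), length);
  batch.guarantee = cp::greater(cp::field_ref("x"), cp::literal(50));
  return batch;
}
```

The filter node simplifies its own filter against this guarantee (via `SimplifyWithGuarantee`), which is how a batch whose guarantee contradicts the filter can be dropped outright.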
> >>
> >> ________________________________
> >> From: Yue Ni <niyue....@gmail.com>
> >> Sent: Monday, May 9, 2022 10:28 AM
> >> To: dev@arrow.apache.org <dev@arrow.apache.org>
> >> Subject: ExecBatch in arrow execution engine
> >>
> >> Hi there,
> >>
> >> I would like to use the Apache Arrow execution engine for some computation. I found that `ExecBatch`, rather than `RecordBatch`, is used by the execution engine's nodes, and I wonder how I can attach additional information such as a schema or metadata to an `ExecBatch` during execution so that it can be used by a custom ExecNode.
> >>
> >> In my first use case, the computation flow looks like this:
> >>
> >> scanner <===> custom filter node <===> query client
> >>
> >> 1) The scanner is a custom scanner that loads data from disk. It accepts a pushed-down custom filter expression (not an Arrow filter expression but a homebrewed one) and uses it to avoid loading data from disk as much as possible, but because of the limited capability of the pushed-down filter it may return a superset of the matching data to its successor nodes.
> >>
> >> 2) Its successor is a filter node, which does some additional filtering if needed. The scanner knows whether a retrieved batch needs additional filtering, and I would like it to pass some batch-specific metadata like "additional_filtering_required: true/false" along with the batch to the filter node, but I cannot figure out how this could be done with an `ExecBatch`.
> >>
> >> In my other use case, I would like to attach a batch-specific schema to each batch returned by some nodes.
> >>
> >> Basically, I wonder whether, within the current framework, there is any way to attach additional execution metadata or a schema to an `ExecBatch` so that it can be used by a custom exec node. Could you please help? Thanks.
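Finally, a rough sketch of the scalar-column approach Weston suggests above (similar in spirit to the `__filename` column he mentions): the per-batch flag is carried as a constant column backed by a `Scalar`. The helper and flag name are made up for illustration and assume Arrow C++ around 8.0:

```cpp
#include <memory>

#include <arrow/compute/exec.h>  // arrow::compute::ExecBatch
#include <arrow/scalar.h>        // arrow::BooleanScalar

namespace cp = arrow::compute;

// Append a constant boolean "column" to the batch as a Scalar. Downstream
// nodes can reference it like any other column, provided the corresponding
// field is also added to the node's declared output schema.
cp::ExecBatch WithFilteringFlag(cp::ExecBatch batch, bool needs_filtering) {
  batch.values.emplace_back(
      std::make_shared<arrow::BooleanScalar>(needs_filtering));
  return batch;
}
```

Because the value is a Scalar rather than a materialized Array, its cost is independent of the batch length, and unlike a subclass or out-of-band metadata it survives being copied into a fresh ExecBatch by intermediate nodes.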