> "The consumer of a Scan does not need to know how it is implemented,
> only that a uniform API is provided to obtain the next RecordBatch
> with a known schema." I interpret this as: the `Scan` operator may
> produce multiple RecordBatches, and each of them should have a known
> schema, but the next batch's schema could be different from the
> previous batch's. Is this understanding correct?

No.  The compute/query engine expects that all batches share the same
schema.

However, it is often the case that files in a dataset have different
schemas.  One way to handle this is to "promote up" to a common data
type for a given column.  For example, if a column is int32 in some
files and int64 in others, promote all to int64.  "Schema evolution"
is one term I've seen thrown around for this process.
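To make the "promote up" idea concrete, here is a minimal sketch in plain C++ (not the actual Arrow API; `PromoteType` and `PromoteValues` are hypothetical helper names, and real schema evolution handles far more type pairs than this):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: pick a common data type for a column whose
// fragments disagree (e.g. some files store int32, others int64).
std::string PromoteType(const std::string& a, const std::string& b) {
  if (a == b) return a;
  if ((a == "int32" && b == "int64") || (a == "int64" && b == "int32")) {
    return "int64";  // promote the narrower type up to the wider one
  }
  return "unsupported";  // a real implementation covers many more cases
}

// Widening the values themselves is then a lossless cast.
std::vector<int64_t> PromoteValues(const std::vector<int32_t>& in) {
  return std::vector<int64_t>(in.begin(), in.end());
}
```

The key property is that the promotion is lossless: every int32 value is representable as int64, so batches from both kinds of files can be emitted under one schema.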

I believe the plan is to eventually have this promotion happen during
the scanning process.  So the scanner has to deal with fragments that
may have differing schemas, but by the time the data reaches the query
engine the record batches all have the consistent common schema.

This schema evolution work has not been done in the scanner yet, so if
you are interested you might want to look at ARROW-11003 [1].  Or did
you have a different use case for multiple schemas in mind that
doesn't quite fit the "promote to common schema" case?

[1] https://issues.apache.org/jira/browse/ARROW-11003
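The contract stated at the top of this reply, that all batches share the declared schema, can be sketched with hypothetical stand-in types (these are not `arrow::Schema`/`arrow::RecordBatch`, just plain structs to illustrate the invariant a consumer relies on):

```cpp
#include <string>
#include <vector>

// Stand-in for arrow::Schema: just an ordered list of field type names.
struct Schema {
  std::vector<std::string> field_types;
  bool Equals(const Schema& other) const {
    return field_types == other.field_types;
  }
};

// Stand-in for arrow::RecordBatch: each batch carries its schema.
struct RecordBatch {
  Schema schema;
};

// The invariant the engine expects: every batch a node emits matches
// the node's declared output schema.
bool AllBatchesMatch(const Schema& output_schema,
                     const std::vector<RecordBatch>& batches) {
  for (const auto& b : batches) {
    if (!b.schema.Equals(output_schema)) return false;
  }
  return true;
}
```

A node whose batches would fail this check is what the question below is asking about, and the answer is that the engine assumes the check always passes.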


On Sun, Nov 14, 2021 at 7:03 PM Yue Ni <niyue....@gmail.com> wrote:
>
> Hi there,
>
> I am evaluating Apache Arrow C++ compute engine for my project, and wonder
> what the schema assumption is for execution operators in the compute
> engine.
>
> In my use case, multiple record batches for computation may have different
> schemas. I read the Apache Arrow Query Engine for C++ design doc (
> https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4), and
> for the `Scan` operator, it is said "The consumer of a Scan does not need
> to know how it is implemented, only that a uniform API is provided to
> obtain the next RecordBatch with a known schema." I interpret this as: the
> `Scan` operator may produce multiple RecordBatches, and each of them should
> have a known schema, but the next batch's schema could be different from the
> previous batch's. Is this understanding correct?
>
> And I read arrow's source code, in `exec_plan.h`:
> ```
> class ARROW_EXPORT ExecNode {
>   ...
>   /// The datatypes for batches produced by this node
>   const std::shared_ptr<Schema>& output_schema() const { return output_schema_; }
>   ...
> };
> ```
> It looks like each `ExecNode` needs to provide an `output_schema`. Is the
> `output_schema` allowed to change during the ExecNode's execution? If I
> would like to implement an execution node that produces multiple batches
> with different schemas, is this feasible within the Arrow C++ compute
> engine API framework? Thanks.
>
> Regards,
> Yue
