> "The consumer of a Scan does not need to know how it is implemented, > only that a uniform API is provided to obtain the next RecordBatch > with a known schema.", I interpret this as `Scan` operator may > produce multiple RecordBatches, and each of them should have a known > schema, but next batch's schema could be different with the previous > batch. Is this understanding correct?
No. The compute/query engine expects that all batches share the same schema.

However, it is often the case that files in a dataset have different schemas. One way to handle this is to "promote up" to a common data type for a given column. For example, if a column is int32 in some files and int64 in others, promote it to int64 everywhere. "Schema evolution" is one term I've seen thrown around for this process.

I believe the plan is to eventually have this promotion happen during the scanning process. So the scanner has to deal with fragments that may have differing schemas, but by the time the data reaches the query engine the record batches all have the consistent common schema. This schema evolution work has not been done in the scanner yet, so if you are interested you might want to look at ARROW-11003 [1]. I've put a rough sketch of what the promotion step could look like after the quoted message below.

Or did you have a different use case for multiple schemas in mind that doesn't quite fit the "promote to common schema" case?

[1] https://issues.apache.org/jira/browse/ARROW-11003

On Sun, Nov 14, 2021 at 7:03 PM Yue Ni <niyue....@gmail.com> wrote:
>
> Hi there,
>
> I am evaluating the Apache Arrow C++ compute engine for my project, and
> wonder what the schema assumption is for execution operators in the
> compute engine.
>
> In my use case, multiple record batches for computation may have
> different schemas. I read the Apache Arrow Query Engine for C++ design
> doc (
> https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4),
> and for the `Scan` operator, it is said "The consumer of a Scan does
> not need to know how it is implemented, only that a uniform API is
> provided to obtain the next RecordBatch with a known schema." I
> interpret this as: the `Scan` operator may produce multiple
> RecordBatches, and each of them should have a known schema, but the
> next batch's schema could be different from the previous batch's. Is
> this understanding correct?
>
> I also read Arrow's source code; in `exec_plan.h`:
> ```
> class ARROW_EXPORT ExecNode {
>   ...
>   /// The datatypes for batches produced by this node
>   const std::shared_ptr<Schema>& output_schema() const { return output_schema_; }
>   ...
> ```
> It looks like each `ExecNode` needs to provide an `output_schema`. Is
> it allowed to return an `output_schema` that may change during the
> ExecNode's execution? If I would like to implement an execution node
> that produces multiple batches that may have different schemas, is this
> feasible within the Arrow C++ compute engine API framework? Thanks.
>
> Regards,
> Yue
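As promised above, here is roughly what the promotion step could look like in user code today. You can compute a common schema with arrow::UnifySchemas (though, as far as I know, its default merge options only reconcile things like nullability and metadata, not numeric widths like int32 -> int64; that widening is part of what ARROW-11003 covers), and then cast each batch to that schema with arrow::compute::Cast. This is just a sketch, not how the scanner will actually do it; the helper name `PromoteBatch` is made up for illustration, and it assumes the batch's columns already line up with the target schema by position:

```
#include <arrow/api.h>
#include <arrow/compute/api.h>

namespace cp = arrow::compute;

// Sketch only: cast every column of `batch` to the type of the
// corresponding field in `target`. Assumes columns match `target`
// positionally; a real implementation would also handle missing
// columns (e.g. fill them with nulls) and reorder columns by name.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> PromoteBatch(
    const std::shared_ptr<arrow::RecordBatch>& batch,
    const std::shared_ptr<arrow::Schema>& target) {
  std::vector<std::shared_ptr<arrow::Array>> columns;
  columns.reserve(target->num_fields());
  for (int i = 0; i < target->num_fields(); ++i) {
    ARROW_ASSIGN_OR_RAISE(
        arrow::Datum casted,
        cp::Cast(batch->column(i), target->field(i)->type()));
    columns.push_back(casted.make_array());
  }
  return arrow::RecordBatch::Make(target, batch->num_rows(),
                                  std::move(columns));
}
```

Until the scanner does this itself, you could run something like this between your data source and the plan, so that every batch the engine sees matches the single output_schema its ExecNode declares.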