> This schema evolution work has not been done in the scanner yet so if
> you are interested you might want to look at ARROW-11003[1].

Thanks. I will keep an eye on it.

> Or did you have a different use case for multiple schemas in mind that
> doesn't quite fit the "promote to common schema" case?

I think my use case is not exactly the same as the dataset scanner use
case. In my case, the dataset could be a set of JSON files (JSON Lines
actually, https://jsonlines.org) that are essentially structured logs from
other applications (like this example,
https://stackify.com/what-is-structured-logging-and-why-developers-need-it/),
so the schema may vary a lot from app to app. Currently we have a
home-brewed compute engine that uses Arrow as its data format, and I am
researching whether we could leverage Arrow's new C++ compute engine,
probably by writing some custom execution operators to make it work.
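
To make this concrete, here is a minimal sketch of how such logs could be
read (the file names and fields are hypothetical; this assumes Arrow's
arrow::json::TableReader, which infers a schema per file, so logs from two
apps can yield different schemas):

```
// Minimal sketch: read a JSON Lines log file with Arrow's JSON reader.
// The file names below are hypothetical; the schema is inferred per
// file, so app_a.jsonl and app_b.jsonl may produce different schemas.
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/json/api.h>

#include <iostream>

arrow::Result<std::shared_ptr<arrow::Table>> ReadLogFile(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::json::TableReader::Make(arrow::default_memory_pool(), input,
                                     arrow::json::ReadOptions::Defaults(),
                                     arrow::json::ParseOptions::Defaults()));
  return reader->Read();
}

int main() {
  // Each file's schema is inferred independently from its contents.
  for (const char* path : {"app_a.jsonl", "app_b.jsonl"}) {
    auto table = ReadLogFile(path).ValueOrDie();
    std::cout << path << ":\n" << table->schema()->ToString() << std::endl;
  }
  return 0;
}
```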

On Mon, Nov 15, 2021 at 4:12 PM Weston Pace <weston.p...@gmail.com> wrote:

> > "The consumer of a Scan does not need to know how it is implemented,
> > only that a uniform API is provided to obtain the next RecordBatch
> > with a known schema.", I interpret this as `Scan` operator may
> > produce multiple RecordBatches, and each of them should have a known
> > schema, but next batch's schema could be different with the previous
> > batch. Is this understanding correct?
>
> No.  The compute/query engine expects that all batches share the same
> schema.
>
> However, it is often the case that files in a dataset have different
> schemas.  One way to handle this is to "promote up" to a common data
> type for a given column.  For example, if a column is int32 in some
> files and int64 in others, promote all of them to int64.  "Schema
> evolution" is one term I've seen thrown around for this process.
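>
> As a concrete illustration, here is a minimal sketch of that promotion
> using arrow::compute::Cast (the column name "value" and the target
> type are just assumptions for the example, not anything the engine
> prescribes):
>
> ```
> // Minimal sketch: promote a hypothetical int32 "value" column to
> // int64 so the batch matches a promoted common schema.
> #include <arrow/api.h>
> #include <arrow/compute/api.h>
>
> arrow::Result<std::shared_ptr<arrow::RecordBatch>> PromoteValueColumn(
>     const std::shared_ptr<arrow::RecordBatch>& batch) {
>   // Assumes the "value" column exists in this batch.
>   int i = batch->schema()->GetFieldIndex("value");
>   // Cast int32 -> int64 (the "promote up" step).
>   ARROW_ASSIGN_OR_RAISE(
>       auto promoted,
>       arrow::compute::Cast(*batch->column(i), arrow::int64()));
>   // Rebuild the batch with the promoted column and an updated schema.
>   auto columns = batch->columns();
>   columns[i] = promoted;
>   ARROW_ASSIGN_OR_RAISE(
>       auto schema,
>       batch->schema()->SetField(i, arrow::field("value", arrow::int64())));
>   return arrow::RecordBatch::Make(schema, batch->num_rows(), columns);
> }
> ```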
>
> I believe the plan is to eventually have this promotion happen during
> the scanning process.  So the scanner has to deal with fragments that
> may have differing schemas, but by the time the data reaches the query
> engine the record batches all share a consistent common schema.
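>
> For the simpler case where files differ only in which columns they
> have (rather than in conflicting types), a common schema can already
> be derived with arrow::UnifySchemas.  A minimal sketch (the field
> names are hypothetical; note that with default merge options
> UnifySchemas merges fields by name but does not perform numeric
> promotion such as int32 -> int64, so that part still needs a cast):
>
> ```
> // Minimal sketch: unify two hypothetical per-file schemas into one
> // common schema. Fields present in only some schemas are kept;
> // conflicting types (e.g. int32 vs int64) would make this fail.
> #include <arrow/api.h>
>
> arrow::Result<std::shared_ptr<arrow::Schema>> CommonSchema() {
>   auto a = arrow::schema({arrow::field("ts", arrow::utf8()),
>                           arrow::field("level", arrow::utf8())});
>   auto b = arrow::schema({arrow::field("ts", arrow::utf8()),
>                           arrow::field("user_id", arrow::int64())});
>   // The unified result contains ts, level, and user_id.
>   return arrow::UnifySchemas({a, b});
> }
> ```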
>
> This schema evolution work has not been done in the scanner yet so if
> you are interested you might want to look at ARROW-11003[1].  Or did
> you have a different use case for multiple schemas in mind that
> doesn't quite fit the "promote to common schema" case?
>
> [1] https://issues.apache.org/jira/browse/ARROW-11003
>
>
> On Sun, Nov 14, 2021 at 7:03 PM Yue Ni <niyue....@gmail.com> wrote:
> >
> > Hi there,
> >
> > I am evaluating the Apache Arrow C++ compute engine for my project, and
> > wonder what the schema assumption is for execution operators in the
> > compute engine.
> >
> > In my use case, multiple record batches for computation may have
> > different schemas. I read the Apache Arrow Query Engine for C++ design
> > doc (
> > https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4
> > ), and for the `Scan` operator, it is said "The consumer of a Scan does
> > not need to know how it is implemented, only that a uniform API is
> > provided to obtain the next RecordBatch with a known schema." I
> > interpret this as: the `Scan` operator may produce multiple
> > RecordBatches, and each of them should have a known schema, but the
> > next batch's schema could be different from the previous batch's. Is
> > this understanding correct?
> >
> > And I read Arrow's source code; in `exec_plan.h`:
> > ```
> > class ARROW_EXPORT ExecNode {
> >   ...
> >   /// The datatypes for batches produced by this node
> >   const std::shared_ptr<Schema>& output_schema() const {
> >     return output_schema_;
> >   }
> >   ...
> > };
> > ```
> > It looks like each `ExecNode` needs to provide an `output_schema`. Is
> > it allowed to return an `output_schema` that may change during the
> > ExecNode's execution? If I would like to implement an execution node
> > that produces multiple batches that may have different schemas, is
> > this feasible within the Arrow C++ compute engine API framework?
> > Thanks.
> >
> > Regards,
> > Yue
>
