Re: [DataFusion] [Discuss] Output Schema for queries with multiple relations

Andrew Lamb Wed, 19 May 2021 06:32:43 -0700

I read the invariant doc and the output schema doc again, and both
make sense to me. Thanks, QP.


On Wed, May 19, 2021 at 3:09 AM QP Hou <q...@scribd.com.invalid> wrote:

> Hi all,
>
> Following up on this.
>
> We have updated the output schema doc [1] and the invariant doc [2]
> for the final round of review.
>
> In the updated invariant doc, the main change we introduced compared
> to the previous version is as follows:
>
> We now enforce strict schema equality in all plan optimization
> invariants. As a result, optimizations like reordering join sides need
> to add an extra projection to maintain the schema field order. We
> believe the extra projection should have minimal overhead. The upside
> is that it keeps the field order semantics simple and easy for end
> users to understand.
>
> In the draft PR [3], Andy raised a concern that by referring to
> physical columns using indices instead of names, it might limit our
> ability to support schemaless data sources in the future. After
> thinking more on this, I think the current design can be extended to
> support schemaless data sources in the future by going one of the
> following two routes:
>
> * Make the index field in physical columns optional. During physical
> plan execution, we could fall back to the name field for schemaless
> data sources while continuing to use indices for data sources that
> have static schemas.
> * Introduce a new type of physical column expression to refer to
> columns in schemaless data sources.
>
> I intentionally left out discussion of schemaless data sources in the
> updated invariant doc to keep the scope manageable for smaller
> incremental deliverables and ease of review. My main goal here is to
> make sure whatever design change we propose for multi-relations
> support won't prevent us from supporting schemaless use-cases in the
> future.
>
> If you have any feedback on or concerns about the current design, now
> is a good time to raise them :)
>
> I am aiming to get the implementation PR out of draft mode in a week or so.
>
> [1]: https://docs.google.com/document/d/1uviWavwEGD3qxwMk2AGkOgp6ENrvKGiMWQhHNbqPwhg/
> [2]: https://docs.google.com/document/d/1dbK-3eaTHlzZcHzpTk1h-LA3b7dcxsVBcoZeVKYIPwI/
> [3]: https://github.com/apache/arrow-datafusion/pull/55#issuecomment-829296665
>
> Thanks,
> QP Hou
>
> On Wed, May 5, 2021 at 3:52 AM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > I wanted to bring some additional attention to some discussion occurring on
> > a PR [1], specifically the proposal of how to construct output field names
> > from queries that have multiple relations (that may have the same input
> > field).
> >
> > The documents are:
> > * Document for output schema field name semantics with examples: [2]
> > * Proposed change to @jorgecarleitao 's invariant doc [3]
> > * Updated invariant doc with proposed changes applied [4]
> >
> > Please comment on the PR / in the docs if you are interested.
> >
> > Andrew
> >
> > [1] https://github.com/apache/arrow-datafusion/pull/55#issuecomment-831405269
> > [2] https://docs.google.com/document/d/1uviWavwEGD3qxwMk2AGkOgp6ENrvKGiMWQhHNbqPwhg/edit?usp=sharing
> > [3] https://docs.google.com/document/d/158gbfDp8pcakfriT2l7dHChwJB43_RV7lcWfxEC73ng/edit?usp=sharing
> > [4] https://docs.google.com/document/d/1dbK-3eaTHlzZcHzpTk1h-LA3b7dcxsVBcoZeVKYIPwI/edit?usp=sharing
>
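
For readers following the strict schema equality point in the quoted message, here is a rough sketch of why reordering join sides needs an extra projection. The `Field`, `Plan`, and `swap_join_sides` names are hypothetical, simplified stand-ins chosen for illustration and are not DataFusion's actual types or optimizer rules. The sketch assumes a join's output schema is the left fields followed by the right fields.

```rust
// Hypothetical, simplified stand-ins for illustration only;
// these are NOT DataFusion's actual plan or schema types.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    relation: String,
    name: String,
}

type Schema = Vec<Field>;

#[derive(Debug, Clone)]
enum Plan {
    Scan { schema: Schema },
    Join { left: Box<Plan>, right: Box<Plan> },
    // Output column i is input column `indices[i]`.
    Projection { input: Box<Plan>, indices: Vec<usize> },
}

impl Plan {
    fn schema(&self) -> Schema {
        match self {
            Plan::Scan { schema } => schema.clone(),
            // Assumption: join output is left fields followed by right fields.
            Plan::Join { left, right } => {
                let mut fields = left.schema();
                fields.extend(right.schema());
                fields
            }
            Plan::Projection { input, indices } => {
                let input_schema = input.schema();
                indices.iter().map(|&i| input_schema[i].clone()).collect()
            }
        }
    }
}

/// Swap the two sides of a join, then wrap the result in a projection
/// that restores the original field order, so the schema is unchanged.
fn swap_join_sides(plan: Plan) -> Plan {
    match plan {
        Plan::Join { left, right } => {
            let left_len = left.schema().len();
            let right_len = right.schema().len();
            // After the swap the join emits [right fields, left fields].
            // Original left field i now sits at right_len + i;
            // original right field j now sits at j.
            let indices: Vec<usize> = (right_len..right_len + left_len)
                .chain(0..right_len)
                .collect();
            Plan::Projection {
                input: Box::new(Plan::Join { left: right, right: left }),
                indices,
            }
        }
        other => other,
    }
}

fn scan(relation: &str, names: &[&str]) -> Plan {
    Plan::Scan {
        schema: names
            .iter()
            .map(|n| Field { relation: relation.to_string(), name: n.to_string() })
            .collect(),
    }
}

fn main() {
    let original = Plan::Join {
        left: Box::new(scan("t1", &["id", "a"])),
        right: Box::new(scan("t2", &["id"])),
    };
    let before = original.schema();
    let optimized = swap_join_sides(original);
    // The invariant: the optimized plan has exactly the same schema.
    assert_eq!(before, optimized.schema());
    println!("schema preserved: {:?}", optimized.schema());
}
```

Without the wrapping projection, the swapped join would emit `t2.id` first, and a strict equality check against the original schema would fail.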

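The first route in the quoted message, making the index in a physical column optional, could look roughly like the sketch below. `PhysicalColumn`, `Row`, and `evaluate` are hypothetical names invented for this illustration; they are not DataFusion's API, and real execution would work on Arrow record batches rather than a single row of integers.

```rust
/// Hypothetical physical column reference, for illustration only;
/// this is NOT DataFusion's actual column expression type.
#[derive(Debug, Clone)]
struct PhysicalColumn {
    name: String,
    /// `Some(i)` for sources with a static schema,
    /// `None` for schemaless sources.
    index: Option<usize>,
}

/// Hypothetical row with named values, standing in for a record batch.
struct Row {
    names: Vec<String>,
    values: Vec<i64>,
}

impl PhysicalColumn {
    /// Resolve by index when one is known; otherwise fall back to a
    /// lookup by name. Returns `None` if the column cannot be found.
    fn evaluate(&self, row: &Row) -> Option<i64> {
        match self.index {
            Some(i) => row.values.get(i).copied(),
            None => row
                .names
                .iter()
                .position(|n| n == &self.name)
                .and_then(|i| row.values.get(i).copied()),
        }
    }
}

fn main() {
    let row = Row {
        names: vec!["a".to_string(), "b".to_string()],
        values: vec![10, 20],
    };

    // Static schema: the planner resolved the index up front.
    let by_index = PhysicalColumn { name: "b".to_string(), index: Some(1) };
    // Schemaless source: only the name is known at plan time.
    let by_name = PhysicalColumn { name: "b".to_string(), index: None };

    assert_eq!(by_index.evaluate(&row), Some(20));
    assert_eq!(by_name.evaluate(&row), Some(20));
    println!("both lookups resolved column b");
}
```

The second route would instead leave the index mandatory on the existing expression and add a separate name-only expression type for schemaless sources.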