I read the invariants doc and the field output doc again, and I think they both make sense to me. Thanks, QP.

On Wed, May 19, 2021 at 3:09 AM QP Hou <q...@scribd.com.invalid> wrote:

> Hi all,
>
> Following up on this.
>
> We have updated the output schema doc [1] and the invariant doc [2]
> for the final round of review.
>
> In the updated invariant doc, the main change we introduced compared
> to the previous version is as follows:
>
> We now enforce strict schema equality in all plan optimization
> invariants. As a result, optimizations like reordering join sides need
> to add an extra projection to maintain the schema field order. We
> believe the extra projection should have minimal overhead. The upside
> is that it will help keep the field order semantics simple and easy
> for end users to understand.
>
> In the draft PR [3], Andy raised a concern that by referring to
> physical columns using indices instead of names, it might limit our
> ability to support schemaless data sources in the future. After
> thinking more on this, I think the current design can be extended to
> support schemaless data sources in the future by going one of the
> following two routes:
>
> * Make the index field in physical columns optional. During physical
>   plan execution, we could fall back to the name field for schemaless
>   data sources while continuing to use indices for data sources that
>   have static schemas.
> * Introduce a new type of physical column expression to refer to
>   columns in schemaless data sources.
>
> I intentionally left out discussion of schemaless data sources in the
> updated invariant doc to keep the scope manageable for smaller
> incremental deliverables and ease of review. My main goal here is to
> make sure whatever design change we propose for multi-relations
> support won't prevent us from supporting schemaless use-cases in the
> future.
>
> If you have any feedback or concerns about the current design, now is
> a good time to raise them :)
>
> I am aiming to get the implementation PR out of draft mode in a week
> or so.
>
> [1]: https://docs.google.com/document/d/1uviWavwEGD3qxwMk2AGkOgp6ENrvKGiMWQhHNbqPwhg/
> [2]: https://docs.google.com/document/d/1dbK-3eaTHlzZcHzpTk1h-LA3b7dcxsVBcoZeVKYIPwI/
> [3]: https://github.com/apache/arrow-datafusion/pull/55#issuecomment-829296665
>
> Thanks,
> QP Hou
>
> On Wed, May 5, 2021 at 3:52 AM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > I wanted to bring some additional attention to some discussion
> > occurring on a PR [1], specifically the proposal of how to construct
> > output field names from queries that have multiple relations (that
> > may have the same input field).
> >
> > The documents are:
> > * Document for output schema field name semantics with examples: [2]
> > * Proposed change to @jorgecarleitao 's invariant doc [3]
> > * Updated invariant doc with proposed changes applied [4]
> >
> > Please comment on the PR / in the docs if you are interested.
> >
> > Andrew
> >
> > [1] https://github.com/apache/arrow-datafusion/pull/55#issuecomment-831405269
> > [2] https://docs.google.com/document/d/1uviWavwEGD3qxwMk2AGkOgp6ENrvKGiMWQhHNbqPwhg/edit?usp=sharing
> > [3] https://docs.google.com/document/d/158gbfDp8pcakfriT2l7dHChwJB43_RV7lcWfxEC73ng/edit?usp=sharing
> > [4] https://docs.google.com/document/d/1dbK-3eaTHlzZcHzpTk1h-LA3b7dcxsVBcoZeVKYIPwI/edit?usp=sharing
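
For anyone skimming the thread, the first route QP mentions (an optional index with a fallback to the name) could be sketched roughly as below. This is only an illustration in Python, not DataFusion code: `PhysicalColumn` and `resolve` are hypothetical names, and modelling a batch as a plain list of field names is an assumption made to keep the sketch small.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PhysicalColumn:
    """Hypothetical physical column reference; not DataFusion's actual type."""
    name: str
    index: Optional[int] = None  # None models a column from a schemaless source


def resolve(column: PhysicalColumn, field_names: Sequence[str]) -> int:
    """Return the position of `column` among a batch's field names."""
    if column.index is not None:
        # Static schema: use the index that was fixed at planning time.
        return column.index
    # Schemaless source: fall back to a name lookup at execution time.
    # list.index raises ValueError if the name is not present.
    return list(field_names).index(column.name)


static_col = PhysicalColumn(name="id", index=0)
dynamic_col = PhysicalColumn(name="payload")

print(resolve(static_col, ["id", "payload"]))   # 0
print(resolve(dynamic_col, ["id", "payload"]))  # 1
```

In this sketch, columns from static schemas keep the index path and never pay for a name lookup, while schemaless sources resolve names per batch.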