Hi Ryan,

Thanks for summarizing and sending out the notes! I've created the JIRA ticket to add v2 statements for all the commands that need to resolve a table: https://issues.apache.org/jira/browse/SPARK-29481
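For context, here is a minimal sketch of the resolution gap those statements would close, assuming a v2 catalog plugin registered as "testcat" (the catalog name, implementation class, and table names below are illustrative, not from the notes):

```scala
// Minimal sketch, assuming a v2 catalog plugin registered under the name
// "testcat"; com.example.TestCatalog and the table names are illustrative.
spark.conf.set("spark.sql.catalog.testcat", "com.example.TestCatalog")

// Commands that already have v2 statements resolve multi-part identifiers
// against the named v2 catalog.
spark.sql("CREATE TABLE testcat.ns.t (id BIGINT) USING foo")

// Per the notes below, commands without a v2 statement (e.g. REFRESH TABLE)
// parse only a v1 identifier, so "testcat.t" is read as database "testcat" /
// table "t" in the session catalog rather than table "t" in catalog "testcat".
spark.sql("REFRESH TABLE testcat.t")
```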
Contributions to it are appreciated!

Thanks,
Wenchen

On Fri, Oct 11, 2019 at 7:05 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Here are my notes from last week's DSv2 sync.
>
> *Attendees*:
>
> Ryan Blue
> Terry Kim
> Wenchen Fan
>
> *Topics*:
>
> - SchemaPruning only supports Parquet and ORC?
> - Out of order optimizer rules
> - 3.0 work
>   - Rename session catalog to spark_catalog
>   - Finish TableProvider update to avoid another API change: pass all table config from the metastore
>   - Catalog behavior fix: https://issues.apache.org/jira/browse/SPARK-29014
>   - Stats push-down optimization: https://github.com/apache/spark/pull/25955
>   - DataFrameWriter v1/v2 compatibility progress
> - Open PRs
>   - Update identifier resolution and table resolution: https://github.com/apache/spark/pull/25747
>   - Expose SerializableConfiguration: https://github.com/apache/spark/pull/26005
>   - Early DSv2 pushdown: https://github.com/apache/spark/pull/25955
>
> *Discussion*:
>
> - Update identifier and table resolution
>   - Wenchen: This will not handle SPARK-29014; it is a pure refactor.
>   - Ryan: I think this should separate the v2 rules from the v1 fallback, to keep table and identifier resolution separate. The only time that table resolution needs to be done at the same time is for v1 fallback.
>   - This was merged last week.
> - Update to use spark_catalog
>   - Wenchen: this will be a separate PR.
>   - Now open: https://github.com/apache/spark/pull/26071
> - Early DSv2 pushdown
>   - Ryan: this depends on fixing a few more tests. To validate that there are no calls to computeStats with the DSv2 relation, I've temporarily removed the method. Other than a few remaining test failures where the old relation was expected, it looks like there are no uses of computeStats before early pushdown in the optimizer.
>   - Wenchen: agreed that the batch was in the correct place in the optimizer.
>   - Ryan: once tests are passing, will add the computeStats implementation back, using Utils.isTesting so that calls before early pushdown fail during testing but not at runtime.
> - Wenchen: when using v2, there is no way to configure custom options for a JDBC table. For v1, the table was created and stored in the session catalog, at which point Spark-specific properties like parallelism could be stored. In v2, the catalog is the source of truth, so tables don't get created in the same way; options are only passed in a create statement.
>   - Ryan: this could be fixed by allowing users to pass options as table properties. We mix the two today, but if options were stored as table properties under an "options." prefix, then you could use SET TBLPROPERTIES to get around this. That's also better for compatibility. I'll open a PR for this.
>   - Ryan: this could also be solved by adding an OPTIONS clause or hint to SELECT.
> - Wenchen: There are commands without v2 statements. We should add v2 statements to reject non-v1 uses.
>   - Ryan: Doesn't the parser only accept up to two-part identifiers for these? That would handle the majority of cases.
>   - Wenchen: Yes, but there is still a problem for one-part identifiers in v2 catalogs, like catalog.table. Commands that don't support v2 will resolve catalog.table as a database and table in the v1 catalog.
>   - Ryan: Sounds like a good plan to update the parser and add statements for these. Do we have a list of commands to update?
>   - Wenchen: REFRESH TABLE, ANALYZE TABLE, ALTER TABLE PARTITION, etc. Will open an umbrella JIRA with a list.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
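A rough sketch of the "options." prefix idea from the notes above, assuming options are simply stored as prefixed table properties that the source reads back; the catalog/table names, the prefix, and the fetchsize key are illustrative, not a committed API:

```scala
// Sketch of the idea from the discussion: store source options as prefixed
// table properties so they can be changed without recreating the table.
// "testcat", "db.events", and "options.fetchsize" are illustrative names.
spark.sql("""
  CREATE TABLE testcat.db.events (id BIGINT, payload STRING)
  USING jdbc
  TBLPROPERTIES ('options.fetchsize' = '1000')
""")

// With options kept in table properties, SET TBLPROPERTIES can adjust a
// Spark-side option such as fetch size after the table already exists.
spark.sql("""
  ALTER TABLE testcat.db.events
  SET TBLPROPERTIES ('options.fetchsize' = '5000')
""")
```

Ryan's alternative above (an OPTIONS clause or hint on SELECT) would instead scope options to a single query rather than persisting them with the table.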