Hi Ryan,

Thanks for summarizing and sending out the notes! I've created the JIRA ticket to add v2 statements for all the commands that need to resolve a table: https://issues.apache.org/jira/browse/SPARK-29481
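For context, here is a minimal sketch of the resolution gap those statements would close, assuming a v2 catalog plugin registered as "testcat" (the catalog name, implementation class, and table names below are illustrative, not from the notes):

```scala
// Minimal sketch, assuming a v2 catalog plugin registered under the name
// "testcat"; com.example.TestCatalog and the table names are illustrative.
spark.conf.set("spark.sql.catalog.testcat", "com.example.TestCatalog")

// Commands that already have v2 statements resolve multi-part identifiers
// against the named v2 catalog.
spark.sql("CREATE TABLE testcat.ns.t (id BIGINT) USING foo")

// Per the notes below, commands without a v2 statement (e.g. REFRESH TABLE)
// parse only a v1 identifier, so "testcat.t" is read as database "testcat" /
// table "t" in the session catalog rather than table "t" in catalog "testcat".
spark.sql("REFRESH TABLE testcat.t")
```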
Contributions to it are appreciated!

Thanks,
Wenchen

On Fri, Oct 11, 2019 at 7:05 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Here are my notes from last week's DSv2 sync.
>
> *Attendees*:
>
> Ryan Blue
> Terry Kim
> Wenchen Fan
>
> *Topics*:
>
> - SchemaPruning only supports Parquet and ORC?
> - Out of order optimizer rules
> - 3.0 work
>   - Rename session catalog to spark_catalog
>   - Finish TableProvider update to avoid another API change: pass all table config from the metastore
>   - Catalog behavior fix: https://issues.apache.org/jira/browse/SPARK-29014
>   - Stats push-down optimization: https://github.com/apache/spark/pull/25955
>   - DataFrameWriter v1/v2 compatibility progress
> - Open PRs
>   - Update identifier resolution and table resolution: https://github.com/apache/spark/pull/25747
>   - Expose SerializableConfiguration: https://github.com/apache/spark/pull/26005
>   - Early DSv2 pushdown: https://github.com/apache/spark/pull/25955
>
> *Discussion*:
>
> - Update identifier and table resolution
>   - Wenchen: This will not handle SPARK-29014; it is a pure refactor.
>   - Ryan: I think this should separate the v2 rules from the v1 fallback, to keep table and identifier resolution separate. The only time that table resolution needs to be done at the same time is for v1 fallback.
>   - This was merged last week.
> - Update to use spark_catalog
>   - Wenchen: this will be a separate PR.
>   - Now open: https://github.com/apache/spark/pull/26071
> - Early DSv2 pushdown
>   - Ryan: this depends on fixing a few more tests. To validate that there are no calls to computeStats with the DSv2 relation, I've temporarily removed the method. Other than a few remaining test failures where the old relation was expected, it looks like there are no uses of computeStats before early pushdown in the optimizer.
>   - Wenchen: agreed that the batch was in the correct place in the optimizer.
>   - Ryan: once tests are passing, will add the computeStats implementation back, using Utils.isTesting so that calls before early pushdown fail during testing but not at runtime.
> - Wenchen: when using v2, there is no way to configure custom options for a JDBC table. For v1, the table was created and stored in the session catalog, at which point Spark-specific properties like parallelism could be stored. In v2, the catalog is the source of truth, so tables don't get created in the same way; options are only passed in a create statement.
>   - Ryan: this could be fixed by allowing users to pass options as table properties. We mix the two today, but if options were stored as table properties under an "options." prefix, then you could use SET TBLPROPERTIES to get around this. That's also better for compatibility. I'll open a PR for this.
>   - Ryan: this could also be solved by adding an OPTIONS clause or hint to SELECT.
> - Wenchen: There are commands without v2 statements. We should add v2 statements to reject non-v1 uses.
>   - Ryan: Doesn't the parser only accept up to two-part identifiers for these? That would handle the majority of cases.
>   - Wenchen: Yes, but there is still a problem for one-part identifiers in v2 catalogs, like catalog.table. Commands that don't support v2 will resolve catalog.table as a database and table in the v1 catalog.
>   - Ryan: Sounds like a good plan to update the parser and add statements for these. Do we have a list of commands to update?
>   - Wenchen: REFRESH TABLE, ANALYZE TABLE, ALTER TABLE PARTITION, etc. Will open an umbrella JIRA with a list.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
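A rough sketch of the "options." prefix idea from the notes above, assuming options are simply stored as prefixed table properties that the source reads back; the catalog/table names, the prefix, and the fetchsize key are illustrative, not a committed API:

```scala
// Sketch of the idea from the discussion: store source options as prefixed
// table properties so they can be changed without recreating the table.
// "testcat", "db.events", and "options.fetchsize" are illustrative names.
spark.sql("""
  CREATE TABLE testcat.db.events (id BIGINT, payload STRING)
  USING jdbc
  TBLPROPERTIES ('options.fetchsize' = '1000')
""")

// With options kept in table properties, SET TBLPROPERTIES can adjust a
// Spark-side option such as fetch size after the table already exists.
spark.sql("""
  ALTER TABLE testcat.db.events
  SET TBLPROPERTIES ('options.fetchsize' = '5000')
""")
```

Ryan's alternative above (an OPTIONS clause or hint on SELECT) would instead scope options to a single query rather than persisting them with the table.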