Hi everyone, here are my notes from last night’s DSv2 sync. As usual, please reply with comments or corrections.
Sorry for the lack of notes for the last meeting. I lost them when my laptop battery died.

rb

*Topics*:

- Open pull requests
  - PR #24233: Remove SaveMode (https://github.com/apache/spark/pull/24233)
  - PR #24117: Add public transform API (https://github.com/apache/spark/pull/24117)
  - PR #24246: Add TableCatalog API (https://github.com/apache/spark/pull/24246)
- Add a hack to special-case file sources?
- Partition transforms and tables

*Discussion*:

- PR #24233: Remove SaveMode
  - Ryan: This PR has a hack that passes SaveMode directly to file sources. Because the behavior of v1 is unknown, this should not be included in a release.
  - Wenchen: This is needed in order to use v2 file sources. The intent is to migrate the file source to provide a reference implementation. Without the hack, v2 sources can’t be used (by default) because we know it [v2] is broken [i.e., can’t run CTAS].
  - Ryan: v2 isn’t broken. The only problem is path-based tables, because we don’t know how they should behave [Gengliang is looking into this]. Catalog tables aren’t broken, so we should focus on getting that support done.
  - Russell/Matt: Is this needed for the current tests to pass?
  - Wenchen: No, tests run using v1.
  - Russell: +1 for not including the hack.
  - Ryan: Also, tests for the new plans will require an in-memory source that will be a cleaner reference implementation.
- PR #24117: parser rules and code organization
  - Ryan: Reynold suggests using expression instead of transform in the parser rules to get better error messages. By handling expressions, the AstBuilder can throw better error messages than the generated parser code. The parser generates messages like “expected identifier”, but the AstBuilder can do better, like “expression is not a transformation”. (See the sketch below.)
  - Ryan: This is a good idea, but I’d like to avoid blocking this work and do it in a follow-up if the community agrees. This PR is blocking the table catalog, which is blocking the new table resolution rules, new logical plans, etc. Delaying just to improve the error message isn’t a good use of time.
  - Wenchen: Agreed to update the parser errors later.
  - Matt: How long will it take to fix it?
  - Ryan: Even if it only takes a few hours to build and test, this will reasonably push the PR out by another few days.
  - Wenchen and Dilip pointed out other rules that are matched by expression, like sub-queries, so this isn’t a simple fix.
  - Ryan: Along the same lines, moving classes in this PR will cause conflicts and delay. We are also moving classes into catalyst. Can we fix the package organization when it is more convenient?
  - Wenchen: Agreed to update the organization later.
  - Consensus was to move forward without fixing organization or parsing, to unblock dependent changes like #24246.
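To illustrate the parser-error point above, here is a minimal sketch of the idea. All names in it are hypothetical stand-ins, not Spark’s actual grammar or AstBuilder classes. If the grammar rule matches any expression, the builder can inspect what it got and report a targeted error instead of the generated parser’s generic one:

```scala
// Minimal sketch; these names are hypothetical, not Spark's real grammar
// rules or AstBuilder code.
sealed trait ParsedExpr
case class ColumnRef(name: String) extends ParsedExpr
case class FunctionCall(name: String, args: Seq[ParsedExpr]) extends ParsedExpr
case class ScalarSubquery(sql: String) extends ParsedExpr

sealed trait Transform
case class IdentityTransform(column: String) extends Transform
case class ApplyTransform(name: String, args: Seq[ParsedExpr]) extends Transform

// Because the rule accepted a general expression, the builder knows exactly
// what it got and can explain the rejection, instead of the generated
// parser's "expected identifier".
def toTransform(expr: ParsedExpr): Transform = expr match {
  case ColumnRef(name)          => IdentityTransform(name)
  case FunctionCall(name, args) => ApplyTransform(name, args)
  case other =>
    throw new IllegalArgumentException(
      s"Expression is not a transformation: $other")
}
```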
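And for context on the transforms discussion that follows, this is roughly the shape of the public expression API as I read it from #24117. It is a simplified sketch; the PR has the real definitions, and names and details here may differ:

```scala
import org.apache.spark.sql.types.DataType

// Simplified sketch of the API shape proposed in #24117; see the PR for the
// actual interfaces.
trait Expression

// A literal wraps a value in Spark's internal representation (see the
// Literal/InternalRow discussion below).
trait Literal[T] extends Expression {
  def value: T
  def dataType: DataType
}

// A column reference by (possibly nested) field names.
trait NamedReference extends Expression {
  def fieldNames: Array[String]
}

// A transform is a named function applied to arguments. Only Transform
// carries references, keeping Literal as simple as possible; references
// could be promoted to Expression later, but not removed once added.
trait Transform extends Expression {
  def name: String
  def references: Array[NamedReference]
  def arguments: Array[Expression]
}
```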
- PR #24117: transforms
  - Wenchen: Why not base Expression on the internal Expression class? For example, add references to Expression, not Transform.
  - Ryan: The intent is to be as simple as possible to start with. Literal doesn’t use references, and we can always promote the method to Expression later, but not the other way around.
  - Wenchen: What values are wrapped by Literal? InternalRow?
  - Ryan: The internal representations.
  - Matt: We would need to be careful about changing the internal representation, then.
  - Ryan: It is already difficult to change an internal representation, even without this. And these APIs already expose InternalRow, so we aren’t creating new problems.
  - Matt: Yes, this isn’t a new problem.
  - Ryan: I’ll follow up on the PR with documentation for the representations used by InternalRow.
  - Ryan: There is some confusion on the PR about what transforms are. Reynold asked what dateHour returns. Part of the problem is that “dateHour” is a confusing name; it should be “hourly” or “hours”. The name includes “date” to signal that it isn’t an hour in [0, 23]; it is an hour-granularity partition transform.
  - Ryan: These transforms are not concrete functions. There are many functions that can partition timestamps into hour or day granularity ranges. The concrete transform function doesn’t matter for configuring tables.
  - Ryan: For example, the “bucket” function is conceptually the same, but has a different concrete implementation in Spark, Hive, Iceberg, and Cassandra tables.
  - Russell: Commented on the craziness of Cassandra’s bucket transform.
  - Ryan: Spark needs to work with all of these functions. When Spark needs a concrete implementation to run a bucket join, it should look up the “bucket” function from the table’s catalog using the FunctionCatalog interface. That way, it can prepare data for the other side of a bucketed join to work with bucketed data from any source. (See the sketch at the end of these notes.)
- PR #24246: Add TableCatalog API
  - Ryan: This PR is based on #24117, but please start reviewing it now to make this go faster.

*Attendees*:

Ryan Blue
John Zhuge
Russell Spitzer
Gengliang Wang
Yuanjian Li
Matt Cheah
Yifei Huang
Felix Cheung
Dilip Biswal
Wenchen Fan

--
Ryan Blue
Software Engineer
Netflix
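Sketch referenced in the bucketed-join discussion above. The FunctionCatalog interface is still a proposal, so this is only a rough illustration of the idea, and every name below is hypothetical:

```scala
// Rough illustration only: FunctionCatalog is a proposal, and all of these
// names are hypothetical.
trait BoundFunction {
  // Apply the function to argument values,
  // e.g. (numBuckets, value) -> bucket id.
  def apply(args: Seq[Any]): Any
}

trait FunctionCatalog {
  // Look up a named function implementation from a table's catalog.
  def loadFunction(name: String): BoundFunction
}

// To run a bucketed join, Spark would resolve the table's own "bucket"
// implementation instead of assuming a single hash function, so it can
// repartition the other side of the join to line up with Spark, Hive,
// Iceberg, or Cassandra bucketing.
def bucketIdFor(catalog: FunctionCatalog, numBuckets: Int, value: Any): Any = {
  val bucket = catalog.loadFunction("bucket")
  bucket.apply(Seq(numBuckets, value))
}
```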