Hi everyone, here are my notes from last night’s DSv2 sync. As usual, please reply with comments or corrections.
Sorry for the lack of notes for the last meeting. I lost them when my laptop battery died.

rb

*Topics*:

- Open pull requests
  - PR #24233: Remove SaveMode (https://github.com/apache/spark/pull/24233)
  - PR #24117: Add public transform API (https://github.com/apache/spark/pull/24117)
  - PR #24246: Add TableCatalog API (https://github.com/apache/spark/pull/24246)
- Add a hack to special-case file sources?
- Partition transforms and tables

*Discussion*:

- PR #24233: Remove SaveMode
  - Ryan: This PR has a hack that passes SaveMode directly to file sources. Because the behavior of v1 is unknown, this should not be included in a release.
  - Wenchen: This is needed in order to use v2 file sources. The intent is to migrate the file source to provide a reference implementation. Without the hack, v2 sources can’t be used (by default) because we know it [v2] is broken [i.e., can’t run CTAS].
  - Ryan: v2 isn’t broken. The only problem is path-based tables, because we don’t know how they should behave [Gengliang is looking into this]. Catalog tables aren’t broken, so we should focus on getting that support done.
  - Russell/Matt: Is this needed for the current tests to pass?
  - Wenchen: No, tests run using v1.
  - Russell: +1 for not including the hack.
  - Ryan: Also, tests for the new plans will require an in-memory source that will be a cleaner reference implementation.
- PR #24117: parser rules and code organization
  - Ryan: Reynold suggests using expression instead of transform in the parser rules to get better error messages. By handling expressions, the AstBuilder can throw better error messages than the generated parser code. The parser generates messages like “expected identifier”, but the AstBuilder can do better, like “expression is not a transformation”. (See the sketch below.)
  - Ryan: This is a good idea, but I’d like to avoid blocking this work and do it in a follow-up if the community agrees. This PR is blocking the table catalog, which is blocking the new table resolution rules, new logical plans, etc. Delaying just to improve the error message isn’t a good use of time.
  - Wenchen: Agreed to update the parser errors later.
  - Matt: How long will it take to fix it?
  - Ryan: Even if it only takes a few hours to build and test, this will reasonably push the PR out by another few days.
  - Wenchen and Dilip pointed out other rules that are matched by expression, like sub-queries, so this isn’t a simple fix.
  - Ryan: Along the same lines, moving classes in this PR will cause conflicts and delay. We are also moving classes into catalyst. Can we fix the package organization when it is more convenient?
  - Wenchen: Agreed to update the organization later.
  - Consensus was to move forward without fixing organization or parsing, to unblock dependent changes like #24246.
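To illustrate the parser-error point above, here is a minimal sketch of the idea. All names in it are hypothetical stand-ins, not Spark’s actual grammar or AstBuilder classes. If the grammar rule matches any expression, the builder can inspect what it got and report a targeted error instead of the generated parser’s generic one:

```scala
// Minimal sketch; these names are hypothetical, not Spark's real grammar
// rules or AstBuilder code.
sealed trait ParsedExpr
case class ColumnRef(name: String) extends ParsedExpr
case class FunctionCall(name: String, args: Seq[ParsedExpr]) extends ParsedExpr
case class ScalarSubquery(sql: String) extends ParsedExpr

sealed trait Transform
case class IdentityTransform(column: String) extends Transform
case class ApplyTransform(name: String, args: Seq[ParsedExpr]) extends Transform

// Because the rule accepted a general expression, the builder knows exactly
// what it got and can explain the rejection, instead of the generated
// parser's "expected identifier".
def toTransform(expr: ParsedExpr): Transform = expr match {
  case ColumnRef(name)          => IdentityTransform(name)
  case FunctionCall(name, args) => ApplyTransform(name, args)
  case other =>
    throw new IllegalArgumentException(
      s"Expression is not a transformation: $other")
}
```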
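And for context on the transforms discussion that follows, this is roughly the shape of the public expression API as I read it from #24117. It is a simplified sketch; the PR has the real definitions, and names and details here may differ:

```scala
import org.apache.spark.sql.types.DataType

// Simplified sketch of the API shape proposed in #24117; see the PR for the
// actual interfaces.
trait Expression

// A literal wraps a value in Spark's internal representation (see the
// Literal/InternalRow discussion below).
trait Literal[T] extends Expression {
  def value: T
  def dataType: DataType
}

// A column reference by (possibly nested) field names.
trait NamedReference extends Expression {
  def fieldNames: Array[String]
}

// A transform is a named function applied to arguments. Only Transform
// carries references, keeping Literal as simple as possible; references
// could be promoted to Expression later, but not removed once added.
trait Transform extends Expression {
  def name: String
  def references: Array[NamedReference]
  def arguments: Array[Expression]
}
```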
- PR #24117: transforms
  - Wenchen: Why not base Expression on the internal Expression class? For example, add references to Expression, not Transform.
  - Ryan: The intent is to be as simple as possible to start with. Literal doesn’t use references, and we can always promote the method to Expression later, but not the other way around.
  - Wenchen: What values are wrapped by Literal? InternalRow?
  - Ryan: The internal representations.
  - Matt: We would need to be careful about changing the internal representation, then.
  - Ryan: It is already difficult to change an internal representation, even without this. And these APIs already expose InternalRow, so we aren’t creating new problems.
  - Matt: Yes, this isn’t a new problem.
  - Ryan: I’ll follow up on the PR with documentation for the representations used by InternalRow.
  - Ryan: There is some confusion on the PR about what transforms are. Reynold asked what dateHour returns. Part of the problem is that “dateHour” is a confusing name; it should be “hourly” or “hours”. The name includes “date” to signal that it isn’t an hour in [0, 23]; it is an hour-granularity partition transform.
  - Ryan: These transforms are not concrete functions. There are many functions that can partition timestamps into hour or day granularity ranges. The concrete transform function doesn’t matter for configuring tables.
  - Ryan: For example, the “bucket” function is conceptually the same, but has a different concrete implementation in Spark, Hive, Iceberg, and Cassandra tables.
  - Russell: Commented on the craziness of Cassandra’s bucket transform.
  - Ryan: Spark needs to work with all of these functions. When Spark needs a concrete implementation to run a bucket join, it should look up the “bucket” function from the table’s catalog using the FunctionCatalog interface. That way, it can prepare data for the other side of a bucketed join to work with bucketed data from any source. (See the sketch at the end of these notes.)
- PR #24246: Add TableCatalog API
  - Ryan: This PR is based on #24117, but please start reviewing it now to make this go faster.

*Attendees*:

Ryan Blue
John Zhuge
Russell Spitzer
Gengliang Wang
Yuanjian Li
Matt Cheah
Yifei Huang
Felix Cheung
Dilip Biswal
Wenchen Fan

--
Ryan Blue
Software Engineer
Netflix
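Sketch referenced in the bucketed-join discussion above. The FunctionCatalog interface is still a proposal, so this is only a rough illustration of the idea, and every name below is hypothetical:

```scala
// Rough illustration only: FunctionCatalog is a proposal, and all of these
// names are hypothetical.
trait BoundFunction {
  // Apply the function to argument values,
  // e.g. (numBuckets, value) -> bucket id.
  def apply(args: Seq[Any]): Any
}

trait FunctionCatalog {
  // Look up a named function implementation from a table's catalog.
  def loadFunction(name: String): BoundFunction
}

// To run a bucketed join, Spark would resolve the table's own "bucket"
// implementation instead of assuming a single hash function, so it can
// repartition the other side of the join to line up with Spark, Hive,
// Iceberg, or Cassandra bucketing.
def bucketIdFor(catalog: FunctionCatalog, numBuckets: Int, value: Any): Any = {
  val bucket = catalog.loadFunction("bucket")
  bucket.apply(Seq(numBuckets, value))
}
```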