Re: Substrait, a new project focused on serialized algebra

Andrew Pilloud Wed, 08 Sep 2021 11:39:45 -0700

It looks like this project is just you at the moment? I don't see any
pull requests or a mailing list. (I'm not on slack.) This seems like
something that would be beneficial if you can get other projects to
buy into it. [0] How did you agree to the four indicator
implementations? Are those projects committed to making breaking
changes to move closer together?

We are working on something similar in Apache Beam. Our goal is a
unified model with high level APIs that allow users to switch out
different engines ("runners"), our implementation could be simplified
by a standard like yours. We take plans as SQL from Apache Calcite and
Google ZetaSQL, via a programmatic API, and hopefully other sources in
the future (Dataframes). We execute them on Beam runners (currently as
a Java implementation, possibly Arrow in the future). Eventually we
want these plans to run natively where supported (Flink, Spark or
Google Dataflow). Previously we've focused on the top end (turning
plans into Beam Java pipelines) but we are working to push the
relational metadata down the stack now. We need something that is a
superset of existing implementations, you appear more focused on a
subset. It could still work but Google's funding to build Relational
Beam is dependent on us providing support for internal use cases,
which means native execution on Dataflow with ZetaSQL. Apache Beam
isn't tied to ZetaSQL but we won't be able to adopt a standard that
prevents us from passing the ZetaSQL tests.

Our progression is very similar to your components list. We've already
prototyped most of the functionality inside our SQL codebase. We are
now pushing it into Beam core. The type system is mostly complete
(Beam Schemas[1]), and we are now adding support for Field
References[2].

Our pain points with the type system have been around precision of
DECIMAL, FLOAT, DOUBLE, and time types. Different implementations have
different levels of precision supported, this results in unexpected
behavior on the edge cases. On the numeric types, we've mostly defined
those as unlimited precision and added "logical types" (what you call
"User Defined Types") with restricted precision. For time types, we've
been pushing those into "User Defined Types". We have a catalog of
built-in "User Defined Types"[3]. We've effectively decided to provide
a superset of building block types.

The functions are significantly more difficult than the types. Inside
Google, ZetaSQL is the standard and that team has defined a huge set
of test cases[4] for their functions. It took them several years to
get a catalog built and for everyone to standardize on it. I'm not
sure the same thing is possible outside a company with a top down
mandate to unify. My understanding is that they based their function
catalog on the existing implementations that were being unified and
used external references (SQL standard, Postgres) to resolve
conflicts. They have a reference implementation that they consider the
source of truth. This specification is frequently incompatible with
Calcite, particularly around edge cases. Most engines had to make
breaking changes to adopt this standard (for a public example, see
BigQuery legacy SQL and standard SQL).

Have you put any thought into how you plan to define the function
catalog? What about validating an implementation's adherence to the
standard? How will you handle minor incompatibilities without
effectively having a function for each dialect? (TRIM_SPARK,
TRIM_TRINO, TRIM_ARROW, TRIM_ICEBERG...) What about functions that
aren't in your chosen dialects?

Andrew

[0] https://xkcd.com/927/
[1] 
https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L419
[2] 
https://docs.google.com/document/d/1eHSO3aIsAUmiVtfDL-pEFenNBRKt26KkF4dQloMhpBQ/edit
[3] 
https://github.com/apache/beam/tree/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/logicaltypes
[4] https://github.com/google/zetasql/tree/master/zetasql/compliance

On Wed, Sep 8, 2021 at 8:31 AM Jacques Nadeau <[email protected]> wrote:
>
> Hey all,
>
> For some time I've been thinking that having a common serialized
> representation of query plans would be helpful across multiple related
> projects. I started working on something independently in this space
> several months ago. Since then, Arrow started exploring "Arrow IR" and
> Iceberg was proposing something similar to support a cross-engine
> structured view. Given the different veins of interest, I think we should
> combine forces on a consolidated consensus-driven solution.
>
> As I've had more conversations with different people, I've come to the
> conclusion that given the complexity of the task and people's
> competing priorities, a separate "Switzerland project" is the best way to
> find common ground. As such, I've started to sketch out a specification [1]
> called Substrait. One of my key goals with this effort is to expose Calcite
> functionality to more users and expose alternative ways to encapsulate
> Calcite functionality as a microservice or series of microservices.
>
> For those that are interested, please join the Substrait Slack. My first
> goal is to come to a consensus on the type system of simple [2], compound
> [3] and physical [4] types. The general approach I'm proposing:
>
>    - Use Spark, Trino, Arrow and Iceberg as the four indicators of whether
>    something should be part of the spec. It must exist in at least two systems
>    to be formalized.
>    - Avoid a formal distinction between logical and physical (types,
>    operators, etc)
>    - Lean more towards simple types than compound types when systems
>    generally use only a constrained set of parameters (e.g. timestamp(3) and
>    timestamp(6) as opposed to timestamp(x)).
>    - Provide substantial structured extensibility (avoid black boxes as
>    much as possible)
>
>
> Links for Substrait:
> Site: https://substrait.io
> Spec source: https://github.com/substrait-io/substrait/tree/main/site/docs
> Binary format: https://github.com/substrait-io/substrait/tree/main/binary
>
> Would love to hear your thoughts!
> Jacques
>
> [1] https://substrait.io/spec/specification/#components
> [2] https://substrait.io/types/simple_logical_types/
> [3] https://substrait.io/types/compound_logical_types/
> [4] https://substrait.io/types/physical_types/

Re: Substrait, a new project focused on serialized algebra

Reply via email to