Re: Substrait, a new project focused on serialized algebra

Jacques Nadeau Sun, 03 Oct 2021 21:30:31 -0700

Hey Andrew,

Sorry for the late reply. For some reason, your email never came through to
me and I didn't know about it until someone just referenced it within the
apache mail archive. I'll respond to your questions inline. In general,
we're experimenting with the GitHub discussions functionality as opposed to
using a mailing list [1], so it would be great if you also continued discussion
there.




> This seems like something that would be beneficial if you can get other
> projects to buy into it. [xkcd]


Completely agree. This project is designed specifically to help other
projects work together and will only be impactful if it can be leveraged by
several projects. I think we've made good progress thus far with
contributions from several key communities (Arrow C++, DataFusion, Iceberg,
Presto/Velox, SingleStore and others). The xkcd reference is rough. I agree
with the general sentiment but I'm not aware of any initiative that is
actually trying to do this cross-project (there are initiatives within a
single project that take a very "local" perspective). To me, the
opportunity is about solving for extensibility and a library of common
conversions between varying standards. This is much like the LLVM project.
There were many people building languages and compilers before LLVM. By
building a standardized intermediate representation, LLVM enabled frontends
to build stuff independent of the backend targets of those creations.
Hopefully the same thing will be true for processing systems with Substrait.


> How did you agree to the four indicator implementations?


They were a proposed sketch. We've had further discussion about that topic
in this discussion topic [2]. Keep in mind that those projects are not
fundamental to Substrait but were just used to help guide an
initial attempt at coming up with a system-agnostic set of common modern
data types.

> Are those projects committed to making breaking changes to move closer
> together?


No and Substrait doesn't strive to achieve that. The goal of Substrait is
to come up with a framework, a set of common primitives, and a strong
extensibility system so systems that have specialized needs can still work
with the Substrait format and specification and cross-operate with other
systems.

> We are working on something similar in Apache Beam. Our goal is a
> unified model with high level APIs that allow users to switch out
> different engines ("runners"), our implementation could be simplified
> by a standard like yours. We take plans as SQL from Apache Calcite and
> Google ZetaSQL, via a programmatic API, and hopefully other sources in
> the future (Dataframes). We execute them on Beam runners (currently as
> a Java implementation, possibly Arrow in the future). Eventually we
> want these plans to run natively where supported (Flink, Spark or
> Google Dataflow). Previously we've focused on the top end (turning
> plans into Beam Java pipelines) but we are working to push the
> relational metadata down the stack now. We need something that is a
> superset of existing implementations,


This is exactly why I think there is a need for Substrait. I've talked to
people at nearly a dozen different companies in the last few months and
they are almost all struggling with these issues in one form or another.
People are starting to think about frontend data processing plan producers
and backend data processing engines independently in a more mature
manner. Without a standardized way to connect the two, we get a lot of
independent engines that can't work together (and a lot of dialect
variations that can only target a single engine).


> you appear more focused on a subset.


This is a misinterpretation of the goals of Substrait. While it is true
that Substrait will only have a subset of all possible data types, function
signatures and relational operations in the project proper, the goals of
the extension system are that any project can use additional git-uri based
extensions to describe the additional functionality.

> It could still work but Google's funding to build Relational
> Beam is dependent on us providing support for internal use cases,
> which means native execution on Dataflow with ZetaSQL. Apache Beam
> isn't tied to ZetaSQL but we won't be able to adopt a standard that
> prevents us from passing the ZetaSQL tests.


> The functions are significantly more difficult than the types. Inside
> Google, ZetaSQL is the standard and that team has defined a huge set
> of test cases[4] for their functions. It took them several years to
> get a catalog built and for everyone to standardize on it. I'm not
> sure the same thing is possible outside a company with a top down
> mandate to unify. My understanding is that they based their function
> catalog on the existing implementations that were being unified and
> used external references (SQL standard, Postgres) to resolve
> conflicts. They have a reference implementation that they consider the
> source of truth. This specification is frequently incompatible with
> Calcite, particularly around edge cases. Most engines had to make
> breaking changes to adopt this standard (for a public example, see
> BigQuery legacy SQL and standard SQL).


100% understand and appreciate this. I'm very familiar with this problem
having worked on federation across a dozen different relational databases
each with their own subset of function signatures and semantics (in a
previous life). Trying to get everyone to agree, as it sounds like the
internal Google teams did, is unrealistic in the context of a
community-driven initiative that reaches across many separate independent
OSS projects. This is not the intention of the Substrait project. Instead,
for example, I would expect Beam to declare a number of additional function
signatures beyond those defined in the Substrait project proper. Substrait
defines the way function signatures are declared as well as a library of
functions common to SQL implementations but it will obviously be the case
that other systems will have varying function needs. For example, in our
discussions with the SingleStore community, we talked about how SingleStore
typically upconverts floats to doubles for most operations whereas other
systems will maintain floats. In that situation, they would have
alternative implementations of things like add that have different output
derivation rules.
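
As a rough illustration of that last point, here is a minimal Python sketch
of two "add" signatures that differ only in how the output type is derived.
Every name in it (FunctionSignature, the catalog labels, the "fp32"/"fp64"
type strings) is hypothetical and chosen for illustration; it is not the
actual Substrait declaration format.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class FunctionSignature:
    """Hypothetical stand-in for a declared function signature."""
    catalog: str                      # illustrative catalog label
    name: str
    arg_types: Tuple[str, ...]
    derive_output: Callable[[Tuple[str, ...]], str]


def keep_widest(arg_types: Tuple[str, ...]) -> str:
    # Output stays fp32 unless an input is already fp64.
    return "fp64" if "fp64" in arg_types else "fp32"


def always_double(arg_types: Tuple[str, ...]) -> str:
    # Output is always upconverted to fp64.
    return "fp64"


# Same name and arguments, different output derivation rules.
standard_add = FunctionSignature(
    "example_standard", "add", ("fp32", "fp32"), keep_widest)
upconverting_add = FunctionSignature(
    "example_upconverting", "add", ("fp32", "fp32"), always_double)


def output_type(sig: FunctionSignature) -> str:
    return sig.derive_output(sig.arg_types)
```

A consumer that understands both catalogs can tell from the signature alone
which result type the producer intended.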

> Have you put any thought into how you plan to define the function catalog?


Quite a bit. You can see the latest sketch of some initial scalar functions
in this file [3]. As mentioned, the set of functions is extensible so there
may be many catalogs.


> What about validating an implementation's adherence to the standard?


The ultimate goal here would be a set of inputs and outputs for each
function signature, along with links to implementations for each in common
patterns (e.g. WebAssembly, C/LLVM IR). However, I don't see Substrait's
goals to be the "certification" of implementations, rather the exposure of
toolkits to self-validate.
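
To make the "inputs and outputs per function signature" idea concrete, a
self-validation toolkit could be as small as the Python sketch below. The
Case type, the ADD_I32_CASES table and the validate helper are assumptions
made up for this example; no such test-case format exists in the project
as of this email.

```python
from typing import Any, Callable, List, NamedTuple, Tuple


class Case(NamedTuple):
    """One published input/output pair for a function signature."""
    args: Tuple[Any, ...]
    expected: Any


# Hypothetical published cases for a 32-bit integer add signature.
ADD_I32_CASES = [
    Case((1, 2), 3),
    Case((-5, 5), 0),
    Case((2147483647, 0), 2147483647),
]


def validate(impl: Callable[..., Any], cases: List[Case]) -> List[Case]:
    """Run an engine's implementation and return the cases it fails."""
    return [c for c in cases if impl(*c.args) != c.expected]
```

An engine would run validate against its own implementation and publish (or
fix) whatever comes back; nobody is certifying anything on its behalf.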


> How will you handle minor incompatibilities without
> effectively having a function for each dialect? (TRIM_SPARK,
> TRIM_TRINO, TRIM_ARROW, TRIM_ICEBERG...)


Separate semantics would be separate functions. They'd be declared with
either different names or using different catalogs. (Note that this doesn't
stop an engine from deciding to cross-bind their version of trim to the one
defined inside the Substrait standard library.)
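
A minimal Python sketch of what that could look like from an engine's side,
assuming functions are identified by a (catalog, name) pair. The catalog
labels, the EngineBindings class and the two trim behaviours are all
hypothetical; in particular, the trim semantics assigned to each catalog
here are invented for the example.

```python
from typing import Callable, Dict, Tuple

FunctionKey = Tuple[str, str]  # (catalog label, function name)


class EngineBindings:
    """Maps declared functions to one engine's native implementations."""

    def __init__(self) -> None:
        self._impls: Dict[FunctionKey, Callable[[str], str]] = {}

    def bind(self, catalog: str, name: str,
             impl: Callable[[str], str]) -> None:
        self._impls[(catalog, name)] = impl

    def call(self, catalog: str, name: str, value: str) -> str:
        return self._impls[(catalog, name)](value)


def native_trim(s: str) -> str:
    # Engine's own trim: strips space characters only.
    return s.strip(" ")


def whitespace_trim(s: str) -> str:
    # A different semantic: strips all leading/trailing whitespace.
    return s.strip()


engine = EngineBindings()
# Cross-bind: the engine judges its native trim equivalent to the
# (hypothetical) standard-library trim, so both keys share one function.
engine.bind("example_engine", "trim", native_trim)
engine.bind("example_standard", "trim", native_trim)
# Different semantics live under a different catalog.
engine.bind("example_dialect_b", "trim", whitespace_trim)
```

The same name resolves to different behaviour only when the catalog differs,
so a plan never has to guess which trim was meant.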


> What about functions that aren't in your chosen dialects?


I think this goes back to the previous comment I made about a
misinterpretation of the goals of Substrait. The goal here is not for us to
make sure that every system follows the exact same rules. The goal is for
there to be a clear way to express a system's intention for common
production and/or consumption.

One of the key challenges in all of this is going to be the transformation
from one set of semantics to another. This will always be a difficult
problem. However, I expect the Substrait project to not only include a
number of separate function catalogs (e.g. Arrow functions, Presto
functions, etc.) but also a set of both lossless and lossy polyfills that
help systems map different function sets to one another. The goal of
Substrait is clarity of intention but it is possible that when you move
from one system to another, that system will decide to use lossy polyfills
to solve a potential piece of functionality. In those cases, the goal would
be that the system would clearly state what kinds of lossy Substrait
transformations they undertake. I also expect that in time, systems created
after Substrait is better established will coalesce around a set of common
pieces of functionality. (The OSS/organic way to achieve what Google
achieved through top-down management.)
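
A sketch of how a consumer might apply polyfills and report the lossy ones,
in Python. The Polyfill record, the catalog labels and the example mappings
are hypothetical and only illustrate the idea of stating which
transformations were lossy; none of this is an existing Substrait API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

FunctionKey = Tuple[str, str]  # (catalog label, function name)


@dataclass(frozen=True)
class Polyfill:
    target: FunctionKey
    lossy: bool
    note: str = ""


# Hypothetical mappings from one function catalog to another.
POLYFILLS: Dict[FunctionKey, Polyfill] = {
    ("catalog_a", "trim"): Polyfill(("catalog_b", "trim"), lossy=False),
    ("catalog_a", "add_fp32"): Polyfill(
        ("catalog_b", "add_fp64"), lossy=True,
        note="result widened to fp64"),
}


def translate(
    plan_functions: List[FunctionKey],
    polyfills: Dict[FunctionKey, Polyfill],
) -> Tuple[List[FunctionKey], List[str]]:
    """Rewrite function references and list every lossy substitution."""
    translated: List[FunctionKey] = []
    lossy_report: List[str] = []
    for key in plan_functions:
        fill = polyfills[key]  # KeyError means no polyfill is available
        translated.append(fill.target)
        if fill.lossy:
            lossy_report.append(
                f"{key[0]}.{key[1]} -> "
                f"{fill.target[0]}.{fill.target[1]}: {fill.note}")
    return translated, lossy_report
```

The second return value is the important part: it is the system's clear
statement of which lossy transformations it undertook.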

As I said, sorry for the late response. I'm still surprised that I never
received your email. I would be happy to discuss further here as well as to
collaborate on the discussions on the Substrait github.

Thanks,
Jacques


[1] https://github.com/substrait-io/substrait/discussions
[2] https://github.com/substrait-io/substrait/discussions/2
[3] https://github.com/substrait-io/substrait/blob/b950440a10a9b0dd5d3e936fa54be21d6ea2ccb8/extensions/scalar_functions.yaml
