There are many projects in the Spark ecosystem — like Deequ and Great Expectations — that are focused on expressing and enforcing data quality checks.
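For context, a simple suite of checks in one of these frameworks might look roughly like the following PyDeequ-style sketch. The doctors table and its columns are made up for illustration, and the exact Check/VerificationSuite method names should be double-checked against PyDeequ's docs; treat this as illustrative rather than exact:

    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationSuite

    # Run a small suite of column-level checks against a (hypothetical) doctors table.
    result = (
        VerificationSuite(spark)
        .onData(spark.table("doctors"))
        .addCheck(
            Check(spark, CheckLevel.Error, "doctors checks")
            .isComplete("doctor_id")    # no NULLs in the key column
            .isUnique("doctor_id")      # key values are unique
            .hasSize(lambda n: n >= 1)  # table is not empty
        )
        .run()
    )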
In the more complex cases, these checks go beyond what a typical data source can enforce natively (e.g. PK, FK, CHECK constraints), so these projects have designed their own APIs for expressing various constraints, plus their own analyzers for figuring out how to run all the checks with as little repeated scanning or computation as possible.

It's a bit forward looking since this is still a proposal, but I wonder if the Pipelines API will eventually be able to address this kind of use case directly. I believe arbitrary data constraints can be expressed naturally as materialized views in a pipeline <https://nchammas.com/writing/query-language-constraint-language>, and a hypothetical Pipelines API could look something like this:

    @pipelines.assertion
    def sufficient_on_call_coverage():
        return (
            spark.table("doctors")
            .where(col("on_call"))
            .select(count("*") >= 1)
        )

With visibility into the dependencies of a given assertion or constraint, a Pipeline could perhaps figure out how to enforce it without wasting resources rereading or recomputing data across nodes in the graph. (A rough sketch of how this check might reduce to a materialized view is at the end of this message.)

Again, this is a bit forward looking, but is this kind of idea "on topic" for the Pipelines API?

Nick

> On Apr 5, 2025, at 5:30 PM, Sandy Ryza <sa...@apache.org> wrote:
>
> Hi all – starting a discussion thread for a SPIP that I've been working on with Chao Sun, Kent Yao, Yuming Wang, and Jie Yang: [JIRA <https://issues.apache.org/jira/browse/SPARK-51727>] [Doc <https://docs.google.com/document/d/1PsSTngFuRVEOvUGzp_25CQL1yfzFHFr02XdMfQ7jOM4/edit?tab=t.0>].
>
> The SPIP proposes extending Spark's lazy, declarative execution model beyond single queries, to pipelines that keep multiple datasets up to date. It introduces the ability to compose multiple transformations into a single declarative dataflow graph.
>
> Declarative pipelines aim to simplify the development and management of data pipelines, by removing the need for manual orchestration of dependencies and making it possible to catch many errors before any execution steps are launched.
>
> Declarative pipelines can include both batch and streaming computations, leveraging Structured Streaming for stream processing and new materialized view syntax for batch processing. Tight integration with Spark SQL's analyzer enables deeper analysis and earlier error detection than is achievable with more generic frameworks.
>
> Let us know what you think!
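For concreteness, here is the materialized-view formulation referenced above, as a rough sketch only. None of this is proposed syntax: the pipelines.materialized_view and pipelines.assertion decorators are hypothetical, as is the doctors table; only the plain PySpark calls are real. The point is that the assertion depends on a small materialized view, so a pipeline with visibility into that dependency could keep the check up to date incrementally instead of rescanning the base table on every evaluation:

    from pyspark.sql.functions import col, count

    # Hypothetical: a small materialized view that the assertion depends on.
    @pipelines.materialized_view
    def on_call_doctor_count():
        return (
            spark.table("doctors")
            .where(col("on_call"))
            .agg(count("*").alias("on_call_count"))
        )

    # Hypothetical: the constraint itself, expressed as a query over that view.
    # It should evaluate to true after every pipeline update.
    @pipelines.assertion
    def sufficient_on_call_coverage():
        return (
            spark.table("on_call_doctor_count")
            .select(col("on_call_count") >= 1)
        )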