Hi Angel,

This feature involves 3 parties:

- The end-user specifies constraints for their tables, via the SQL syntax provided by Spark.
- Spark propagates the constraints to the backend connector of the tables, and performs data validation during data writing if the connector asks Spark to do so.
- The connector receives and stores the constraints, and exposes this information to the engines it supports, for data validation and/or query optimization.
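To make the flow concrete, here is a rough sketch (the table, column, and constraint names are made up, and the exact DDL is whatever the SPIP specifies):

import org.apache.spark.sql.SparkSession

object ConstraintFlowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("constraint-flow-sketch").getOrCreate()

    // 1. End-user: declare a CHECK constraint on a DS v2 table via SQL.
    spark.sql("ALTER TABLE cat.db.orders ADD CONSTRAINT positive_amount CHECK (amount > 0)")

    // 2. Spark: propagate the constraint to the table's connector and, if the
    //    connector asks Spark to enforce it, validate rows while writing.
    //    A row that violates the constraint would make this insert fail.
    spark.sql("INSERT INTO cat.db.orders VALUES (1, -5.0)")

    // 3. Connector: store the constraint and expose it to the engines it
    //    supports, for validation and/or query optimization.
    spark.stop()
  }
}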
The protocol between Spark and connectors is DS v2, making DS v2 the best place to put this feature. There are also other data validation frameworks, such as spark-expectations <https://engineering.nike.com/spark-expectations/v2.1.1/>, but that's an orthogonal topic. Table constraints themselves are a standard SQL feature that many databases support. I think it's reasonable for Spark to support them as well.

Thanks,
Wenchen

On Wed, Mar 26, 2025 at 2:45 PM Chao Sun <sunc...@apache.org> wrote:

> +1
>
> On Tue, Mar 25, 2025 at 10:22 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>
>> I meant ... a data validation API would be great, but why in DSv2? Isn't data validation something more general? Do we have to use DSv2 to have our data validated?
>>
>> On Wed, Mar 26, 2025 at 6:15 Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>
>>> For me, data validation is one thing, and exporting that data to an external system is something entirely different. Should data validation be coupled with the external system? I don't think so. But since I'm the only one arguing against this proposal, does that mean I'm wrong?
>>>
>>> On Wed, Mar 26, 2025 at 6:05 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> As Gengliang explained, the API allows connectors to request that Spark perform data validation, but connectors can also choose to do the validation themselves. I think it's a reasonable design, as not all connectors are able to validate data on their own, such as file formats that have no backend service.
>>>>
>>>> On Wed, Mar 26, 2025 at 12:56 AM Gengliang Wang <ltn...@gmail.com> wrote:
>>>>
>>>>> Hi Ángel,
>>>>>
>>>>> Thanks for the feedback. Besides the existing NOT NULL constraint, the proposal suggests enforcing only *check constraints* by default in Spark, as they're straightforward and practical to validate at the engine level. Additionally, the SPIP proposes allowing connectors (like JDBC) to handle constraint validation externally:
>>>>>
>>>>>> Some connectors, like JDBC, may skip validation in Spark and simply pass the constraint through. These connectors must declare ACCEPT_UNVALIDATED_CONSTRAINTS in their table capabilities, indicating they would handle constraint enforcement themselves.
>>>>>
>>>>> This approach should help improve data accuracy and consistency by clearly defining responsibilities and enforcing constraints closer to where they're best managed.
>>>>>
>>>>> On Sat, Mar 22, 2025 at 12:32 AM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>>>>
>>>>>> One thing is enforcing the quality of the data Spark is producing, and another thing entirely is defining an external data model from Spark.
>>>>>>
>>>>>> The proposal doesn't necessarily facilitate data accuracy and consistency. Defining constraints does help with that, but the question remains: is Spark truly responsible for enforcing those constraints on an external system?
>>>>>>
>>>>>> On Fri, Mar 21, 2025 at 21:29 Anton Okolnychyi (<aokolnyc...@gmail.com>) wrote:
>>>>>>
>>>>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints should be defined and enforced by the data sources themselves, not Spark.
>>>>>>>> Spark is a processing engine, and enforcing constraints at this level blurs architectural boundaries, making Spark responsible for something it does not control.
>>>>>>>
>>>>>>> I disagree that this breaks the chain of responsibility. It may be quite the opposite, in fact. Spark is already responsible for enforcing NOT NULL constraints by adding AssertNotNull for required columns today. Connectors like Iceberg and Delta store constraint definitions but rely on engines like Spark to enforce them during INSERT, DELETE, UPDATE, and MERGE operations. Without this API, each connector would need to reimplement the same logic, creating duplication.
>>>>>>>
>>>>>>> The proposal is aligned with the SQL standard and other relational databases. In my view, it simply makes Spark a better engine, facilitates data accuracy and consistency, and enables performance optimizations.
>>>>>>>
>>>>>>> - Anton
>>>>>>>
>>>>>>> On Fri, Mar 21, 2025 at 12:59 Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>>>>>>
>>>>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints should be defined and enforced by the data sources themselves, not Spark. Spark is a processing engine, and enforcing constraints at this level blurs architectural boundaries, making Spark responsible for something it does not control.
>>>>>>>>
>>>>>>>> On Fri, Mar 21, 2025 at 20:18 L. C. Hsieh (<vii...@gmail.com>) wrote:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> On Fri, Mar 21, 2025 at 12:13 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > +1
>>>>>>>>> >
>>>>>>>>> > On Fri, Mar 21, 2025 at 12:08 PM Denny Lee <denny.g....@gmail.com> wrote:
>>>>>>>>> >>
>>>>>>>>> >> +1 (non-binding)
>>>>>>>>> >>
>>>>>>>>> >> On Fri, Mar 21, 2025 at 11:52 Gengliang Wang <ltn...@gmail.com> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> +1
>>>>>>>>> >>>
>>>>>>>>> >>> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>>>>>>>> >>>>
>>>>>>>>> >>>> Hi all,
>>>>>>>>> >>>>
>>>>>>>>> >>>> I would like to start a vote on adding support for constraints to DSv2.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Discussion thread: https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
>>>>>>>>> >>>> SPIP: https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo
>>>>>>>>> >>>> PR with the API changes: https://github.com/apache/spark/pull/50253
>>>>>>>>> >>>> JIRA: https://issues.apache.org/jira/browse/SPARK-51207
>>>>>>>>> >>>>
>>>>>>>>> >>>> Please vote on the SPIP for the next 72 hours:
>>>>>>>>> >>>>
>>>>>>>>> >>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>> >>>> [ ] +0
>>>>>>>>> >>>> [ ] -1: I don't think this is a good idea because …
>>>>>>>>> >>>>
>>>>>>>>> >>>> - Anton
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
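A minimal, hypothetical sketch of the connector-side contract quoted above: a DSv2 table declares the proposed ACCEPT_UNVALIDATED_CONSTRAINTS capability so that Spark passes constraints through instead of validating them itself. The capability name comes from the SPIP; the class, table, and everything else here is illustrative and assumes the API lands as proposed in the PR.

import java.util
import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

// Hypothetical JDBC-backed DSv2 table that asks Spark NOT to validate
// constraints on write, because the backing database enforces them itself.
class JdbcBackedTable(tableName: String, tableSchema: StructType) extends Table {
  override def name(): String = tableName
  override def schema(): StructType = tableSchema

  // ACCEPT_UNVALIDATED_CONSTRAINTS is the capability proposed in the SPIP/PR;
  // it is not part of released Spark at the time of this thread.
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_WRITE, TableCapability.ACCEPT_UNVALIDATED_CONSTRAINTS)
}

Connectors without such a backend service (for example, plain file formats) would simply omit the capability and rely on Spark to validate rows on write, which is the default the SPIP proposes for CHECK constraints.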