For me, data validation is one thing, and exporting that data to an
external system is something entirely different. Should data validation be
coupled with the external system? I don't think so. But since I'm the only
one arguing against this proposal, does that mean I'm wrong?

On Wed, Mar 26, 2025, 6:05 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> +1
>
> As Gengliang explained, the API allows connectors to request that Spark
> perform data validation, but connectors can also choose to do the
> validation themselves. I think it's a reasonable design, as not all
> connectors are able to validate data on their own, such as file formats
> that do not have a backend service.
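>
> As a rough sketch, here is how a connector might opt out under the
> proposed API (Scala). ACCEPT_UNVALIDATED_CONSTRAINTS is the capability
> the SPIP proposes, not an enum value in Spark today, and the class and
> table names are made up for illustration:
>
>   import java.util
>   import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
>   import org.apache.spark.sql.types.StructType
>
>   // A JDBC-style table that asks Spark to pass constraints through
>   // instead of validating them at write time.
>   class JdbcBackedTable extends Table {
>     override def name(): String = "jdbc_backed_table"
>     override def schema(): StructType = new StructType()  // elided
>     // Proposed in the SPIP; does not exist in Spark yet.
>     override def capabilities(): util.Set[TableCapability] =
>       util.EnumSet.of(TableCapability.ACCEPT_UNVALIDATED_CONSTRAINTS)
>   }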
>
> On Wed, Mar 26, 2025 at 12:56 AM Gengliang Wang <ltn...@gmail.com> wrote:
>
>> Hi Ángel,
>>
>> Thanks for the feedback. Besides the existing NOT NULL constraint, the
>> proposal suggests enforcing only *check constraints* by default in
>> Spark, as they’re straightforward and practical to validate at the engine
>> level. Additionally, the SPIP proposes allowing connectors (like JDBC) to
>> handle constraint validation externally:
>>
>> Some connectors, like JDBC, may skip validation in Spark and simply pass
>>> the constraint through. These connectors must declare
>>> ACCEPT_UNVALIDATED_CONSTRAINTS in their table capabilities, indicating they
>>> would handle constraint enforcement themselves.
>>
>>
>> This approach should help improve data accuracy and consistency by
>> clearly defining responsibilities and enforcing constraints closer to where
>> they’re best managed.
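>>
>> To make the default concrete, a hypothetical example in Scala, assuming
>> a SparkSession named spark and a made-up table name; the DDL shape
>> follows the SPIP, but the exact syntax and behavior are not final:
>>
>>   // Spark would validate the CHECK constraint on write by default.
>>   spark.sql("""
>>     CREATE TABLE catalog.db.events (
>>       id BIGINT NOT NULL,
>>       amount DOUBLE,
>>       CONSTRAINT positive_amount CHECK (amount > 0)
>>     )
>>   """)
>>
>>   // This row violates positive_amount, so Spark-side validation
>>   // would reject the write.
>>   spark.sql("INSERT INTO catalog.db.events VALUES (1, -5.0)")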
>>
>>
>> On Sat, Mar 22, 2025 at 12:32 AM Ángel Álvarez Pascua <
>> angel.alvarez.pas...@gmail.com> wrote:
>>
>>> Enforcing the quality of the data Spark produces is one thing;
>>> defining an external data model from Spark is another thing entirely.
>>>
>>>
>>> The proposal doesn’t necessarily facilitate data accuracy and
>>> consistency. Defining constraints does help with that, but the question
>>> remains: Is Spark truly responsible for enforcing those constraints on an
>>> external system?
>>>
>>> On Fri, Mar 21, 2025 at 9:29 PM Anton Okolnychyi (<
>>> aokolnyc...@gmail.com>) wrote:
>>>
>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
>>>>> should be defined and enforced by the data sources themselves, not Spark.
>>>>> Spark is a processing engine, and enforcing constraints at this level 
>>>>> blurs
>>>>> architectural boundaries, making Spark responsible for something it does
>>>>> not control.
>>>>>
>>>>
>>>> I disagree that this breaks the chain of responsibility. It may be
>>>> quite the opposite, in fact. Spark is already responsible for enforcing NOT
>>>> NULL constraints by adding AssertNotNull for required columns today.
>>>> Connectors like Iceberg and Delta store constraint definitions but rely on
>>>> engines like Spark to enforce them during INSERT, DELETE, UPDATE, and MERGE
>>>> operations. Without this API, each connector would need to reimplement the
>>>> same logic, creating duplication.
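>>>>
>>>> You can already observe this for NOT NULL today. A minimal sketch,
>>>> assuming a SparkSession named spark and a table (name made up) whose
>>>> id column is declared NOT NULL:
>>>>
>>>>   // During analysis Spark wraps required output columns in
>>>>   // AssertNotNull, so a null fails at runtime inside Spark,
>>>>   // before the row ever reaches the connector.
>>>>   val df = spark.sql("SELECT CAST(null AS BIGINT) AS id")
>>>>   df.writeTo("catalog.db.events").append()
>>>>   // => fails with a not-null assertion error from Spark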
>>>>
>>>> The proposal is aligned with the SQL standard and other relational
>>>> databases. In my view, it simply makes Spark a better engine, facilitates
>>>> data accuracy and consistency, and enables performance optimizations.
>>>>
>>>> - Anton
>>>>
>>>> On Fri, Mar 21, 2025 at 12:59 PM Ángel Álvarez Pascua <
>>>> angel.alvarez.pas...@gmail.com> wrote:
>>>>
>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
>>>>> should be defined and enforced by the data sources themselves, not Spark.
>>>>> Spark is a processing engine, and enforcing constraints at this level 
>>>>> blurs
>>>>> architectural boundaries, making Spark responsible for something it does
>>>>> not control.
>>>>>
>>>>> On Fri, Mar 21, 2025 at 8:18 PM L. C. Hsieh (<vii...@gmail.com>)
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Fri, Mar 21, 2025 at 12:13 PM huaxin gao <huaxin.ga...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > +1
>>>>>> >
>>>>>> > On Fri, Mar 21, 2025 at 12:08 PM Denny Lee <denny.g....@gmail.com>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> +1 (non-binding)
>>>>>> >>
>>>>>> >> On Fri, Mar 21, 2025 at 11:52 Gengliang Wang <ltn...@gmail.com>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> +1
>>>>>> >>>
>>>>>> >>> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi <
>>>>>> aokolnyc...@gmail.com> wrote:
>>>>>> >>>>
>>>>>> >>>> Hi all,
>>>>>> >>>>
>>>>>> >>>> I would like to start a vote on adding support for constraints
>>>>>> to DSv2.
>>>>>> >>>>
>>>>>> >>>> Discussion thread:
>>>>>> https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
>>>>>> >>>> SPIP:
>>>>>> https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo
>>>>>> >>>> PR with the API changes:
>>>>>> https://github.com/apache/spark/pull/50253
>>>>>> >>>> JIRA: https://issues.apache.org/jira/browse/SPARK-51207
>>>>>> >>>>
>>>>>> >>>> Please vote on the SPIP for the next 72 hours:
>>>>>> >>>>
>>>>>> >>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>> >>>> [ ] +0
>>>>>> >>>> [ ] -1: I don’t think this is a good idea because …
>>>>>> >>>>
>>>>>> >>>> - Anton
>>>>>>