For me, data validation is one thing, and exporting that data to an external system is something entirely different. Should data validation be coupled with the external system? I don't think so. But since I'm the only one arguing against this proposal, does that mean I'm wrong?
On Wed, Mar 26, 2025 at 6:05 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

> +1
>
> As Gengliang explained, the API allows the connectors to request Spark to
> perform data validations, but connectors can also choose to do validation
> by themselves. I think it's a reasonable design, as not all connectors have
> the ability to do data validation by themselves, such as file formats that
> do not have a backend service.
>
> On Wed, Mar 26, 2025 at 12:56 AM Gengliang Wang <ltn...@gmail.com> wrote:
>
>> Hi Ángel,
>>
>> Thanks for the feedback. Besides the existing NOT NULL constraint, the
>> proposal suggests enforcing only *check constraints* by default in
>> Spark, as they're straightforward and practical to validate at the engine
>> level. Additionally, the SPIP proposes allowing connectors (like JDBC) to
>> handle constraint validation externally:
>>
>>> Some connectors, like JDBC, may skip validation in Spark and simply pass
>>> the constraint through. These connectors must declare
>>> ACCEPT_UNVALIDATED_CONSTRAINTS in their table capabilities, indicating
>>> they would handle constraint enforcement themselves.
>>
>> This approach should help improve data accuracy and consistency by
>> clearly defining responsibilities and enforcing constraints closer to
>> where they're best managed.
>>
>> On Sat, Mar 22, 2025 at 12:32 AM Ángel Álvarez Pascua <
>> angel.alvarez.pas...@gmail.com> wrote:
>>
>>> One thing is enforcing the quality of the data Spark is producing, and
>>> another thing entirely is defining an external data model from Spark.
>>>
>>> The proposal doesn't necessarily facilitate data accuracy and
>>> consistency. Defining constraints does help with that, but the question
>>> remains: Is Spark truly responsible for enforcing those constraints on
>>> an external system?
>>>
>>> On Fri, Mar 21, 2025 at 9:29 PM, Anton Okolnychyi (<
>>> aokolnyc...@gmail.com>) wrote:
>>>
>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
>>>>> should be defined and enforced by the data sources themselves, not
>>>>> Spark. Spark is a processing engine, and enforcing constraints at this
>>>>> level blurs architectural boundaries, making Spark responsible for
>>>>> something it does not control.
>>>>
>>>> I disagree that this breaks the chain of responsibility. It may be
>>>> quite the opposite, in fact. Spark is already responsible for enforcing
>>>> NOT NULL constraints by adding AssertNotNull for required columns today.
>>>> Connectors like Iceberg and Delta store constraint definitions but rely
>>>> on engines like Spark to enforce them during INSERT, DELETE, UPDATE,
>>>> and MERGE operations. Without this API, each connector would need to
>>>> reimplement the same logic, creating duplication.
>>>>
>>>> The proposal is aligned with the SQL standard and other relational
>>>> databases. In my view, it simply makes Spark a better engine,
>>>> facilitates data accuracy and consistency, and enables performance
>>>> optimizations.
>>>>
>>>> - Anton
>>>>
>>>> On Fri, Mar 21, 2025 at 12:59 PM, Ángel Álvarez Pascua <
>>>> angel.alvarez.pas...@gmail.com> wrote:
>>>>
>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
>>>>> should be defined and enforced by the data sources themselves, not
>>>>> Spark. Spark is a processing engine, and enforcing constraints at this
>>>>> level blurs architectural boundaries, making Spark responsible for
>>>>> something it does not control.
>>>>>
>>>>> On Fri, Mar 21, 2025 at 8:18 PM, L. C. Hsieh (<vii...@gmail.com>) wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Fri, Mar 21, 2025 at 12:13 PM huaxin gao <huaxin.ga...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > +1
>>>>>> >
>>>>>> > On Fri, Mar 21, 2025 at 12:08 PM Denny Lee <denny.g....@gmail.com>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> +1 (non-binding)
>>>>>> >>
>>>>>> >> On Fri, Mar 21, 2025 at 11:52 Gengliang Wang <ltn...@gmail.com>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> +1
>>>>>> >>>
>>>>>> >>> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi <
>>>>>> aokolnyc...@gmail.com> wrote:
>>>>>> >>>>
>>>>>> >>>> Hi all,
>>>>>> >>>>
>>>>>> >>>> I would like to start a vote on adding support for constraints
>>>>>> to DSv2.
>>>>>> >>>>
>>>>>> >>>> Discussion thread:
>>>>>> https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
>>>>>> >>>> SPIP:
>>>>>> https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo
>>>>>> >>>> PR with the API changes:
>>>>>> https://github.com/apache/spark/pull/50253
>>>>>> >>>> JIRA: https://issues.apache.org/jira/browse/SPARK-51207
>>>>>> >>>>
>>>>>> >>>> Please vote on the SPIP for the next 72 hours:
>>>>>> >>>>
>>>>>> >>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>> >>>> [ ] +0
>>>>>> >>>> [ ] -1: I don't think this is a good idea because …
>>>>>> >>>>
>>>>>> >>>> - Anton
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
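
To make the enforcement split under discussion concrete, here is a minimal sketch in plain Python. It is not the actual DSv2 API: the names `CheckConstraint`, `write_rows`, and the capability string are illustrative stand-ins modeled on the SPIP's description, where the engine validates CHECK constraints itself unless the connector declares ACCEPT_UNVALIDATED_CONSTRAINTS, in which case rows pass through and the external system is expected to enforce them.

```python
# Illustrative sketch only; names are hypothetical, not Spark's real API.

ACCEPT_UNVALIDATED_CONSTRAINTS = "ACCEPT_UNVALIDATED_CONSTRAINTS"


class CheckConstraint:
    """A CHECK constraint: a name plus a predicate over a row dict."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate


def write_rows(rows, constraints, table_capabilities):
    """Engine-side write path: validate CHECK constraints on every row,
    unless the connector declared that it enforces constraints itself."""
    if ACCEPT_UNVALIDATED_CONSTRAINTS in table_capabilities:
        # Connector (e.g. JDBC) enforces constraints externally;
        # the engine passes rows through unvalidated.
        return list(rows)
    for row in rows:
        for c in constraints:
            if not c.predicate(row):
                raise ValueError(
                    f"CHECK constraint '{c.name}' violated by row {row}"
                )
    return list(rows)


positive_id = CheckConstraint("id_positive", lambda r: r["id"] > 0)

# Engine enforces: the invalid row is rejected before the write.
try:
    write_rows([{"id": 1}, {"id": -5}], [positive_id], set())
except ValueError as e:
    print(e)

# Capability declared: the same row passes through for external enforcement.
print(write_rows([{"id": -5}], [positive_id], {ACCEPT_UNVALIDATED_CONSTRAINTS}))
```

Under this model the responsibility question in the thread becomes a per-connector choice: a table either accepts validated rows from the engine or opts out via the capability and enforces its own constraints.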