+1
在 2025-03-26 14:45:09,"Chao Sun" <sunc...@apache.org> 写道: +1 On Tue, Mar 25, 2025 at 10:22 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote: I meant ... a data validation API would be great, but why in the DSv2? isn't data validation something more general? do we have to use DSv2 to have our data validated? El mié, 26 mar 2025, 6:15, Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> escribió: For me, data validation is one thing, and exporting that data to an external system is something entirely different. Should data validation be coupled with the external system? I don't think so. But since I'm the only one arguing against this proposal, does that mean I'm wrong? El mié, 26 mar 2025, 6:05, Wenchen Fan <cloud0...@gmail.com> escribió: +1 As Gengliang explained, the API allows the connectors to request Spark to perform data validations, but connectors can also choose to do validation by themselves. I think it's a reasonable design as not all connectors have the ability to do data validation by themselves, such as file formats that do not have a backend service. On Wed, Mar 26, 2025 at 12:56 AM Gengliang Wang <ltn...@gmail.com> wrote: Hi Ángel, Thanks for the feedback. Besides the existing NOT NULL constraint, the proposal suggests enforcing only check constraints by default in Spark, as they’re straightforward and practical to validate at the engine level. Additionally, the SPIP proposes allowing connectors (like JDBC) to handle constraint validation externally: Some connectors, like JDBC, may skip validation in Spark and simply pass the constraint through. These connectors must declare ACCEPT_UNVALIDATED_CONSTRAINTS in their table capabilities, indicating they would handle constraint enforcement themselves. This approach should help improve data accuracy and consistency by clearly defining responsibilities and enforcing constraints closer to where they’re best managed. On Sat, Mar 22, 2025 at 12:32 AM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote: One thing is enforcing the quality of the data Spark is producing, and another thing entirely is defining an external data model from Spark. The proposal doesn’t necessarily facilitate data accuracy and consistency. Defining constraints does help with that, but the question remains: Is Spark truly responsible for enforcing those constraints on an external system? El vie, 21 mar 2025 a las 21:29, Anton Okolnychyi (<aokolnyc...@gmail.com>) escribió: -1 (non-binding): Breaks the Chain of Responsibility. Constraints should be defined and enforced by the data sources themselves, not Spark. Spark is a processing engine, and enforcing constraints at this level blurs architectural boundaries, making Spark responsible for something it does not control. I disagree that this breaks the chain of responsibility. It may be quite the opposite, in fact. Spark is already responsible for enforcing NOT NULL constraints by adding AssertNotNull for required columns today. Connectors like Iceberg and Delta store constraint definitions but rely on engines like Spark to enforce them during INSERT, DELETE, UPDATE, and MERGE operations. Without this API, each connector would need to reimplement the same logic, creating duplication. The proposal is aligned with the SQL standard and other relational databases. In my view, it simply makes Spark a better engine, facilitates data accuracy and consistency, and enables performance optimizations. - Anton пт, 21 бер. 2025 р. 
On Fri, Mar 21, 2025 at 12:59 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:

-1 (non-binding): Breaks the Chain of Responsibility. Constraints should be defined and enforced by the data sources themselves, not Spark. Spark is a processing engine, and enforcing constraints at this level blurs architectural boundaries, making Spark responsible for something it does not control.

On Fri, Mar 21, 2025 at 8:18 PM L. C. Hsieh <vii...@gmail.com> wrote:

+1

On Fri, Mar 21, 2025 at 12:13 PM huaxin gao <huaxin.ga...@gmail.com> wrote:

+1

On Fri, Mar 21, 2025 at 12:08 PM Denny Lee <denny.g....@gmail.com> wrote:

+1 (non-binding)

On Fri, Mar 21, 2025 at 11:52 AM Gengliang Wang <ltn...@gmail.com> wrote:

+1

On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:

Hi all,

I would like to start a vote on adding support for constraints to DSv2.

Discussion thread: https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
SPIP: https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo
PR with the API changes: https://github.com/apache/spark/pull/50253
JIRA: https://issues.apache.org/jira/browse/SPARK-51207

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because …

- Anton