+1
As Gengliang explained, the API allows the connectors to request Spark to
perform data validations, but connectors can also choose to do validation
by themselves. I think it's a reasonable design as not all connectors have
the ability to do data validation by themselves, such as file formats th
Hi Angel,
This feature involves 3 parties:
- The end-user specifies constraints for their tables, via the SQL syntax
provided by Spark.
- Spark propagates the constraints to the backend connector of the tables,
and performs data validation during data writing if the connector asks
Spark to do so.
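
To make the division of labor above concrete, here is a minimal sketch in plain Python. It is not the DSv2 API: every name in it (CheckConstraint, ConnectorTable, engine_enforced, write) is hypothetical and only models the idea that the engine validates rows during a write when the connector asks it to, and otherwise leaves validation to the connector.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class CheckConstraint:
    """Hypothetical stand-in for a CHECK constraint declared by the end-user."""
    name: str
    predicate: Callable[[Row], bool]


@dataclass
class ConnectorTable:
    """Hypothetical connector-side table that stores constraints and rows."""
    constraints: List[CheckConstraint]
    # Assumption: the connector signals whether the engine should validate.
    engine_enforced: bool
    rows: List[Row] = field(default_factory=list)


def write(table: ConnectorTable, rows: List[Row]) -> None:
    """Model of the engine's write path.

    If the connector asked for engine-side enforcement, every row is checked
    against every constraint before anything is handed to the connector.
    """
    if table.engine_enforced:
        for row in rows:
            for constraint in table.constraints:
                if not constraint.predicate(row):
                    raise ValueError(
                        f"CHECK constraint {constraint.name} violated by row {row}"
                    )
    # Validation passed (or was skipped), so the connector receives the rows.
    table.rows.extend(rows)
```

With engine_enforced=False the engine passes rows through untouched, which corresponds to a connector that prefers to validate by itself.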
+1
On Fri, Mar 21, 2025 at 12:08 PM Denny Lee wrote:
> +1 (non-binding)
>
> On Fri, Mar 21, 2025 at 11:52 Gengliang Wang wrote:
>
>> +1
>>
>> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi
>> wrote:
>>
>>> Hi all,
>>>
>>> I would like to start a vote on adding support for constraints to DSv
+1 (non binding)
Agree with Anton, data sources like the open table formats define the
requirement, and definitely need engines to write to it accordingly.
Thanks,
Szehon
On Fri, Mar 21, 2025 at 1:31 PM Anton Okolnychyi
wrote:
> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
Casting my own +1 (non-binding).
Angel, I echo what Wenchen said. Connectors and Spark interact via DSv2,
therefore it requires changes in that layer. It is going to be optional but
will make a ton of sense for many connectors, especially in modern open
table formats that decouple table metadata f
+1
At 2025-03-26 14:45:09, "Chao Sun" wrote:
+1
On Tue, Mar 25, 2025 at 10:22 PM Ángel Álvarez Pascua
wrote:
I meant ... a data validation API would be great, but why in DSv2? Isn't
data validation something more general? Do we have to use DSv2 to have our data
validated?
On Wed, 26
+1
On Tue, Mar 25, 2025 at 10:22 PM Ángel Álvarez Pascua <
angel.alvarez.pas...@gmail.com> wrote:
> I meant ... a data validation API would be great, but why in DSv2?
> Isn't data validation something more general? Do we have to use DSv2 to
> have our data validated?
>
> On Wed, Mar 26, 2025,
I meant ... a data validation API would be great, but why in DSv2?
Isn't data validation something more general? Do we have to use DSv2 to
have our data validated?
On Wed, Mar 26, 2025, 6:15, Ángel Álvarez Pascua <
angel.alvarez.pas...@gmail.com> wrote:
> For me, data validation is one thi
For me, data validation is one thing, and exporting that data to an
external system is something entirely different. Should data validation be
coupled with the external system? I don't think so. But since I'm the only
one arguing against this proposal, does that mean I'm wrong?
On Wed, Mar 26, 2025
Hi Ángel,
Thanks for the feedback. Besides the existing NOT NULL constraint, the
proposal suggests enforcing only *check constraints* by default in Spark,
as they’re straightforward and practical to validate at the engine level.
Additionally, the SPIP proposes allowing connectors (like JDBC) to ha
+1
On Mon, 24 Mar 2025 at 09:57, Jungtaek Lim
wrote:
> +1 (non-binding)
>
> Thanks for initiating this!
>
> On Sun, Mar 23, 2025 at 3:45 AM serge rielau.com wrote:
>
>> +1 (non binding)
>>
>> On Mar 21, 2025, at 12:52 PM, Jules Damji wrote:
>>
>> +1 (non-binding)
>> —
>> Sent from my iPhone
>>
+1 (non-binding)
Thanks for initiating this!
On Sun, Mar 23, 2025 at 3:45 AM serge rielau.com wrote:
> +1 (non binding)
>
> On Mar 21, 2025, at 12:52 PM, Jules Damji wrote:
>
> +1 (non-binding)
> —
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Mar 21, 2025, at 11:47 AM, Anton O
+1 (non binding)
On Mar 21, 2025, at 12:52 PM, Jules Damji wrote:
+1 (non-binding)
—
Sent from my iPhone
Pardon the dumb thumb typos :)
On Mar 21, 2025, at 11:47 AM, Anton Okolnychyi wrote:
Hi all,
I would like to start a vote on adding support for constraints to DSv2.
Discussion thread:
+1 (non-binding)
Thanks for working on this Anton! Some links to other engines that also did
something similar:
HIVE-13076 - https://issues.apache.org/jira/browse/HIVE-13076
IMPALA-3531 - https://issues.apache.org/jira/browse/IMPALA-3531
In fact, Spark had a very old Jira
SPARK-19842 - https://
+1
On Sat, Mar 22, 2025 at 7:01 PM Peter Toth wrote:
> +1
>
> On Fri, Mar 21, 2025 at 10:24 PM Szehon Ho
> wrote:
>
>> +1 (non binding)
>>
>> Agree with Anton, data sources like the open table formats define the
>> requirement, and definitely need engines to write to it accordingly.
>>
>> Thank
+1
On Fri, Mar 21, 2025 at 10:24 PM Szehon Ho wrote:
> +1 (non binding)
>
> Agree with Anton, data sources like the open table formats define the
> requirement, and definitely need engines to write to it accordingly.
>
> Thanks,
> Szehon
>
> On Fri, Mar 21, 2025 at 1:31 PM Anton Okolnychyi
> wr
+1
Sent from my iPhone
On Mar 21, 2025, at 2:25 PM, Szehon Ho wrote:
+1 (non binding)
Agree with Anton, data sources like the open table formats define the
requirement, and definitely need engines to write to it accordingly.
Thanks,
Szehon
On Fri, Mar 21, 2025 at 1:31 PM Anton Okolnychyi
One thing is enforcing the quality of the data Spark is producing, and
another thing entirely is defining an external data model from Spark.
The proposal doesn’t necessarily facilitate data accuracy and consistency.
Defining constraints does help with that, but the question remains: Is
Spark trul
+1
On Mar 21, 2025, at 12:15, huaxin gao wrote:
+1
On Fri, Mar 21, 2025 at 12:08 PM Denny Lee wrote:
+1 (non-binding)
On Fri, Mar 21, 2025 at 11:52 Gengliang Wang wrote:
+1
On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi wrote:
Hi all,
>
> -1 (non-binding): Breaks the Chain of Responsibility. Constraints should
> be defined and enforced by the data sources themselves, not Spark. Spark is
> a processing engine, and enforcing constraints at this level blurs
> architectural boundaries, making Spark responsible for something it does
+1 (non-binding)
On Fri, Mar 21, 2025 at 11:52 Gengliang Wang wrote:
> +1
>
> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi
> wrote:
>
>> Hi all,
>>
>> I would like to start a vote on adding support for constraints to DSv2.
>>
>> *Discussion thread: *
>> https://lists.apache.org/thread/njqj
-1 (non-binding): Breaks the Chain of Responsibility. Constraints should be
defined and enforced by the data sources themselves, not Spark. Spark is a
processing engine, and enforcing constraints at this level blurs
architectural boundaries, making Spark responsible for something it does
not contro
+1 (non-binding)
—
Sent from my iPhone
Pardon the dumb thumb typos :)
> On Mar 21, 2025, at 11:47 AM, Anton Okolnychyi wrote:
>
>
> Hi all,
>
> I would like to start a vote on adding support for constraints to DSv2.
>
> Discussion thread:
> https://lists.apache.org/thread/njqjcryq0lot9rkbf
+1
On Fri, Mar 21, 2025 at 12:13 PM huaxin gao wrote:
>
> +1
>
> On Fri, Mar 21, 2025 at 12:08 PM Denny Lee wrote:
>>
>> +1 (non-binding)
>>
>> On Fri, Mar 21, 2025 at 11:52 Gengliang Wang wrote:
>>>
>>> +1
>>>
>>> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi
>>> wrote:
Hi all,
+1
On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi
wrote:
> Hi all,
>
> I would like to start a vote on adding support for constraints to DSv2.
>
> *Discussion thread: *
> https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
> *SPIP:*
> https://docs.google.com/document/d/1EHjB4W1Lj
Hi all,
I would like to start a vote on adding support for constraints to DSv2.
*Discussion thread: *
https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
*SPIP:*
https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo
*PR with the API changes:* https://github.