Re: Data Contracts

2023-07-16 Thread Phillip Henry
No worries. Have you had a chance to look at it? Since this thread has gone dead, I assume there is no appetite for adding data contract functionality? Regards, Phillip On Mon, 19 Jun 2023, 11:23 Deepak Sharma, wrote: > Sorry for using simple in my last email. > It’s not going to be simpl

Re: Data Contracts

2023-06-19 Thread Deepak Sharma
Sorry for using "simple" in my last email. It’s not going to be simple in any terms. Thanks for sharing the Git repo, Phillip. Will definitely go through it. Thanks Deepak On Mon, 19 Jun 2023 at 3:47 PM, Phillip Henry wrote: > I think it might be a bit more complicated than this (but happy to be > p

Re: Data Contracts

2023-06-19 Thread Phillip Henry
I think it might be a bit more complicated than this (but happy to be proved wrong). I have a minimum working example at: https://github.com/PhillHenry/SparkConstraints.git that runs out-of-the-box (mvn test) and demonstrates what I am trying to achieve. A test persists a DataFrame that conform

Re: Data Contracts

2023-06-19 Thread Deepak Sharma
It can be as simple as adding a function to the Spark session builder, specifically on the read, which can take the YAML file (the data contracts would be defined in YAML) and apply it to the DataFrame. It can ignore the rows not matching the data contracts defined in the YAML. Thanks Deepak On M
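
[Editor's sketch] The idea in the message above, keeping only the rows that match a contract, can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not Spark API and not the poster's design: the names CONTRACT, conforms and apply_contract are hypothetical, a dict of predicates stands in for the YAML file, and a list of dicts stands in for the DataFrame.

```python
from datetime import date

# Hypothetical contract: column name -> predicate a valid value must satisfy.
# In the proposal this would be loaded from a YAML file; a dict stands in here.
CONTRACT = {
    "id": lambda v: isinstance(v, int) and v > 0,
    "start": lambda v: isinstance(v, date),
}


def conforms(row, contract):
    """Return True if every contracted column is present and passes its predicate."""
    return all(col in row and check(row[col]) for col, check in contract.items())


def apply_contract(rows, contract):
    """Keep only the rows that satisfy the contract; non-matching rows are dropped."""
    return [row for row in rows if conforms(row, contract)]


rows = [
    {"id": 1, "start": date(2023, 6, 19)},
    {"id": -5, "start": date(2023, 6, 19)},  # violates id > 0
    {"id": 2, "start": "2023-06-19"},        # start is a string, not a date
]
valid = apply_contract(rows, CONTRACT)
```

Silently dropping rows is only one possible policy. The reply further up the thread suggests that real constraints may need more than a per-row filter.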

Re: Data Contracts

2023-06-19 Thread Phillip Henry
For my part, I'm not too concerned about the mechanism used to implement the validation as long as it's rich enough to express the constraints. I took a look at JSON Schemas (for which there are a number of JVM implementations) but I don't think it can handle more complex data types like dates. Ma

Re: Data Contracts

2023-06-17 Thread Mich Talebzadeh
It would be interesting if we think about creating a contract validation library written in JSON format. This would ensure a validation mechanism that will rely on this library and could be shared among relevant parties. Will that be a starting point? HTH Mich Talebzadeh, Lead Solutions Architect

Re: Data Contracts

2023-06-14 Thread Jean-Georges Perrin
Hi, While I was at PayPal, we open-sourced a Data Contract template; it is here: https://github.com/paypal/data-contract-template. Companies like GX (Great Expectations) are interested in using it. Spark could read some elements from it pretty easily, like schema validation, some rules vali

Re: Data Contracts

2023-06-13 Thread Mich Talebzadeh
From my limited understanding of data contracts, there are two factors that seem necessary: 1. procedural matters 2. technical matters. I mean this is nothing new. Some tools like Cloud Data Fusion can assist when the procedures are validated. Simply "The process of integrating multiple data

Re: Data Contracts

2023-06-13 Thread Phillip Henry
Hi, Fokko and Deepak. The problem with DBT and Great Expectations (and Soda too, I believe) is that by the time they find the problem, the error is already in production - and fixing production can be a nightmare. What's more, we've found that nobody ever looks at the data quality reports we alre

Re: Data Contracts

2023-06-13 Thread Fokko Driesprong
Hey Phillip, Thanks for raising this. I like the idea. The question is, should this be implemented in Spark or some other framework? I know that dbt has a fairly extensive way of testing your data, and making sure that you can enforce assumptions on t

Re: Data Contracts

2023-06-12 Thread Deepak Sharma
Spark can be used with tools like Great Expectations as well to implement the data contracts. I am not sure, though, if Spark alone can do the data contracts. I was reading a blog on data mesh and how to glue it together with data contracts; that’s where I came across this spark and great expectat

Re: Data Contracts

2023-06-12 Thread Elliot West
Hi Phillip, While not as fine-grained as your example, there do exist schema systems such as that in Avro that can evaluate compatible and incompatible changes to the schema, from the perspective of the reader, writer, or both. This provides some potential degree of enforcement, and means to c

Re: Data Contracts

2023-06-12 Thread Ryan Blue
Hey Phillip, You're right that we can improve tooling to help with data contracts, but I think that a contract still needs to be an agreement between people. Constraints help by ensuring a data producer adheres to the contract and by giving feedback as soon as possible when assumptions are vi