Rajat,

You might want to read about Data Sentinel, a data validation tool built on Spark at LinkedIn:
https://engineering.linkedin.com/blog/2020/data-sentinel-automating-data-validation

The project is not open source, but the blog post might give you insights into how
such a system could be built.

Thanks,
Walaa.

On Tue, Dec 27, 2022 at 8:13 PM Sean Owen <sro...@gmail.com> wrote:

> I think this is kind of mixed up. Data warehouses are simple SQL
> creatures; Spark is (also) a distributed compute framework. It's a bit
> like comparing a web server to Java.
> Are you thinking of Spark SQL? Then sure, you may well find it more
> complicated, but it's also just a data-warehouse-style SQL surface.
>
> But none of that relates to the question of data quality tools. You could
> use GE with Redshift, or indeed with Spark - are you familiar with it? It's
> probably one of the most common tools people use with Spark for this, in
> fact. It's just a Python lib at heart and you can apply it with Spark, but
> _not_ with a data warehouse, so I'm not sure what you're getting at.
> [A minimal sketch of applying GE to a Spark DataFrame is appended after
> the thread below.]
>
> Deequ is also commonly seen. It's actually built on Spark, so again, I'm
> confused about this "use Redshift or Snowflake, not Spark".
> [A PyDeequ sketch is appended after the thread as well.]
>
> On Tue, Dec 27, 2022 at 9:55 PM Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Spark is just another querying engine with a lot of hype.
>>
>> I would highly suggest using Redshift (in its decoupled storage and
>> compute mode) or Snowflake, without all this super complicated
>> understanding of containers/disk space, mind-numbing variables,
>> rocket-science tuning, hair-splitting failure scenarios, etc. After
>> that, try solutions like Athena or Trino/Presto, and then come to Spark.
>>
>> Try out solutions like "Great Expectations" if you are looking for data
>> quality, are not entirely sucked into the world of Spark, and want to
>> keep your options open.
>>
>> Don't get me wrong, Spark used to be great in 2016-2017, but there are
>> superb alternatives now, and the industry, in this recession, should
>> focus on getting more value for every single dollar it spends.
>>
>> Best of luck.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Well, you need to qualify your statement on data quality. Are you
>>> talking about data lineage here?
>>>
>>> HTH
>>>
>>> View my LinkedIn profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary
>>> damages arising from such loss, damage or destruction.
>>>
>>> On Tue, 27 Dec 2022 at 19:25, rajat kumar <kumar.rajat20...@gmail.com>
>>> wrote:
>>>
>>>> Hi folks,
>>>> Hope you are doing well. I want to implement data quality checks to
>>>> detect issues in the data in advance. I have heard about a few
>>>> frameworks like GE/Deequ. Can anyone please suggest which one is good
>>>> and how I can get started with it?
>>>>
>>>> Regards,
>>>> Rajat
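
For anyone following the thread who wants to try Great Expectations against Spark,
here is a minimal sketch, assuming the older SparkDFDataset API that GE shipped
around the time of this thread (later versions reworked the datasource API); the
parquet path and the column names user_id and amount are just placeholders:

from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

# Load the data to validate (placeholder path).
df = spark.read.parquet("s3://my-bucket/events/")

# Wrap the Spark DataFrame so the expectations execute as Spark jobs.
ge_df = SparkDFDataset(df)

# Declare expectations; each call records a rule and returns its own result.
ge_df.expect_column_values_to_not_be_null("user_id")
ge_df.expect_column_values_to_be_between("amount", min_value=0, max_value=1000000)

# Validate everything declared above and check the overall outcome.
results = ge_df.validate()
print(results.success)

Once you move beyond a quick notebook test, the same expectations can be managed
through GE's data-context and checkpoint tooling.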
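
And a similar sketch with Deequ via its Python wrapper, PyDeequ - again only a
sketch, assuming the deequ JAR is pulled in through spark.jars.packages and that
the SPARK_VERSION environment variable is set as PyDeequ expects; event_id,
user_id and amount are hypothetical columns:

import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite
from pyspark.sql import SparkSession

# Pull the matching deequ artifact onto the Spark classpath.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.read.parquet("s3://my-bucket/events/")  # placeholder path

# A set of constraints at Error severity; any failed constraint fails the check.
check = (Check(spark, CheckLevel.Error, "basic data quality checks")
         .isComplete("user_id")       # no nulls in user_id
         .isUnique("event_id")        # event_id has no duplicates
         .isNonNegative("amount"))    # amount is never negative

# Run the verification and inspect the per-constraint results.
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)

Deequ itself is a Scala/JVM library, so the same checks can be written natively
in Scala if that fits your stack better.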