I would suggest Deequ. I have implemented it many times, and it is easy and effective.
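
To give a feel for what such checks look like before you pick a tool, here is a minimal plain-Python sketch of three typical rules: completeness, uniqueness, and non-negative values. It does not use the Deequ or GE APIs. The function names, column names, and sample rows are all made up for illustration. Deequ lets you declare similar constraints and has Spark evaluate them over the full dataset.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def is_complete(rows: List[Row], column: str) -> bool:
    """True when no row has a missing (None) value in the column."""
    return all(row.get(column) is not None for row in rows)


def is_unique(rows: List[Row], column: str) -> bool:
    """True when no value in the column appears more than once."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))


def is_non_negative(rows: List[Row], column: str) -> bool:
    """True when every non-missing value in the column is >= 0."""
    return all(
        row.get(column) is None or row[column] >= 0 for row in rows
    )


def run_checks(
    rows: List[Row], checks: Dict[str, Callable[[List[Row]], bool]]
) -> Dict[str, bool]:
    """Evaluate each named check and return a name -> passed report."""
    return {name: check(rows) for name, check in checks.items()}


# Hypothetical sample data with a duplicate id, a missing name,
# and a negative amount.
rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": None, "amount": 5.5},
    {"id": 2, "name": "c", "amount": -1.0},
]

checks = {
    "id is complete": lambda r: is_complete(r, "id"),
    "id is unique": lambda r: is_unique(r, "id"),
    "name is complete": lambda r: is_complete(r, "name"),
    "amount is non-negative": lambda r: is_non_negative(r, "amount"),
}

report = run_checks(rows, checks)
for name, passed in report.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

The point of the sketch is the shape: rules are declared separately from the data and produce a pass/fail report you can act on before the data moves downstream.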
Regards,
Vaquar khan

On Tue, Dec 27, 2022, 10:30 PM ayan guha <guha.a...@gmail.com> wrote:

> The way I would approach this is to evaluate GE, Deequ (there is a Python
> binding called pydeequ) and others like Delta Live Tables with expectations
> from a data quality feature perspective. All these tools have their pros and
> cons, and all of them are compatible with Spark as a compute engine.
>
> Also, you may want to look at dbt-based DQ toolsets if SQL is your thing.
>
> On Wed, 28 Dec 2022 at 3:14 pm, Sean Owen <sro...@gmail.com> wrote:
>
>> I think this is kind of mixed up. Data warehouses are simple SQL
>> creatures; Spark is (also) a distributed compute framework. Kind of like
>> comparing maybe a web server to Java.
>> Are you thinking of Spark SQL? Then I dunno, sure, you may well find it
>> more complicated, but it's also just a data warehousey SQL surface.
>>
>> But none of that relates to the question of data quality tools. You could
>> use GE with Redshift, or indeed with Spark - are you familiar with it? It's
>> probably one of the most common tools people use with Spark for this in
>> fact. It's just a Python lib at heart and you can apply it with Spark, but
>> _not_ with a data warehouse, so I'm not sure what you're getting at.
>>
>> Deequ is also commonly seen. It's actually built on Spark, so again,
>> confused about this "use Redshift or Snowflake not Spark".
>>
>> On Tue, Dec 27, 2022 at 9:55 PM Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> SPARK is just another querying engine with a lot of hype.
>>>
>>> I would highly suggest using Redshift (storage and compute decoupled
>>> mode) or Snowflake without all this super complicated understanding of
>>> containers/disk space, mind-numbing variables, rocket-science tuning,
>>> hair-splitting failure scenarios, etc. After that, try to choose solutions
>>> like Athena, or Trino/Presto, and then come to SPARK.
>>>
>>> Try out solutions like "great expectations" if you are looking for data
>>> quality, are not entirely sucked into the world of SPARK, and want to keep
>>> your options open.
>>>
>>> Don't get me wrong, SPARK used to be great in 2016-2017, but there are
>>> superb alternatives now, and the industry, in this recession, should focus
>>> on getting more value for every single dollar it spends.
>>>
>>> Best of luck.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Well, you need to qualify your statement on data quality. Are you
>>>> talking about data lineage here?
>>>>
>>>> HTH
>>>>
>>>> On Tue, 27 Dec 2022 at 19:25, rajat kumar <kumar.rajat20...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Folks
>>>>> Hoping you are doing well. I want to implement data quality checks to
>>>>> detect issues in data in advance. I have heard about a few frameworks
>>>>> like GE/Deequ.
>>>>> Can anyone please suggest which one is good and how I get started on it?
>>>>>
>>>>> Regards
>>>>> Rajat
>>>>>
>>>>
> --
> Best Regards,
> Ayan Guha