Thanks for the input, folks. Hi Vaquar,
I saw that we have various types of checks in GE and Deequ. Could you please suggest what types of checks you used for metric-based columns?

Regards,
Rajat

On Wed, Dec 28, 2022 at 12:15 PM vaquar khan <vaquar.k...@gmail.com> wrote:

> I would suggest Deequ; I have implemented it many times, and it is easy and effective.
>
> Regards,
> Vaquar khan
>
> On Tue, Dec 27, 2022, 10:30 PM ayan guha <guha.a...@gmail.com> wrote:
>
>> The way I would approach it is to evaluate GE, Deequ (there is a Python binding called pydeequ) and others like Delta Live Tables with expectations from a data quality feature perspective. All these tools have their pros and cons, and all of them are compatible with Spark as a compute engine.
>>
>> Also, you may want to look at dbt-based DQ toolsets if SQL is your thing.
>>
>> On Wed, 28 Dec 2022 at 3:14 pm, Sean Owen <sro...@gmail.com> wrote:
>>
>>> I think this is kind of mixed up. Data warehouses are simple SQL creatures; Spark is (also) a distributed compute framework. Kind of like comparing maybe a web server to Java.
>>> Are you thinking of Spark SQL? Then, sure, you may well find it more complicated, but it's also just a data-warehousey SQL surface.
>>>
>>> But none of that relates to the question of data quality tools. You could use GE with Redshift, or indeed with Spark - are you familiar with it? It's probably one of the most common tools people use with Spark for this, in fact. It's just a Python lib at heart and you can apply it with Spark, but _not_ with a data warehouse, so I'm not sure what you're getting at.
>>>
>>> Deequ is also commonly seen. It's actually built on Spark, so again, I'm confused about this "use Redshift or Snowflake, not Spark".
>>>
>>> On Tue, Dec 27, 2022 at 9:55 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Spark is just another querying engine with a lot of hype.
>>>>
>>>> I would highly suggest using Redshift (storage and compute decoupled mode) or Snowflake, without all this super-complicated understanding of containers/disk space, mind-numbing variables, rocket-science tuning, hair-splitting failure scenarios, etc. After that, try to choose solutions like Athena, or Trino/Presto, and then come to Spark.
>>>>
>>>> Try out solutions like "great expectations" if you are looking for data quality, are not entirely sucked into the world of Spark, and want to keep your options open.
>>>>
>>>> Don't get me wrong, Spark used to be great in 2016-2017, but there are superb alternatives now and the industry, in this recession, should focus on getting more value for every single dollar it spends.
>>>>
>>>> Best of luck.
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Well, you need to qualify your statement on data quality. Are you talking about data lineage here?
>>>>>
>>>>> HTH
>>>>>
>>>>> View my LinkedIn profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>
>>>>> On Tue, 27 Dec 2022 at 19:25, rajat kumar <kumar.rajat20...@gmail.com> wrote:
>>>>>
>>>>>> Hi Folks,
>>>>>>
>>>>>> Hoping you are doing well. I want to implement data quality checks to detect issues in data in advance. I have heard about a few frameworks like GE/Deequ. Can anyone please suggest which one is good, and how do I get started with it?
>>>>>>
>>>>>> Regards,
>>>>>> Rajat
>>>>>
>> --
>> Best Regards,
>> Ayan Guha
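For anyone looking for a concrete starting point on the metric-based column checks discussed above, here is a minimal sketch using the PyDeequ binding Ayan mentioned. The column names (order_id, amount, status), the example data, and the Spark/Deequ versions are assumptions for illustration only; adjust them to your own tables and cluster.

import os

# Newer PyDeequ releases use SPARK_VERSION to pick a matching Deequ jar
# (assumed value; set it to your actual Spark version).
os.environ.setdefault("SPARK_VERSION", "3.3")

from pyspark.sql import SparkSession, Row

import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# The Deequ jar must be on the classpath; PyDeequ exposes the Maven coordinate.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Hypothetical example data -- replace with your own DataFrame.
df = spark.createDataFrame([
    Row(order_id=1, amount=120.5, status="SHIPPED"),
    Row(order_id=2, amount=75.0, status="PENDING"),
    Row(order_id=3, amount=None, status="SHIPPED"),
])

# Metric-based column checks: completeness, uniqueness, sign, allowed values.
check = (Check(spark, CheckLevel.Error, "order checks")
         .isComplete("order_id")                          # no nulls allowed
         .isUnique("order_id")                            # primary-key style check
         .hasCompleteness("amount", lambda c: c >= 0.9)   # at most 10% nulls
         .isNonNegative("amount")
         .isContainedIn("status", ["PENDING", "SHIPPED", "CANCELLED"]))

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())

# One row per constraint, with its status and a message for failures.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)

Deequ also ships a metrics repository and anomaly detection built on the same column metrics, which may be worth a look once basic checks like these are in place.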
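Great Expectations can be pointed at the same Spark DataFrame, as Sean notes. The GE API has changed considerably between releases; the sketch below uses the older SparkDFDataset wrapper and the same hypothetical columns, so treat it as illustrative rather than as the current API surface.

from pyspark.sql import SparkSession, Row
from great_expectations.dataset import SparkDFDataset  # legacy API; newer GE versions differ

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data -- replace with your own DataFrame.
df = spark.createDataFrame([
    Row(order_id=1, amount=120.5, status="SHIPPED"),
    Row(order_id=2, amount=-3.0, status="UNKNOWN"),
])

# Wrap the DataFrame so expectations are evaluated by Spark, not pandas.
ge_df = SparkDFDataset(df)
ge_df.expect_column_values_to_not_be_null("order_id")
ge_df.expect_column_values_to_be_between("amount", min_value=0, max_value=10000)
ge_df.expect_column_values_to_be_in_set("status", ["PENDING", "SHIPPED", "CANCELLED"])

results = ge_df.validate()
print(results.success)                 # overall pass/fail
for r in results.results:              # per-expectation detail
    print(r.expectation_config.expectation_type, r.success)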
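Delta Live Tables expectations, which Ayan also mentioned, express similar rules declaratively on a pipeline table. This sketch assumes a Databricks DLT pipeline (the dlt module and the global spark session only exist inside that runtime) and a hypothetical source table named raw_orders.

import dlt  # available only inside a Databricks Delta Live Tables pipeline


@dlt.table(comment="Orders with basic data quality expectations")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop violating rows
@dlt.expect("non_negative_amount", "amount >= 0")              # record violations, keep rows
def orders_clean():
    # 'raw_orders' is a hypothetical source table name.
    return spark.read.table("raw_orders")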