Re: Profiling data quality with Spark

vaquar khan Tue, 27 Dec 2022 22:45:43 -0800

I would suggest Deequ. I have implemented it many times; it is easy and effective.
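
For anyone new to these tools, here is a rough, library-free sketch of the kind of constraint a data quality framework evaluates: a metric is computed over a column and then compared against an expectation. It does not use Deequ or GE. The helper names (`completeness`, `uniqueness`, `run_checks`), the sample `orders` rows, and the simplified metric definitions are all assumptions made for illustration, not any library's exact semantics.

```python
from collections import Counter


def completeness(rows, column):
    """Fraction of rows whose value for `column` is not None."""
    if not rows:
        return 1.0
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


def uniqueness(rows, column):
    """Fraction of non-null values in `column` that occur exactly once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return 1.0
    counts = Counter(values)
    return sum(1 for value in values if counts[value] == 1) / len(values)


def run_checks(rows, checks):
    """Evaluate (name, metric_fn, column, predicate) tuples into a report."""
    report = {}
    for name, metric_fn, column, predicate in checks:
        value = metric_fn(rows, column)
        report[name] = {"value": value, "passed": predicate(value)}
    return report


# Hypothetical sample data: one duplicated order_id, one missing customer.
orders = [
    {"order_id": 1, "customer": "a"},
    {"order_id": 2, "customer": None},
    {"order_id": 2, "customer": "c"},
    {"order_id": 4, "customer": "d"},
]

checks = [
    ("order_id is complete", completeness, "order_id", lambda v: v == 1.0),
    ("order_id is unique", uniqueness, "order_id", lambda v: v == 1.0),
    ("customer >= 70% complete", completeness, "customer", lambda v: v >= 0.7),
]

report = run_checks(orders, checks)
for name, result in report.items():
    status = "PASS" if result["passed"] else "FAIL"
    print(f"{status}  {name}  (value={result['value']:.2f})")
```

In Deequ or GE you would declare such constraints against a Spark DataFrame, and the library would compute the metrics for you on the cluster; the sketch only shows the shape of the idea on an in-memory list.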


Regards,
Vaquar khan

On Tue, Dec 27, 2022, 10:30 PM ayan guha <guha.a...@gmail.com> wrote:

> The way I would approach this is to evaluate GE, Deequ (there is a Python
> binding called pydeequ), and others like Delta Live Tables with expectations
> from a data quality feature perspective. All these tools have their pros and
> cons, and all of them are compatible with Spark as a compute engine.
>
> Also, you may want to look at dbt-based DQ toolsets if SQL is your thing.
>
> On Wed, 28 Dec 2022 at 3:14 pm, Sean Owen <sro...@gmail.com> wrote:
>
>> I think this is kind of mixed up. Data warehouses are simple SQL
>> creatures; Spark is (also) a distributed compute framework. Kind of like
>> comparing maybe a web server to Java.
>> Are you thinking of Spark SQL? Then, I dunno, sure, you may well find it
>> more complicated, but it's also just a data warehousey SQL surface.
>>
>> But none of that relates to the question of data quality tools. You could
>> use GE with Redshift, or indeed with Spark - are you familiar with it? It's
>> probably one of the most common tools people use with Spark for this in
>> fact. It's just a Python lib at heart and you can apply it with Spark, but
>> _not_ with a data warehouse, so I'm not sure what you're getting at.
>>
>> Deequ is also commonly seen. It's actually built on Spark, so again,
>> confused about this "use Redshift or Snowflake not Spark".
>>
>> On Tue, Dec 27, 2022 at 9:55 PM Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> SPARK is just another querying engine with a lot of hype.
>>>
>>> I would highly suggest using Redshift (storage and compute decoupled
>>> mode) or Snowflake without all this super complicated understanding of
>>> containers/ disk-space, mind numbing variables, rocket science tuning, hair
>>> splitting failure scenarios, etc. After that try to choose solutions like
>>> Athena, or Trino/ Presto, and then come to SPARK.
>>>
>>> Try out solutions like "great expectations" if you are looking for data
>>> quality, are not entirely sucked into the world of SPARK, and want to keep
>>> your options open.
>>>
>>> Don't get me wrong, SPARK used to be great in 2016-2017, but there are
>>> superb alternatives now and the industry, in this recession, should focus
>>> on getting more value for every single dollar they spend.
>>>
>>> Best of luck.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Well, you need to qualify your statement on data quality. Are you
>>>> talking about data lineage here?
>>>>
>>>> HTH
>>>>
>>>> On Tue, 27 Dec 2022 at 19:25, rajat kumar <kumar.rajat20...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Folks
>>>>> I hope you are doing well. I want to implement data quality checks to
>>>>> detect issues in data in advance. I have heard about a few frameworks
>>>>> like GE/Deequ.
>>>>> Can anyone please suggest which one is good and how I can get started with it?
>>>>>
>>>>> Regards
>>>>> Rajat
>>>>>
>>>> --
> Best Regards,
> Ayan Guha
>
