Re: Profiling data quality with Spark

2022-12-27 Thread Gourav Sengupta
Hi Sean, the entire narrative of SPARK being a unified analytics tool falls flat: what should have been an engine on SPARK has been deliberately floated off as a separate company called Ray, and the whole unified narrative rings hollow. SPARK is nothing more than a SQL engine as per SPARK's own c…

Re: Profiling data quality with Spark

2022-12-27 Thread vaquar khan
I would suggest Deequ; I have implemented it many times, and it is easy and effective. Regards, Vaquar Khan. On Tue, Dec 27, 2022, 10:30 PM ayan guha wrote: > The way I would approach is to evaluate GE, Deequ (there is a python > binding called pydeequ) and others like Delta Live tables with expectations > f…
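[Editor's note] Deequ works by computing column-level metrics over a dataset and asserting declarative constraints on them. As a rough, pure-Python illustration of two of the metrics Deequ evaluates, completeness and uniqueness (the helper names are mine, not Deequ's API):

```python
from collections import Counter

def completeness(rows, column):
    """Fraction of rows with a non-null value in `column`
    (the idea behind Deequ's Completeness metric)."""
    if not rows:
        return 0.0
    non_null = sum(1 for r in rows if r.get(column) is not None)
    return non_null / len(rows)

def uniqueness(rows, column):
    """Fraction of rows whose value in `column` occurs exactly once
    (the idea behind Deequ's Uniqueness metric)."""
    if not rows:
        return 0.0
    counts = Counter(r.get(column) for r in rows)
    return sum(c for c in counts.values() if c == 1) / len(rows)

rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@x.com"},
]
print(completeness(rows, "email"))  # 2 of 3 rows have a non-null email
print(uniqueness(rows, "id"))       # every id occurs exactly once -> 1.0
```

In pydeequ itself, the equivalent checks are declared on a Spark DataFrame with a `Check` (e.g. `isComplete`, `isUnique`) and executed with a `VerificationSuite`, so the metric computation runs distributed on Spark rather than in plain Python.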

Re: Profiling data quality with Spark

2022-12-27 Thread ayan guha
The way I would approach it is to evaluate GE, Deequ (there is a Python binding called pydeequ) and others like Delta Live Tables with expectations, from a data-quality feature perspective. All these tools have their pros and cons, and all of them are compatible with Spark as a compute engine. Also, you…

Re: Profiling data quality with Spark

2022-12-27 Thread Walaa Eldin Moustafa
Rajat, you might want to read about Data Sentinel, a data validation tool on Spark developed at LinkedIn: https://engineering.linkedin.com/blog/2020/data-sentinel-automating-data-validation The project is not open source, but the blog post might give you insights into how such a system…

Re: Profiling data quality with Spark

2022-12-27 Thread Sean Owen
I think this is kind of mixed up. Data warehouses are simple SQL creatures; Spark is (also) a distributed compute framework. Kind of like comparing, maybe, a web server to Java. Are you thinking of Spark SQL? Then, I dunno, sure, you may well find it more complicated, but it's also just a data warehouse…

Re: Profiling data quality with Spark

2022-12-27 Thread Gourav Sengupta
Hi, SPARK is just another querying engine with a lot of hype. I would highly suggest using Redshift (storage and compute decoupled mode) or Snowflake, without all this super complicated understanding of containers/disk space, mind-numbing variables, rocket-science tuning, hair-splitting failure s…

Re: Profiling data quality with Spark

2022-12-27 Thread Mich Talebzadeh
Well, you need to qualify your statement on data quality. Are you talking about data lineage here? HTH

Profiling data quality with Spark

2022-12-27 Thread rajat kumar
Hi folks, hoping you are doing well. I want to implement data-quality checks to detect issues in data in advance. I have heard about a few frameworks like GE/Deequ. Can anyone please suggest which one is good, and how do I get started with it? Regards, Rajat

[Spark Core] [Advanced] [How-to] How to map any external field to job ids spawned by Spark.

2022-12-27 Thread Dhruv Toshniwal
TL;DR: how to map external request ids to Spark job ids for Spark instrumentation. Hi team, we are the engineering team of Mindtickle Inc. and we have a use case where we want…
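[Editor's note] Spark's built-in hook for this is `SparkContext.setJobGroup`, which tags every job submitted from the calling thread with a caller-chosen id that is visible in the Spark UI and in listener events. A minimal sketch, assuming a live SparkSession; the wrapper name and `req-` id format are mine, while `setJobGroup`, `setLocalProperty`, and the `spark.jobGroup.id` property key are Spark's:

```python
def run_tagged(spark, request_id, work):
    """Run `work` (a zero-arg callable that triggers Spark actions) with all
    resulting jobs tagged by an external request id.

    The group id is attached to every job submitted from this thread and can
    be read back in a SparkListener from the job's 'spark.jobGroup.id'
    property, giving the request-id -> job-id mapping.
    """
    sc = spark.sparkContext
    group_id = f"req-{request_id}"  # hypothetical id format
    sc.setJobGroup(group_id, f"work for request {request_id}",
                   interruptOnCancel=True)
    try:
        return work()
    finally:
        # Clear the thread-local tag so later jobs are not mislabelled.
        sc.setLocalProperty("spark.jobGroup.id", None)
```

On the listener side, a `SparkListenerJobStart` event carries the same property, so a listener can emit (request id, job id) pairs to an instrumentation store.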

Re: spark-submit fails in kubernetes 1.24.x cluster

2022-12-27 Thread Saurabh Gulati
Hello Thimme, your issue is related to the Kubernetes Deprecated API Migration Guide: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#ingress-v122
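[Editor's note] The anchor in that link refers to the removal of the `networking.k8s.io/v1beta1` Ingress API in Kubernetes 1.22, so clients that still request it (older Spark builds with older bundled Kubernetes clients included) fail against a 1.24 cluster. The shape of the migration, per the guide (the resource names below are illustrative):

```yaml
# Before: apiVersion: networking.k8s.io/v1beta1 (removed in Kubernetes 1.22)
apiVersion: networking.k8s.io/v1  # the replacement API
kind: Ingress
metadata:
  name: spark-ui                  # illustrative name
spec:
  rules:
    - host: spark.example.com
      http:
        paths:
          - path: /
            pathType: Prefix      # required in v1
            backend:
              service:            # v1 nests the backend under `service`
                name: spark-ui-svc
                port:
                  number: 4040
```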