Hi Anuj,

I am not familiar in depth with data quality measurement methods or with
deequ <https://github.com/awslabs/deequ>.
What you describe sounds like monitoring a set of data metrics.
Other community members may know of a better solution.
Meanwhile, I would recommend implementing the checks and the failure
handling as separate operators and side outputs (for streaming) [1], if you
have not done so already.
You could then also use Flink metrics to aggregate and monitor the data [2].
Metrics systems usually let you define alerts on metrics, as in
Prometheus [3], [4].
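
As a rough illustration of the kind of check such an operator could hold,
here is a minimal sketch in plain Java with no Flink dependency. The class
FillRateTracker, its method names, and the rule that a blank string counts
as "not filled" are all assumptions made up for this example; they are not
part of Flink or deequ.

```java
// Hypothetical helper for illustration only; not a Flink or deequ class.
final class FillRateTracker {
    private final double threshold; // minimum acceptable fill rate, in [0, 1]
    private long total;             // values seen so far
    private long filled;            // values that were non-null and non-blank

    FillRateTracker(double threshold) {
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be in [0, 1]");
        }
        this.threshold = threshold;
    }

    // Records one column value and reports whether it counts as filled.
    // Assumption: null and whitespace-only strings are treated as missing.
    boolean observe(String value) {
        total++;
        boolean isFilled = value != null && !value.trim().isEmpty();
        if (isFilled) {
            filled++;
        }
        return isFilled;
    }

    // Fraction of observed values that were filled; 1.0 before any input.
    double fillRate() {
        return total == 0 ? 1.0 : (double) filled / total;
    }

    // True when the current fill rate is below the configured threshold.
    boolean isBreached() {
        return fillRate() < threshold;
    }
}
```

Inside a Flink job, an operator could call observe() for each record, send
the records that fail the check to a side output [1], and expose fillRate()
as a gauge through the operator's metric group [2], so that the alerting
itself lives in the metrics system [3], [4].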

Best,
Andrey

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/side_output.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
[4] https://prometheus.io/docs/alerting/overview/

On Sat, Jun 6, 2020 at 9:23 AM aj <ajainje...@gmail.com> wrote:

> Hello All,
>
> I want to do some data quality analysis on stream data, for example:
>
> 1. Fill rate in a particular column
> 2. How many events are going to the error queue because schema
> validation failed?
> 3. Various statistical measures of a column.
> 4. Alert if a particular threshold is breached (e.g. if the fill rate is
> less than 90% for a column)
>
> Is there any library on top of Flink for data quality? As far as I can
> see, there is a library on top of Spark:
> https://github.com/awslabs/deequ
>
> This provides everything I am looking for.
>
> --
> Thanks & Regards,
> Anuj Jain
>
>
>
> <http://www.cse.iitm.ac.in/%7Eanujjain/>
>
