Hi Anuj,

I am not familiar in depth with data quality measurement methods or with deequ <https://github.com/awslabs/deequ>. What you describe looks like monitoring some data metrics. Maybe other community users are aware of a better solution.

Meanwhile, I would recommend implementing the checks and the failure handling as separate operators with side outputs (for streaming) [1], if you have not done so yet. Then you could also use Flink metrics to aggregate and monitor the data [2]. Metrics systems usually allow you to define alerts on metrics, as in Prometheus [3], [4].
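To make the suggestion a bit more concrete, here is a minimal sketch of the idea in plain Python. It does not use the Flink API: the `invalid` list only stands in for a side output [1], and the counters stand in for Flink metrics [2]. The names `CheckResult`, `check_records`, `breaches_threshold`, the `user_id` column, the sample events and the 90% threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional, Sequence

Record = Dict[str, Any]


@dataclass
class CheckResult:
    valid: List[Record] = field(default_factory=list)    # main output
    invalid: List[Record] = field(default_factory=list)  # stands in for a side output
    total: int = 0   # number of valid records seen
    filled: int = 0  # valid records where the checked column is non-empty

    def fill_rate(self) -> Optional[float]:
        """Share of valid records with a non-empty column, or None if no records."""
        return self.filled / self.total if self.total else None


def check_records(records: Iterable[Record],
                  required: Sequence[str],
                  column: str) -> CheckResult:
    """Route records missing a required key to `invalid`; count fill for `column`."""
    result = CheckResult()
    for record in records:
        if not all(key in record for key in required):
            result.invalid.append(record)  # failed the (hypothetical) schema check
            continue
        result.valid.append(record)
        result.total += 1
        if record.get(column) not in (None, ""):
            result.filled += 1
    return result


def breaches_threshold(rate: Optional[float], threshold: float = 0.9) -> bool:
    """True if a fill rate is known and lies below the threshold."""
    return rate is not None and rate < threshold


# Hypothetical sample events: one fails validation, one has an empty user_id.
sample_events = [
    {"event_id": 1, "user_id": "a"},
    {"event_id": 2, "user_id": None},
    {"event_id": 3},
    {"event_id": 4, "user_id": "b"},
]
result = check_records(sample_events, required=("event_id", "user_id"), column="user_id")
alert = breaches_threshold(result.fill_rate())
```

In a real streaming job the counts would be updated per record and exposed as metrics, so the threshold rule itself could probably live in the monitoring system rather than in the job, for example as a Prometheus alert [4].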
Best,
Andrey

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/side_output.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
[4] https://prometheus.io/docs/alerting/overview/

On Sat, Jun 6, 2020 at 9:23 AM aj <ajainje...@gmail.com> wrote:

> Hello All,
>
> I want to do some data quality analysis on stream data example.
>
> 1. Fill rate in a particular column
> 2. How many events are going to error queue due to favor schema
> validation failed?
> 3. Different statistics measure of a column.
> 3. Alert if a particular threshold is breached (like if fill rate is less
> than 90% for a column)
>
> Is there any library that exists on top of Flink for data quality. As I am
> looking there is a library on top of the spark
> https://github.com/awslabs/deequ
>
> This proved all that I am looking for.
>
> --
> Thanks & Regards,
> Anuj Jain
>
> <http://www.cse.iitm.ac.in/%7Eanujjain/>