Thanks, Andrey, I will check it out.

On Mon, Jun 8, 2020 at 8:10 PM Andrey Zagrebin <azagre...@apache.org> wrote:
> Hi Anuj,
>
> I am not familiar in depth with data quality measurement methods or with
> deequ <https://github.com/awslabs/deequ>.
> What you describe looks like monitoring some data metrics.
> Maybe other community users are aware of a better solution.
> Meanwhile, I would recommend implementing the checks and failures as
> separate operators and side outputs (for streaming) [1], if you have not
> done so yet.
> Then you could also use Flink metrics to aggregate and monitor the data [2].
> Metrics systems usually allow you to define alerts on metrics, as in
> Prometheus [3], [4].
>
> Best,
> Andrey
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/side_output.html
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html
> [3] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
> [4] https://prometheus.io/docs/alerting/overview/
>
> On Sat, Jun 6, 2020 at 9:23 AM aj <ajainje...@gmail.com> wrote:
>
>> Hello All,
>>
>> I want to do some data quality analysis on stream data, for example:
>>
>> 1. Fill rate in a particular column
>> 2. How many events are going to the error queue due to failed schema
>> validation?
>> 3. Different statistical measures of a column
>> 4. Alert if a particular threshold is breached (for example, if the fill
>> rate is less than 90% for a column)
>>
>> Is there any library on top of Flink for data quality? I am asking
>> because there is such a library on top of Spark:
>> https://github.com/awslabs/deequ
>>
>> It provides all that I am looking for.
>>
>> --
>> Thanks & Regards,
>> Anuj Jain
>>
>> <http://www.cse.iitm.ac.in/%7Eanujjain/>
>

--
Thanks & Regards,
Anuj Jain
Mob. : +91- 8588817877
Skype : anuj.jain07
<http://www.oracle.com/>
<http://www.cse.iitm.ac.in/%7Eanujjain/>
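
For readers of this thread, the checks discussed above (fill rate of a column, counting events that fail schema validation, and alerting when the fill rate drops below a threshold) can be sketched roughly as follows. This is plain Python and deliberately does not use the Flink API. In a Flink job, the `errors` list would roughly correspond to a side output and the counters to Flink metrics. All names here (`QualityReport`, `check_events`, `breaches_threshold`, the `id` and `email` fields) are hypothetical, and "schema validation" is reduced to a required-field check purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Sequence


@dataclass
class QualityReport:
    """Counters for one column; a stand-in for metrics a real job would export."""
    total: int = 0    # events that passed validation
    filled: int = 0   # valid events whose column is non-empty
    errors: List[Dict[str, Any]] = field(default_factory=list)  # "error queue"

    @property
    def fill_rate(self) -> float:
        # Assumption: with no valid events, report 1.0 so that no alert fires.
        return self.filled / self.total if self.total else 1.0


def is_valid(event: Dict[str, Any], required_fields: Sequence[str]) -> bool:
    """Simplified schema check: every required field must be present."""
    return all(name in event for name in required_fields)


def check_events(events: Iterable[Dict[str, Any]],
                 column: str,
                 required_fields: Sequence[str]) -> QualityReport:
    """Route invalid events to the error list and count fills for valid ones."""
    report = QualityReport()
    for event in events:
        if not is_valid(event, required_fields):
            report.errors.append(event)  # analogous to emitting to a side output
            continue
        report.total += 1
        if event.get(column) not in (None, ""):
            report.filled += 1
    return report


def breaches_threshold(report: QualityReport, min_fill_rate: float = 0.9) -> bool:
    """True when the fill rate is below the allowed minimum (90% by default)."""
    return report.fill_rate < min_fill_rate
```

A caller would build a report with something like `check_events(events, "email", ("id",))` and then pass it to `breaches_threshold`. In a real deployment the alerting decision would more likely live in the metrics system (for example, a Prometheus alert rule) than in the job itself.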