Hi, vtygoss

> I'm working on migrating from a full-data pipeline (with Spark) to an
> incremental-data pipeline (with Flink CDC), and I met a problem about accuracy
> validation between the Flink-based and Spark-based pipelines.

Glad to hear that!

> For bounded data, it's simple to validate whether the two result sets are
> consistent or not. But for unbounded data and event-driven applications, how
> can we make sure the data stream produced is correct, especially when there
> are retract functions with a high impact, e.g. row_number?
>
> Is there any document for this problem? Thanks for any suggestions or replies.

The validation feature belongs to the data quality scope, from my understanding;
it's usually provided by the platform, e.g. the Data Integration Platform. As the
underlying pipeline engine/tool, Flink CDC should expose more metrics and data
quality checking abilities, but we don't offer them yet; these enhancements are
on our roadmap.

Currently, you can use the Flink source/sink operators' metrics as a rough
validation, and you can also compare the record counts in your source database
and sink system multiple times for a more accurate validation.

Best,
Leonard
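
[Editorial note] The count comparison described above can be scripted outside of
Flink. Below is a minimal sketch, assuming both the source database and the sink
system are reachable over plain JDBC; the connection URLs, credentials, and the
"orders" table name are placeholders, not part of any Flink CDC API.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CountValidation {

        // Count the rows of one table through a plain JDBC connection.
        private static long countRows(String jdbcUrl, String user, String password, String table)
                throws Exception {
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM " + table)) {
                rs.next();
                return rs.getLong(1);
            }
        }

        public static void main(String[] args) throws Exception {
            // Placeholder endpoints and table name; replace with your own source and sink.
            long sourceCount = countRows("jdbc:mysql://source-host:3306/mydb", "user", "pwd", "orders");
            long sinkCount   = countRows("jdbc:postgresql://sink-host:5432/mydb", "user", "pwd", "orders");

            // For an unbounded pipeline the two counts are taken at different instants,
            // so a small transient difference is expected; repeat the check several
            // times and only treat a persistent gap as a real inconsistency.
            System.out.printf("source=%d, sink=%d, diff=%d%n",
                    sourceCount, sinkCount, sourceCount - sinkCount);
        }
    }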