Hello team,

We're looking into migrating a database-centric microservice to Flink Stateful Functions, and want to understand more about its downtime behavior.
Briefly, the service consumes multiple Kafka topics and applies different processors to each event. With Flink Stateful Functions, we will very likely have a state engine that replaces the database and acts as a router, with the processors deployed as remote functions.

From the documentation at https://flink.apache.org/stateful-functions.html, my take is:
1. the state engine is a regular Flink application;
2. a remote function is a stateless service.

So #2 should be resilient to failures caused by host loss, rolling upgrades, or even region loss if the target URL is global.

For #1, how do we handle different failure scenarios, such as:
* When a host is lost, are all hosts restarted? The same could apply to a region failover.
* When one host is busy, is the back-pressure mechanism activated?
* Can I avoid stop-the-world when upgrading #1?

Thanks!
Mingmin