Zero downtime in Flink Stateful funciton

Mingmin Xu Mon, 22 Nov 2021 15:57:45 -0800

Hello team,

We're looking into the approach to migrate a databases-centralized
microservice to Flink Stateful function, and want to understand more about
downtime behaviors.


Briefly, the service consumes multiple Kafka topics, applies varied
processors on each event. With the Flink Stateful function, very likely
we'll have a state engine to replace the database and act as a router,
processors are deployed as remote functions.

>From documentation in https://flink.apache.org/stateful-functions.html, my
take is:
1. state engine is a regular Flink application;
2. remote function is a stateless service;

So #2 should be resilient to any failures caused by host failure, rolling
upgrade, or even region loss if the target URL is global;
For #1, how do we handle different scenarios of failures, like
* when a host is lost, all hosts are restarted. It could be some with
region failover;
* When one host is busy, is the back-pressure mechanism activated here?
* Can I avoid stop-the-world when upgrading #1?

Thanks!
Mingmin

Zero downtime in Flink Stateful funciton

Reply via email to