Thank you Igal for the comments. Flink's high availability[1] is good enough to avoid the single-point-of-failure in JobManager. The other scenario I'm reading is Task Failure Strategy[2], how is region failover strategy performed, given that event routing is dynamic.
For upgrading in #1, I'm referring to the restart of state engine, aka the Flink application. It should be un-avoidable during scaling in/out. Not sure about regular (rolling-)deployment. For example we deploy the micro-service every week even no code change in our side, but may be changes in dependencies. - [2] https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#failover-strategies Mingmin On Tue, Nov 23, 2021 at 7:24 AM Igal Shilman <i...@apache.org> wrote: > Hello Mingmin, > > Your described scenario and use case indeed seems like a good fit for > StateFun. > Also your analysis is correct, StateFun is executed as a very specific > Flink application (this is what you call "state engine"), and remote > functions are effectively a stateless service. > > * when a host is lost, all hosts are restarted. It could be some with >> region failover; > > > You can read more about Flink's high availability here [1] > > * When one host is busy, is the back-pressure mechanism activated here? > > I assume by host you mean, one of the worker hosts (Flink TaskManagers). > Flink implements back pressure nativity, > And StateFun benefits from that out of the box (plus some additional > tricks that StateFun provides on top). > > If you mean by host a remote function, then StateFun will buffer some > messages internally in case of a slow remote function, and if that buffer > fills up, it will trigger back pressure. > > * Can I avoid stop-the-world when upgrading #1? > > It is not really clear what you mean by upgrading, if you mean the remote > function: then you can upgrade it without any changes to the Flink side of > things. If you need to upgrade the Flink cluster, then unfortunately > currently this is a stop-the-world upgrade. > The nice thing about how StateFun is implemented as a Flink application, > is that it allows adding/removing remote functions without > the need of going trough a stop-the-world upgrade. > > I hope this helps a bit, > Igal. > > [1] > https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/ha/overview/ > > > On Tue, Nov 23, 2021 at 12:58 AM Mingmin Xu <ming...@apache.org> wrote: > >> Hello team, >> >> We're looking into the approach to migrate a databases-centralized >> microservice to Flink Stateful function, and want to understand more about >> downtime behaviors. >> >> Briefly, the service consumes multiple Kafka topics, applies varied >> processors on each event. With the Flink Stateful function, very likely >> we'll have a state engine to replace the database and act as a router, >> processors are deployed as remote functions. >> >> From documentation in https://flink.apache.org/stateful-functions.html, >> my take is: >> 1. state engine is a regular Flink application; >> 2. remote function is a stateless service; >> >> So #2 should be resilient to any failures caused by host failure, rolling >> upgrade, or even region loss if the target URL is global; >> For #1, how do we handle different scenarios of failures, like >> * when a host is lost, all hosts are restarted. It could be some with >> region failover; >> * When one host is busy, is the back-pressure mechanism activated here? >> * Can I avoid stop-the-world when upgrading #1? >> >> Thanks! >> Mingmin >> >