Re: Zero downtime in Flink Stateful funciton

Mingmin Xu Tue, 23 Nov 2021 12:10:41 -0800

Thank you Igal for the comments.

Flink's high availability[1] is good enough to avoid the
single-point-of-failure in JobManager. The other scenario I'm reading is
Task Failure Strategy[2], how is region failover strategy performed, given
that event routing is dynamic.


For upgrading in #1, I'm referring to the restart of state engine, aka the
Flink application. It should be un-avoidable during scaling in/out. Not
sure about regular (rolling-)deployment. For example we deploy the
micro-service every week even no code change in our side, but may be
changes in dependencies.
-
[2]
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#failover-strategies


Mingmin


On Tue, Nov 23, 2021 at 7:24 AM Igal Shilman <i...@apache.org> wrote:

> Hello Mingmin,
>
> Your described scenario and use case indeed seems like a good fit for
> StateFun.
> Also your analysis is correct, StateFun is executed as a very specific
> Flink application (this is what you call "state engine"), and remote
> functions are effectively a stateless service.
>
> * when a host is lost, all hosts are restarted. It could be some with
>> region failover;
>
>
> You can read more about Flink's high availability here [1]
>
> * When one host is busy, is the back-pressure mechanism activated here?
>
> I assume by host you mean, one of the worker hosts (Flink TaskManagers).
> Flink implements back pressure nativity,
> And StateFun benefits from that out of the box (plus some additional
> tricks that StateFun provides on top).
>
> If you mean by host a remote function, then StateFun will buffer some
> messages internally in case of a slow remote function, and if that buffer
> fills up, it will trigger back pressure.
>
> * Can I avoid stop-the-world when upgrading #1?
>
> It is not really clear what you mean by upgrading, if you mean the remote
> function: then you can upgrade it without any changes to the Flink side of
> things. If you need to upgrade the Flink cluster, then unfortunately
> currently this is a stop-the-world upgrade.
> The nice thing about how StateFun is implemented as a Flink application,
> is that it allows adding/removing remote functions without
> the need of going trough a stop-the-world upgrade.
>
> I hope this helps a bit,
> Igal.
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/ha/overview/
>
>
> On Tue, Nov 23, 2021 at 12:58 AM Mingmin Xu <ming...@apache.org> wrote:
>
>> Hello team,
>>
>> We're looking into the approach to migrate a databases-centralized
>> microservice to Flink Stateful function, and want to understand more about
>> downtime behaviors.
>>
>> Briefly, the service consumes multiple Kafka topics, applies varied
>> processors on each event. With the Flink Stateful function, very likely
>> we'll have a state engine to replace the database and act as a router,
>> processors are deployed as remote functions.
>>
>> From documentation in https://flink.apache.org/stateful-functions.html,
>> my take is:
>> 1. state engine is a regular Flink application;
>> 2. remote function is a stateless service;
>>
>> So #2 should be resilient to any failures caused by host failure, rolling
>> upgrade, or even region loss if the target URL is global;
>> For #1, how do we handle different scenarios of failures, like
>> * when a host is lost, all hosts are restarted. It could be some with
>> region failover;
>> * When one host is busy, is the back-pressure mechanism activated here?
>> * Can I avoid stop-the-world when upgrading #1?
>>
>> Thanks!
>> Mingmin
>>
>

Re: Zero downtime in Flink Stateful funciton

Reply via email to