Hi Kevin,

I think this is a valuable question, and it may be worth posting it to the user mailing list as well to get more feedback.
As for the question itself, I just want to share some observations:

1. When there are hundreds of data pipelines, it is nearly impossible to keep all of them working properly for a long time, so the failover strategy deserves careful consideration. The exponential-delay restart strategy may be a good choice; you can check the docs for more details: https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/ . Another point: if in your SQL job all of these topics are processed in parallel and no operator connects them, make sure the config 'jobmanager.execution.failover-strategy' is set to 'region', so that a failure in one pipeline does not cause the other pipelines to be restarted.

2. Sometimes logical isolation may not be enough. For some mission-critical pipelines, it is better to keep them in a dedicated Flink job, and sometimes a backup job is also required.

3. With hundreds of Kafka topics, one more point is worth mentioning: make sure the topics are distributed and replicated correctly. If the upstream Kafka brokers fail for some common reason, the downstream Flink job can do nothing about it.

Hope these points help!

Best,
Biao Geng

Kevin Lam <kevin....@shopify.com.invalid> wrote on Thu, May 9, 2024, 03:52:

> Hi everyone,
>
> I'm currently prototyping on a project where we need to process a large
> number of Kafka input topics (say, a couple of hundred), all of which share
> the same DataType/Schema.
>
> Our objective is to run the same Flink SQL on all of the input topics, but
> I am concerned about doing this in a single large Flink SQL application for
> fault isolation purposes. We'd like to limit the "blast radius" in cases of
> data issues or "poison pills" in any particular Kafka topic — meaning, if
> one topic runs into a problem, it shouldn’t compromise or halt the
> processing of the others.
>
> At the same time, we are concerned about the operational toil associated
> with managing hundreds of Flink jobs that are really one logical
> application.
>
> Has anyone here tackled a similar challenge? If so:
>
> 1. How did you design your solution to handle a vast number of topics
>    without creating a heavy management burden?
> 2. What strategies or patterns have you found effective in isolating
>    issues within a specific topic so that they do not affect the
>    processing of others?
> 3. Are there specific configurations or tools within the Flink ecosystem
>    that you'd recommend to efficiently manage this scenario?
>
> Any examples, suggestions, or references to relevant documentation would
> be helpful. Thank you in advance for your time and help!
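
P.S. For reference, a minimal flink-conf.yaml sketch of the settings mentioned in point 1 above. The backoff values here are illustrative assumptions, not tuned recommendations; pick them based on how quickly your Kafka issues typically resolve:

```yaml
# Restart only the failed region (a set of connected tasks), not the whole
# job. This only isolates pipelines if no operator connects the topics.
jobmanager.execution.failover-strategy: region

# Exponential-delay restart strategy: back off progressively on repeated
# failures instead of hammering a broken topic with fixed-delay restarts.
restart-strategy: exponential-delay
restart-strategy.exponential-delay.initial-backoff: 1 s
restart-strategy.exponential-delay.max-backoff: 2 min
restart-strategy.exponential-delay.backoff-multiplier: 2.0
# After this long without failures, the backoff resets to the initial value.
restart-strategy.exponential-delay.reset-backoff-threshold: 10 min
restart-strategy.exponential-delay.jitter-factor: 0.1
```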