Hi Kevin,

I think this is a valuable question, and it may be worth posting it to the user mailing list as well to get more feedback.
As for the question itself, I just want to share some observations:

1. When there are hundreds of data pipelines, it is nearly impossible to keep all of them working properly for a long time, so the failover strategy deserves careful consideration. The exponential-delay restart strategy may be a good choice; you can check the docs for more details: https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/ . Another point: if in your SQL job all of these topics are processed in parallel and no operator connects them, make sure the config 'jobmanager.execution.failover-strategy' is set to 'region', so that a failure in one pipeline does not cause the other pipelines to be restarted.

2. Sometimes logical isolation may not be enough. For some mission-critical pipelines, it is better to keep them in a dedicated Flink job, and sometimes a backup job is also required.

3. With hundreds of Kafka topics, one more point is worth mentioning: make sure the topics are distributed and replicated correctly. If the upstream Kafka brokers fail for some common reason, the downstream Flink job can do nothing about it.

Hope these points help!

Best,
Biao Geng

Kevin Lam <kevin....@shopify.com.invalid> wrote on Thu, May 9, 2024, 03:52:

> Hi everyone,
>
> I'm currently prototyping on a project where we need to process a large
> number of Kafka input topics (say, a couple of hundred), all of which share
> the same DataType/Schema.
>
> Our objective is to run the same Flink SQL on all of the input topics, but
> I am concerned about doing this in a single large Flink SQL application for
> fault isolation purposes. We'd like to limit the "blast radius" in cases of
> data issues or "poison pills" in any particular Kafka topic — meaning, if
> one topic runs into a problem, it shouldn’t compromise or halt the
> processing of the others.
>
> At the same time, we are concerned about the operational toil associated
> with managing hundreds of Flink jobs that are really one logical
> application.
>
> Has anyone here tackled a similar challenge? If so:
>
> 1. How did you design your solution to handle a vast number of topics
>    without creating a heavy management burden?
> 2. What strategies or patterns have you found effective in isolating
>    issues within a specific topic so that they do not affect the
>    processing of others?
> 3. Are there specific configurations or tools within the Flink ecosystem
>    that you'd recommend to efficiently manage this scenario?
>
> Any examples, suggestions, or references to relevant documentation would
> be helpful. Thank you in advance for your time and help!
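
P.S. For reference, a minimal flink-conf.yaml sketch of the settings mentioned in point 1 above. The backoff values here are illustrative assumptions, not tuned recommendations; pick them based on how quickly your Kafka issues typically resolve:

```yaml
# Restart only the failed region (a set of connected tasks), not the whole
# job. This only isolates pipelines if no operator connects the topics.
jobmanager.execution.failover-strategy: region

# Exponential-delay restart strategy: back off progressively on repeated
# failures instead of hammering a broken topic with fixed-delay restarts.
restart-strategy: exponential-delay
restart-strategy.exponential-delay.initial-backoff: 1 s
restart-strategy.exponential-delay.max-backoff: 2 min
restart-strategy.exponential-delay.backoff-multiplier: 2.0
# After this long without failures, the backoff resets to the initial value.
restart-strategy.exponential-delay.reset-backoff-threshold: 10 min
restart-strategy.exponential-delay.jitter-factor: 0.1
```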