For metrics about failover, you can refer to the Availability section of the metrics documentation [1].

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#availability

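As a rough illustration of reading such a metric from outside the job, here is a minimal sketch that polls the job-level metrics endpoint of the Flink REST API. It assumes the REST endpoint is reachable at an address such as http://localhost:8081, that the job ID is known, and that the `numRestarts` metric from the Availability section is exposed for the job. The helper names (`metrics_url`, `parse_metrics`, `fetch_restart_count`) are made up for this example and are not part of Flink.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def metrics_url(rest_base, job_id, names):
    """Build the URL for the job-level metrics endpoint of the REST API."""
    # Keep commas unescaped so the metric names stay readable in the query.
    query = urlencode({"get": ",".join(names)}, safe=",")
    return f"{rest_base.rstrip('/')}/jobs/{job_id}/metrics?{query}"


def parse_metrics(body):
    """Convert the response (a JSON list of {"id", "value"} objects) to a dict."""
    return {entry["id"]: entry["value"] for entry in json.loads(body)}


def fetch_restart_count(rest_base, job_id):
    """Return the restart count of a job (assumes numRestarts is reported)."""
    with urlopen(metrics_url(rest_base, job_id, ["numRestarts"])) as resp:
        metrics = parse_metrics(resp.read().decode("utf-8"))
    return int(metrics["numRestarts"])
```

Calling `fetch_restart_count("http://localhost:8081", job_id)` with a real job ID would then give the number of restarts since the job was submitted. The address and the metric's availability are assumptions, so please check them against your own setup.
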
Best,
Weihua

On Tue, Mar 28, 2023 at 1:34 AM santhosh venkat <santhoshvenkat1...@gmail.com> wrote:

> Hi,
>
> Thank you so much for taking the time to answer my questions and for
> pointing me to the relevant documentation. I really appreciate it.
>
> When a task failover happens, are there internal metrics in Flink at the
> job level to track the new execution attempt? Is there a way for the
> application owner to find out how many task failovers have happened in a
> job execution and to get the current execution attempt?
>
> Thanks.
>
> On Mon, Mar 27, 2023 at 2:55 AM Weihua Hu <huweihua....@gmail.com> wrote:
>
> > Hi,
> >
> > > 1. Does this mean that each task slot will contain an entire pipeline
> > > in the job?
> >
> > Not exactly. Each slot runs one subtask of each task. If the job is
> > simple enough that there is no keyBy logic and the rebalance shuffle
> > type is not enabled, each slot can run the whole pipeline. Otherwise,
> > data has to be shuffled to other subtasks.
> > You can find some examples in [1].
> >
> > > 2. Upon a TM pod failure and after K8s brings back the TM pod, would
> > > flink assign the same subtasks back to restarted TM again? Or will
> > > they be distributed to different TaskManagers?
> >
> > If there is no shuffle data in your job (as described in 1), only the
> > tasks on the failed pods are restarted, and they are assigned to the
> > new TM again. Otherwise, all the related tasks are restarted. When
> > these tasks are rescheduled, there are several strategies for assigning
> > slots. Flink tries to assign each task to its previous slot to reduce
> > the recovery time, but there is no guarantee.
> > You can read [2] for more information about failure recovery.
> >
> > [1] https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/flink-architecture/#tasks-and-operator-chains
> > [2] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/
> >
> > Best,
> > Weihua
> >
> > On Mon, Mar 27, 2023 at 3:22 PM santhosh venkat <santhoshvenkat1...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am trying to understand how subtask distribution works in Flink.
> > > Let's assume a Flink cluster with a fixed number of TaskManagers in a
> > > Kubernetes cluster.
> > >
> > > Let's say I have a Flink job in which all operators have the same
> > > parallelism and the same slot sharing group. The operator parallelism
> > > is computed as the number of TaskManagers multiplied by the number of
> > > task slots per TM.
> > >
> > > 1. Does this mean that each task slot will contain an entire pipeline
> > > in the job?
> > > 2. Upon a TM pod failure and after K8s brings back the TM pod, would
> > > Flink assign the same subtasks back to the restarted TM again? Or will
> > > they be distributed to different TaskManagers?
> > >
> > > It would be great if someone could answer these questions.
> > >
> > > Thanks.