For metrics about failover, you can refer to [1]

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#availability

Best,
Weihua


On Tue, Mar 28, 2023 at 1:34 AM santhosh venkat <
santhoshvenkat1...@gmail.com> wrote:

> Hi,
>
> Thank you so much for taking time to answer my questions and pointing me to
> relevant documentation. Really appreciate it.
>
> When the task failover happens, are there internal metrics in Flink at a
> job level to track the new execution attempt?  Is there a way for the
> application owner to figure out how many task failovers have happened in a
> job execution and get the current execution attempt.
>
> Thanks.
>
> On Mon, Mar 27, 2023 at 2:55 AM Weihua Hu <huweihua....@gmail.com> wrote:
>
> > Hi,
> >
> > 1. Does this mean that  each  task slot will contain an entire pipeline
> in
> > > the job?
> >
> > not exactly, each slot will run a subtask of each task. If the job is so
> > simple that
> > there is no keyby logic and we do not enable rebalance shuffle type, each
> > slot
> > could run all the pipeline. But if not we need to shuffle data to other
> > subtasks.
> > You can get some examples from [1].
> >
> > 2. Upon a TM pod failure and after K8s brings back the TM pod, would
> flink
> > > assign the same subtasks back to restarted TM  again? Or will they be
> > > distributed to different TaskManagers?
> >
> > If there is no shuffle data in your job (described in 1), only tasks on
> > failure pods
> >  will be restarted, and they will be assigned to the new TM again.
> > But if not, all the related tasks will be restarted. When these tasks
> > re-scheduled,
> > there are some strategy to assign slots. They will try to assign the task
> > to previous
> > slot to reduce the recovery time, But there is no guarantee.
> > You can read [2] to get more information about failure recovery.
> >
> >
> > [1]
> >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/flink-architecture/#tasks-and-operator-chains
> > [2]
> >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/
> >
> > Best,
> > Weihua
> >
> >
> > On Mon, Mar 27, 2023 at 3:22 PM santhosh venkat <
> > santhoshvenkat1...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am trying to understand how subtask distribution works in Flink.
> Let's
> > > assume a setup of a Flink cluster with a fixed number of TaskManagers
> in
> > a
> > > kubernetes cluster.
> > >
> > > Let's say I have a flink job with all the operators having the same
> > > parallelism and with the same Slot sharing group. The operator
> > parallelism
> > > is computed as the number of task managers multiplied by number of task
> > > slots per TM.
> > >
> > > 1. Does this mean that  each  task slot will contain an entire pipeline
> > in
> > > the job?
> > > 2. Upon a TM pod failure and after K8s brings back the TM pod, would
> > flink
> > > assign the same subtasks back to restarted TM  again? Or will they be
> > > distributed to different TaskManagers?
> > >
> > > It would be great if someone can answer this question.
> > >
> > > Thanks.
> > >
> >
>

Reply via email to