[ 
https://issues.apache.org/jira/browse/FLINK-38668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18038752#comment-18038752
 ] 

RocMarshal edited comment on FLINK-38668 at 11/17/25 3:53 AM:
--------------------------------------------------------------

The original testing guide is available here: 
[https://docs.google.com/document/d/1ZXSwtwGeSxy8L2AHdpTnumhXNWWisho_a8dcxRYSvsk/edit?tab=t.0#heading=h.1vcje3u1wogz]

Its content is as follows:
h1. 1 Motivation

This document outlines the core working principles of the functionality 
added by FLIP-370 and the key test cases that cross-team testing should 
focus on to verify the correctness of the feature.
h1. 2 Prerequisite: the core logic of balanced scheduling

Please refer to this 
[page|https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/tasks-scheduling/balanced_tasks_scheduling/].
 
h1. 3 Constructing and validating test cases

As stated in the [FLIP|https://cwiki.apache.org/confluence/x/U56zDw] document, 
balanced task scheduling works from the perspective of the JobMaster's SlotPool 
and balances the tasks of a single job. Therefore, all test cases in this guide 
can be verified in application execution mode, regardless of whether the 
resources are provided by YARN or Kubernetes.

Testing jobs: [https://github.com/RocMarshal/flip370-testing-jobs] 

 
h2. 3.1 Test for a job that contains a single slot sharing group
h3. 3.1.1 Regular job test
 * Test case code
 ** Entry-point class: _flip370.testing.slotsharinggroup.single.FlinkTestingJob_
 ** Code description: FlinkTestingJob describes a streaming job that contains a 
default slot sharing group. The job includes a source operator with a 
parallelism of 10 and a sink operator with a parallelism of 20.
 ** Job-level startup parameters: N/A

 * Description of the necessary configurations for an application cluster
 ** _taskmanager.load-balance.mode: TASKS_
 ** _taskmanager.numberOfTaskSlots: 2_
 ** _restart-strategy.fixed-delay.attempts: 32_
 ** _restart-strategy.fixed-delay.delay: 10s_
 ** _jobmanager.scheduler: Adaptive/Default_ (run the test once with each value)

 * Submit the Flink job.
 * Verification results
 ** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
 *** Find the TaskManager on which each task is located by following the steps 
shown in the screenshot:
 **** !image-2025-11-17-11-33-45-545.png!
 *** {color:#de350b}*Does it meet the balanced scheduling result (each 
TaskManager contains 3 tasks)?*{color}
 ** *{color:#de350b}For jobmanager.scheduler: Default{color}*
 *** *{color:#de350b}Does it meet the balanced scheduling result (each 
TaskManager contains 3 tasks)?{color}*
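
The manual check above can also be scripted once the task placement has been collected. The following is a minimal sketch and not part of the testing jobs: it assumes the subtask-to-TaskManager assignment is already available as a list of records (for example copied from the web UI), and the record keys {{vertex}}, {{subtask}} and {{taskmanager_id}} are hypothetical names chosen for this illustration.

```python
from collections import Counter

# From the test case: each TaskManager is expected to hold 3 tasks.
EXPECTED_TASKS_PER_TM = 3


def tasks_per_taskmanager(assignments):
    """Count how many tasks run on each TaskManager.

    `assignments` is a list of dicts with the hypothetical keys
    'vertex', 'subtask' and 'taskmanager_id'.
    """
    return Counter(a["taskmanager_id"] for a in assignments)


def unbalanced_taskmanagers(assignments, expected=EXPECTED_TASKS_PER_TM):
    """Return {taskmanager_id: count} for every TaskManager whose
    task count differs from `expected`."""
    counts = tasks_per_taskmanager(assignments)
    return {tm: n for tm, n in counts.items() if n != expected}


# Synthetic data shaped like the single-group job: 10 source and 20 sink
# subtasks spread over 10 TaskManagers (1 source and 2 sink subtasks each).
sample = []
for tm in range(10):
    tm_id = f"tm-{tm}"
    sample.append({"vertex": "source", "subtask": tm, "taskmanager_id": tm_id})
    sample.append({"vertex": "sink", "subtask": 2 * tm, "taskmanager_id": tm_id})
    sample.append({"vertex": "sink", "subtask": 2 * tm + 1, "taskmanager_id": tm_id})

# An empty dict means every TaskManager holds exactly the expected number of tasks.
print(unbalanced_taskmanagers(sample))
```

Any non-empty result lists the TaskManagers to inspect more closely in the web UI.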

h3. 3.1.2 Failover scenario test
h4. 3.1.2.1 Failover scenario test triggered by tasks
 * Test case code
 ** Entry-point 
class:{_}flip370.testing.slotsharinggroup.single.FlinkTestingJob{_}
 ** Code description: FlinkTestingJob describes a streaming job that contains a 
default slot sharing group. The job includes a source operator with a 
parallelism of 10 and a sink operator with a parallelism of 20.
 ** Job-level startup parameters: pass 300000 (milliseconds, i.e. 5 minutes) as 
an argument to the Flink job entry class. With this value, subtask 0 of the 
source operator throws a task exception every 5 minutes to trigger a job failover: 
 _flip370.testing.slotsharinggroup.single.FlinkTestingJob 300000_

 * Description of the necessary configurations for an application cluster
 ** _taskmanager.load-balance.mode: TASKS_
 ** _taskmanager.numberOfTaskSlots: 2_
 ** _restart-strategy.fixed-delay.attempts: 32_
 ** _restart-strategy.fixed-delay.delay: 10s_
 ** _jobmanager.scheduler: Adaptive/Default_ (run the test once with each value)

 * Submit the Flink job.
 * Wait for the task exception to trigger a failover.
 * *{color:#de350b}Verification results{color}*
 ** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
 *** *{color:#de350b}Does it meet the balanced scheduling result (each 
TaskManager contains 3 tasks)?{color}*
 ** *{color:#de350b}For jobmanager.scheduler: Default{color}*
 *** *{color:#de350b}Does it meet the balanced scheduling result (each 
TaskManager contains 3 tasks)?{color}*
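
Because every case in this guide is run once per scheduler, it can help to generate both configuration variants from one place. The following is a minimal sketch that only renders the option keys listed above as flat {{key: value}} lines. How these lines reach the cluster (configuration file, dynamic properties, deployment templates) depends on the environment and is not covered here. The helper names are hypothetical.

```python
# Options shared by every test case in this guide.
BASE_CONFIG = {
    "taskmanager.load-balance.mode": "TASKS",
    "taskmanager.numberOfTaskSlots": "2",
    "restart-strategy.fixed-delay.attempts": "32",
    "restart-strategy.fixed-delay.delay": "10s",
}

# Each test case is verified once per scheduler.
SCHEDULERS = ("Adaptive", "Default")


def config_for(scheduler):
    """Return the full option set for one scheduler variant."""
    if scheduler not in SCHEDULERS:
        raise ValueError(f"unexpected scheduler: {scheduler}")
    config = dict(BASE_CONFIG)
    config["jobmanager.scheduler"] = scheduler
    return config


def render(config):
    """Render the options as flat 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in config.items())


for scheduler in SCHEDULERS:
    print(f"# variant for jobmanager.scheduler = {scheduler}")
    print(render(config_for(scheduler)))
    print()
```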

h4. 3.1.2.2 Failover scenario test triggered by TaskManagers
 * Test case code
 ** Entry-point class: _flip370.testing.slotsharinggroup.single.FlinkTestingJob_
 ** Code description: FlinkTestingJob describes a streaming job that contains a 
default slot sharing group. The job includes a source operator with a 
parallelism of 10 and a sink operator with a parallelism of 20.
 ** Job-level startup parameters: N/A

 * Description of the necessary configurations for an application cluster
 ** _taskmanager.load-balance.mode: TASKS_
 ** _taskmanager.numberOfTaskSlots: 2_
 ** _restart-strategy.fixed-delay.attempts: 32_
 ** _restart-strategy.fixed-delay.delay: 10s_
 ** _jobmanager.scheduler: Adaptive/Default_ (run the test once with each value)

 * Submit the Flink job.
 * Simulate a TaskManager-level failure:
 ** Manually kill one or more TaskManager instances/containers in the job 
cluster.
 * Wait for the failover to complete.
 * {color:#de350b}*Verification results*{color}
 ** {color:#de350b}*For jobmanager.scheduler: Adaptive*{color}
 *** {color:#de350b}*Does it meet the balanced scheduling result (each 
TaskManager contains 3 tasks)?*{color}
 ** {color:#de350b}*For jobmanager.scheduler: Default*{color}
 *** {color:#de350b}*Does it meet the balanced scheduling result (each 
TaskManager contains 3 tasks)?*{color}
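
For the TaskManager failover case, the check after recovery has two parts: the killed TaskManagers should no longer host tasks, and the TaskManagers in use should again hold 3 tasks each. The following is a minimal sketch under stated assumptions: the per-TaskManager task counts after recovery have been collected by hand into a dict, a replacement TaskManager gets a new ID, and all names and IDs below are hypothetical.

```python
def check_after_tm_failover(after_counts, killed_tm_ids, expected=3):
    """Return a list of human-readable problems; an empty list means the check passed.

    after_counts:  {taskmanager_id: number of tasks} observed after recovery.
    killed_tm_ids: IDs of the TaskManagers that were killed manually.
    expected:      tasks per TaskManager required by the test case.
    """
    problems = []
    # A killed TaskManager should not appear in the placement after recovery.
    for tm in sorted(set(killed_tm_ids) & set(after_counts)):
        problems.append(f"{tm}: killed TaskManager still hosts tasks")
    # Every TaskManager in use should hold exactly the expected number of tasks.
    for tm, count in sorted(after_counts.items()):
        if count != expected:
            problems.append(f"{tm}: hosts {count} tasks, expected {expected}")
    return problems


# Synthetic example: tm-0 was killed and replaced by tm-10.
after = {f"tm-{i}": 3 for i in range(1, 11)}
print(check_after_tm_failover(after, killed_tm_ids={"tm-0"}))
```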

h2. 3.2 Test for a job that contains multiple slot sharing groups
h3. 3.2.1 Regular job test
 * Test case code
 ** Entry-point class: _flip370.testing.slotsharinggroup.multiple.FlinkTestingJob_
 ** Code description: FlinkTestingJob describes a streaming job that contains a 
default slot sharing group and an ssg2 slot sharing group. Each slot sharing 
group contains a source operator with a parallelism of 10 and a sink operator 
with a parallelism of 20.
 ** Job-level startup parameters: N/A

 * Description of the necessary configurations for an application cluster
 ** _taskmanager.load-balance.mode: TASKS_
 ** _taskmanager.numberOfTaskSlots: 2_
 ** _restart-strategy.fixed-delay.attempts: 32_
 ** _restart-strategy.fixed-delay.delay: 10s_
 ** _jobmanager.scheduler: Adaptive/Default_ (run the test once with each value)

 * Submit the Flink job.
 * *{color:#de350b}Verification results{color}*
 ** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
 *** *{color:#de350b}Does it meet the balanced scheduling result (each 
TaskManager contains 3 tasks)?{color}*
 ** *{color:#de350b}For jobmanager.scheduler: Default{color}*
 *** *{color:#de350b}Does it meet the balanced scheduling result (each 
TaskManager contains 3 tasks)?{color}*
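
The expected value of 3 tasks per TaskManager is the same for the single-group and the multiple-group job, and it can be derived from the job shapes described in this guide. The following is a minimal sketch of that arithmetic. It assumes that each slot sharing group needs as many slots as its highest operator parallelism, that the source and sink run as separate tasks, and that the cluster holds exactly the TaskManagers needed to provide those slots. The function name is hypothetical.

```python
import math


def expected_layout(groups, slots_per_tm):
    """Derive the expected placement figures for a job.

    groups:       one list of operator parallelisms per slot sharing group.
    slots_per_tm: value of taskmanager.numberOfTaskSlots.
    """
    # A slot sharing group needs as many slots as its highest parallelism.
    slots = sum(max(group) for group in groups)
    # Every parallel instance of every operator counts as one task.
    tasks = sum(sum(group) for group in groups)
    taskmanagers = math.ceil(slots / slots_per_tm)
    return {
        "slots": slots,
        "taskmanagers": taskmanagers,
        "tasks": tasks,
        "tasks_per_tm": tasks / taskmanagers,
    }


# Single group: source parallelism 10, sink parallelism 20.
single = expected_layout([[10, 20]], slots_per_tm=2)
# Two groups (default and ssg2), each with the same source and sink.
multiple = expected_layout([[10, 20], [10, 20]], slots_per_tm=2)

print(single)
print(multiple)
```

Under these assumptions the single-group job uses 20 slots on 10 TaskManagers for 30 tasks, and the multiple-group job uses 40 slots on 20 TaskManagers for 60 tasks, which gives 3 tasks per TaskManager in both cases.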

h3. 3.2.2 Failover scenario test
h4. 3.2.2.1 Failover scenario test triggered by tasks
 * Test case code
 ** Entry-point class: _flip370.testing.slotsharinggroup.multiple.FlinkTestingJob_
 ** Code description: FlinkTestingJob describes a streaming job that contains a 
default slot sharing group and an ssg2 slot sharing group. Each slot sharing 
group contains a source operator with a parallelism of 10 and a sink operator 
with a parallelism of 20.
 ** Job-level startup parameters: pass 300000 (milliseconds, i.e. 5 minutes) as 
an argument to the Flink job entry class. With this value, subtask 0 of the 
source operator throws a task exception every 5 minutes to trigger a job failover: 
 _flip370.testing.slotsharinggroup.multiple.FlinkTestingJob 300000_

 * Description of the necessary configurations for an application cluster
 ** _taskmanager.load-balance.mode: TASKS_
 ** _taskmanager.numberOfTaskSlots: 2_
 ** _restart-strategy.fixed-delay.attempts: 32_
 ** _restart-strategy.fixed-delay.delay: 10s_
 ** _jobmanager.scheduler: Adaptive/Default_ (run the test once with each value)

 * Submit the Flink job.
 * Wait for the task exception to trigger a failover.
 * {color:#de350b}*Verification results*{color}
 ** {color:#de350b}*For jobmanager.scheduler: Adaptive*{color}
 *** {color:#de350b}*Does it meet the balanced scheduling result (each 
TaskManager contains 3 tasks)?*{color}
 ** {color:#de350b}*For jobmanager.scheduler: Default*{color}
 *** {color:#de350b}*Does it meet the balanced scheduling result (each 
TaskManager contains 3 tasks)?*{color}

h4. 3.2.2.2 Failover scenario test triggered by TaskManagers
 * Test case code
 ** Entry-point class: _flip370.testing.slotsharinggroup.multiple.FlinkTestingJob_
 ** Code description: FlinkTestingJob describes a streaming job that contains a 
default slot sharing group and an ssg2 slot sharing group. Each slot sharing 
group contains a source operator with a parallelism of 10 and a sink operator 
with a parallelism of 20.
 ** Job-level startup parameters: N/A

 * Description of the necessary configurations for an application cluster
 ** _taskmanager.load-balance.mode: TASKS_
 ** _taskmanager.numberOfTaskSlots: 2_
 ** _restart-strategy.fixed-delay.attempts: 32_
 ** _restart-strategy.fixed-delay.delay: 10s_
 ** _jobmanager.scheduler: Adaptive/Default_ (run the test once with each value)

 * Submit the Flink job.
 * Simulate a TaskManager-level failure:
 ** Manually kill one or more TaskManager instances/containers in the job 
cluster.
 * Wait for the failover to complete.
 * *{color:#de350b}Verification results{color}*
 ** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
 *** *{color:#de350b}Does it meet the balanced scheduling result (each 
TaskManager contains 3 tasks)?{color}*
 ** *{color:#de350b}For jobmanager.scheduler: Default{color}*
 *** *{color:#de350b}Does it meet the balanced scheduling result (each 
TaskManager contains 3 tasks)?{color}*

Ping [~RocMarshal] if there are any issues during testing.

Thank you!

CC [~ruanhang1993] 



> Release Testing Instructions: FLIP-370: Support Balanced Tasks Scheduling
> -------------------------------------------------------------------------
>
>                 Key: FLINK-38668
>                 URL: https://issues.apache.org/jira/browse/FLINK-38668
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: Ruan Hang
>            Assignee: Yuepeng Pan
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 2.2.0
>
>         Attachments: image-2025-11-17-11-33-45-545.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
