[
https://issues.apache.org/jira/browse/FLINK-38687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ruan Hang updated FLINK-38687:
------------------------------
Description:
The original testing guide doc is here:
[https://docs.google.com/document/d/1ZXSwtwGeSxy8L2AHdpTnumhXNWWisho_a8dcxRYSvsk/edit?tab=t.0#heading=h.1vcje3u1wogz]
Its content is as follows:
h1. 1 Motivation
This document introduces the core working principles of the functionality
added by FLIP-370, as well as the key test cases that cross-team testing
should focus on to verify the correctness of the feature.
h1. 2 Prerequisite: familiarity with the core logic of balanced scheduling
Please refer to this
[page|https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/tasks-scheduling/balanced_tasks_scheduling/].
h1. 3 Constructing and validating test cases
As stated in the [FLIP|https://cwiki.apache.org/confluence/x/U56zDw] document,
balanced task scheduling operates from the perspective of the JobMaster's
SlotPool and balances the tasks of a single job. Therefore, all test cases in
this test can be verified in application execution mode (regardless of
whether the resources come from YARN or Kubernetes).
Testing jobs: [https://github.com/RocMarshal/flip370-testing-jobs]
h2. 3.1 Test for a job that contains a slot sharing group
h3. 3.1.1 Regular job test
* Test case code
** Entry-point
class: {_}flip370.testing.slotsharinggroup.single.FlinkTestingJob{_}
** Code description: FlinkTestingJob describes a streaming job that contains a
default slot sharing group. The job includes a source operator with a
parallelism of 10 and a sink operator with a parallelism of 20.
** Job-level startup parameters: N.A.
* Description of the necessary configurations for an application cluster
** _taskmanager.load-balance.mode: TASKS_
** _taskmanager.numberOfTaskSlots: 2_
** _restart-strategy.fixed-delay.attempts: 32_
** _restart-strategy.fixed-delay.delay: 10s_
** _jobmanager.scheduler: Adaptive/Default_
* Submit the Flink job.
* Verification results
** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
*** Obtain the TaskManager on which each task is located through the following
steps:
**** !image-2025-11-17-11-33-45-545.png!
*** {color:#de350b}*Does it meet the balanced scheduling result (each task
manager contains 3 tasks)?*{color}
** *{color:#de350b}For jobmanager.scheduler: Default{color}*
*** *{color:#de350b}Does it meet the balanced scheduling result (each task
manager contains 3 tasks)?{color}*
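
The expected figure of 3 tasks per TaskManager follows from the shape of the job. Below is a minimal sketch of that arithmetic. It assumes that slot i of the sharing group hosts subtask i of every operator whose parallelism is greater than i, and that exactly as many TaskManagers are started as the slots require. The function names are illustrative, and the pairing step is a simple stand-in for the idea of balancing, not Flink's actual algorithm.

```python
def slot_loads(parallelisms):
    """Tasks per slot in one slot sharing group.

    Assumption: slot i hosts subtask i of every operator with parallelism > i.
    """
    slots = max(parallelisms)
    return [sum(1 for p in parallelisms if p > i) for i in range(slots)]


def balanced_tm_loads(loads):
    """Pair the lightest slot with the heaviest one, two slots per TaskManager.

    Illustrative pairing for taskmanager.numberOfTaskSlots = 2 only.
    """
    ordered = sorted(loads)
    return [ordered[i] + ordered[-1 - i] for i in range(len(ordered) // 2)]


# Source parallelism 10, sink parallelism 20, as in FlinkTestingJob.
loads = slot_loads([10, 20])
tm_loads = balanced_tm_loads(loads)
print(len(loads), len(tm_loads), set(tm_loads))
```

Under these assumptions the job needs 20 slots: 10 of them hold a source and a sink subtask, and 10 hold only a sink subtask. Placing one slot of each kind on every TaskManager gives 10 TaskManagers with 3 tasks each. An unbalanced placement could instead put two 2-task slots on one TaskManager (4 tasks) and two 1-task slots on another (2 tasks).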
h3. 3.1.2 Failover scenario test
h4. 3.1.2.1 Failover scenario test triggered by tasks
* Test case code
** Entry-point
class:{_}flip370.testing.slotsharinggroup.single.FlinkTestingJob{_}
** Code description: FlinkTestingJob describes a streaming job that contains a
default slot sharing group. The job includes a source operator with a
parallelism of 10 and a sink operator with a parallelism of 20.
** Job-level startup parameters: pass 300000 as the argument to the Flink job
entry class so that subtask 0 of the source operator throws a task exception
every 5 minutes (300000 ms), which triggers a job failover:
_flip370.testing.slotsharinggroup.single.FlinkTestingJob 300000_
* Description of the necessary configurations for an application cluster
** _taskmanager.load-balance.mode: TASKS_
** _taskmanager.numberOfTaskSlots: 2_
** _restart-strategy.fixed-delay.attempts: 32_
** _restart-strategy.fixed-delay.delay: 10s_
** _jobmanager.scheduler: Adaptive/Default_
* Submit the Flink job.
* Wait for the task exception to trigger a failover.
* *{color:#de350b}Verification results{color}*
** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
*** *{color:#de350b}Does it meet the balanced scheduling result (each task
manager contains 3 tasks)?{color}*
** *{color:#de350b}For jobmanager.scheduler: Default{color}*
*** *{color:#de350b}Does it meet the balanced scheduling result (each task
manager contains 3 tasks)?{color}*
h4. 3.1.2.2 Failover scenario test triggered by TaskManagers
* Test case code
** Entry-point
class: {_}flip370.testing.slotsharinggroup.single.FlinkTestingJob{_}
** Code description: FlinkTestingJob describes a streaming job that contains a
default slot sharing group. The job includes a source operator with a
parallelism of 10 and a sink operator with a parallelism of 20.
** Job-level startup parameters: N.A.
* Description of the necessary configurations for an application cluster
** _taskmanager.load-balance.mode: TASKS_
** _taskmanager.numberOfTaskSlots: 2_
** _restart-strategy.fixed-delay.attempts: 32_
** _restart-strategy.fixed-delay.delay: 10s_
** _jobmanager.scheduler: Adaptive/Default_
* Submit the Flink job.
* How to simulate TaskManager-level failures?
** Manually kill one or more TaskManager instances/containers in the job
cluster.
* Wait for the failover to complete.
* {color:#de350b}*Verification results*{color}
** {color:#de350b}*For jobmanager.scheduler: Adaptive*{color}
*** {color:#de350b}*Does it meet the balanced scheduling result (each task
manager contains 3 tasks)?*{color}
** {color:#de350b}*For jobmanager.scheduler: Default*{color}
*** {color:#de350b}*Does it meet the balanced scheduling result (each task
manager contains 3 tasks)?*{color}
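
The verification question above can be checked mechanically once the task placements have been written down. Below is a minimal sketch, assuming the placements are recorded by hand as (task name, TaskManager ID) pairs. The task names, the IDs such as "tm-a", and the helper functions are hypothetical and not part of Flink.

```python
from collections import Counter


def tasks_per_taskmanager(placements):
    """Count tasks per TaskManager from (task_name, taskmanager_id) pairs."""
    return Counter(tm for _task, tm in placements)


def is_balanced(placements, expected_per_tm=3):
    """True when every listed TaskManager hosts exactly expected_per_tm tasks."""
    counts = tasks_per_taskmanager(placements)
    return bool(counts) and all(n == expected_per_tm for n in counts.values())


# Hypothetical placements noted down after the failover has completed.
observed = [
    ("Source (1/10)", "tm-a"), ("Sink (1/20)", "tm-a"), ("Sink (11/20)", "tm-a"),
    ("Source (2/10)", "tm-b"), ("Sink (2/20)", "tm-b"), ("Sink (12/20)", "tm-b"),
]
# Same tasks, but one sink subtask moved from tm-b to tm-a (4 tasks versus 2).
skewed = observed[:5] + [("Sink (12/20)", "tm-a")]

print(is_balanced(observed), is_balanced(skewed))
```

A TaskManager that hosts no tasks never appears in the recorded pairs, so the number of distinct TaskManager IDs should also be compared with the number of running TaskManagers.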
h2. 3.2 Test for a job that contains multiple slot sharing groups
h3. 3.2.1 Regular job test
* Test case code
** Entry-point
class: {_}flip370.testing.slotsharinggroup.multiple.FlinkTestingJob{_}
** Code description: FlinkTestingJob describes a streaming job that contains a
default slot sharing group and an ssg2 slot sharing group. Each slot sharing
group contains a source operator with a parallelism of 10 and a sink operator
with a parallelism of 20.
** Job-level startup parameters: N.A.
* Description of the necessary configurations for an application cluster
** _taskmanager.load-balance.mode: TASKS_
** _taskmanager.numberOfTaskSlots: 2_
** _restart-strategy.fixed-delay.attempts: 32_
** _restart-strategy.fixed-delay.delay: 10s_
** _jobmanager.scheduler: Adaptive/Default_
* Submit the Flink job.
* *{color:#de350b}Verification results{color}*
** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
*** *{color:#de350b}Does it meet the balanced scheduling result (each task
manager contains 3 tasks)?{color}*
** *{color:#de350b}For jobmanager.scheduler: Default{color}*
*** *{color:#de350b}Does it meet the balanced scheduling result (each task
manager contains 3 tasks)?{color}*
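
With two slot sharing groups the totals double, but the per-TaskManager expectation stays the same. Below is a minimal sketch of the calculation, assuming each slot sharing group needs as many slots as its highest operator parallelism and that TaskManagers are fully packed. The function name and the dictionary layout are illustrative only.

```python
import math


def expected_layout(groups, slots_per_tm):
    """Return (slots, taskmanagers, tasks) for a job.

    groups maps a slot sharing group name to its operator parallelisms.
    Assumption: each group needs max(parallelisms) slots.
    """
    slots = sum(max(ps) for ps in groups.values())
    tasks = sum(sum(ps) for ps in groups.values())
    return slots, math.ceil(slots / slots_per_tm), tasks


# Two groups, each with a source (parallelism 10) and a sink (parallelism 20).
groups = {"default": [10, 20], "ssg2": [10, 20]}
slots, tms, tasks = expected_layout(groups, slots_per_tm=2)
print(slots, tms, tasks, tasks / tms)
```

Under these assumptions the job occupies 40 slots on 20 TaskManagers and runs 60 tasks, which again averages 3 tasks per TaskManager.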
h3. 3.2.2 Failover scenario test
h4. 3.2.2.1 Failover scenario test triggered by tasks
* Test case code
** Entry-point
class: {_}flip370.testing.slotsharinggroup.multiple.FlinkTestingJob{_}
** Code description: FlinkTestingJob describes a streaming job that contains a
default slot sharing group and an ssg2 slot sharing group. Each slot sharing
group contains a source operator with a parallelism of 10 and a sink operator
with a parallelism of 20.
** Job-level startup parameters: pass 300000 as the argument to the Flink job
entry class so that subtask 0 of the source operator throws a task exception
every 5 minutes (300000 ms), which triggers a job failover:
_flip370.testing.slotsharinggroup.multiple.FlinkTestingJob 300000_
* Description of the necessary configurations for an application cluster
** _taskmanager.load-balance.mode: TASKS_
** _taskmanager.numberOfTaskSlots: 2_
** _restart-strategy.fixed-delay.attempts: 32_
** _restart-strategy.fixed-delay.delay: 10s_
** _jobmanager.scheduler: Adaptive/Default_
* Submit the Flink job.
* Wait for the task exception to trigger a failover.
* {color:#de350b}*Verification results*{color}
** {color:#de350b}*For jobmanager.scheduler: Adaptive*{color}
*** {color:#de350b}*Does it meet the balanced scheduling result (each task
manager contains 3 tasks)?*{color}
** {color:#de350b}*For jobmanager.scheduler: Default*{color}
*** {color:#de350b}*Does it meet the balanced scheduling result (each task
manager contains 3 tasks)?*{color}
h4. 3.2.2.2 Failover scenario test triggered by TaskManagers
* Test case code
** Entry-point
class: {_}flip370.testing.slotsharinggroup.multiple.FlinkTestingJob{_}
** Code description: FlinkTestingJob describes a streaming job that contains a
default slot sharing group and an ssg2 slot sharing group. Each slot sharing
group contains a source operator with a parallelism of 10 and a sink operator
with a parallelism of 20.
** Job-level startup parameters: N.A.
* Description of the necessary configurations for an application cluster
** _taskmanager.load-balance.mode: TASKS_
** _taskmanager.numberOfTaskSlots: 2_
** _restart-strategy.fixed-delay.attempts: 32_
** _restart-strategy.fixed-delay.delay: 10s_
** _jobmanager.scheduler: Adaptive/Default_
* Submit the Flink job.
* How to simulate TaskManager-level failures?
** Manually kill one or more TaskManager instances/containers in the job
cluster.
* Wait for the failover to complete.
* *{color:#de350b}Verification results{color}*
** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
*** *{color:#de350b}Does it meet the balanced scheduling result (each task
manager contains 3 tasks)?{color}*
** *{color:#de350b}For jobmanager.scheduler: Default{color}*
*** *{color:#de350b}Does it meet the balanced scheduling result (each task
manager contains 3 tasks)?{color}*
Ping [~RocMarshal] if there are any issues during testing.
> Release Testing: Verify FLIP-370: Support Balanced Tasks Scheduling
> -------------------------------------------------------------------
>
> Key: FLINK-38687
> URL: https://issues.apache.org/jira/browse/FLINK-38687
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Reporter: RocMarshal
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)