[jira] [Commented] (FLINK-38687) Release Testing: Verify FLIP-370: Support Balanced Tasks Scheduling

Ferenc Csaky (Jira) Mon, 24 Nov 2025 05:09:06 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-38687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040325#comment-18040325
 ]


Ferenc Csaky commented on FLINK-38687:
--------------------------------------

I agree with [~RocMarshal] about marking this as a non-blocker. Creating a 
follow-up ticket for it makes sense.

There is not much I can add on how to reproduce, I already included every 
relevant information in my test comment, but to synthesize the key points in a 
more easy-to-read context:
# Test job: [https://github.com/RocMarshal/flip370-testing-jobs] 
({{flip370.testing.slotsharinggroup.multiple.FlinkTestingJob}})
# Change parallelism: 10 -> 3; 20 -> 6;
# Add fixed delay restart strategy (10s delay, 100 restart attempts)
# Apply config to the exec env
# Build with JDK17: {{mvn clean package}}
# {{./bin/start-cluster.sh}}
# Start 8 more TMs to have 9 2slot TMs in total: {{./bin/taskmanager.sh start}} 
8x
# Depoy job: {{flink run -d -c 
flip370.testing.slotsharinggroup.multiple.FlinkTestingJob 
/path/to/flink-demo.jar}}
# Get TM PIDs: {{ps aux | grep TaskManagerRunner}}
# Start killing TMs (1 or more via {{kill -9 <pid>}})
# Observe the task distribution on the UI
# Spin up more TMs if there is no more free resource available to be able to 
continue the kill -> failover loop

About the UI points, it is definitely not tied to this feature, I just 
expressed my general pain. :) Starting a discussion and possibly a FLIP about 
it seems to be the way to go forward with it.

> Release Testing: Verify FLIP-370: Support Balanced Tasks Scheduling
> -------------------------------------------------------------------
>
>                 Key: FLINK-38687
>                 URL: https://issues.apache.org/jira/browse/FLINK-38687
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: RocMarshal
>            Assignee: Ferenc Csaky
>            Priority: Major
>             Fix For: 2.2.0
>
>         Attachments: Screenshot 2025-11-22 at 9.17.03.png, Screenshot 
> 2025-11-22 at 9.17.27.png, Screenshot 2025-11-22 at 9.17.54.png
>
>
> The original testing guide doc is here : 
> [https://docs.google.com/document/d/1ZXSwtwGeSxy8L2AHdpTnumhXNWWisho_a8dcxRYSvsk/edit?tab=t.0#heading=h.1vcje3u1wogz]
> And the content as follows:
> h1. 1 Motivation
> This document primarily introduces the core working principles of the 
> functionality introduced by Flip-370, as well as the key test cases that 
> cross-team testing should focus on to verify the correctness of the feature.
> h1. 2 You may need to be familiar with the core logic of balanced scheduling
> Please refer to this 
> [page|https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/tasks-scheduling/balanced_tasks_scheduling/].
>  
> h1. 3 Constructing and validating test cases
> As stated in the [FLIP|https://cwiki.apache.org/confluence/x/U56zDw] 
> document, task balanced scheduling is based on the SlotPool perspective of 
> the JobMaster to perform balanced task scheduling for a job. Therefore, all 
> test cases in this test can be verified under the application execution mode 
> (regardless of whether resources come from onYarn/onKubernetes).
> Testing jobs: [https://github.com/RocMarshal/flip370-testing-jobs] 
>  
> h2. 3.1 Test for a job that contains a slot sharing group
> h3. 3.1.1 Regular job test
>  * Test case code
>  * Entry-point 
> class：{_}flip370.testing.slotsharinggroup.single.FlinkTestingJob{_}
>  * Code description: FlinkTestingJob describes a streaming job that contains 
> a default slot sharing group. The job includes a source operator with a 
> parallelism of 10 and a sink operator with a parallelism of 20.
>  * Job-level startup parameters：N.A
>  * Description of the necessary configurations for an application cluster
>  * 
>  ** 
>  *** _taskmanager.load-balance.mode: TASKS_
>  * 
>  ** 
>  *** _taskmanager.numberOfTaskSlots: 2_
>  * 
>  ** 
>  *** _restart-strategy.fixed-delay.attempts: 32_
>  * 
>  ** 
>  *** _restart-strategy.fixed-delay.delay: 10s_
>  * 
>  ** 
>  *** _jobmanager.scheduler: Adaptive/Default_ 
>  * Submit the flink job.
>  * Verification results
>  ** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
>  *** Obtain the taskmanager on which each task is located through the 
> following steps 
>  **** !image-2025-11-17-11-33-45-545.png!
>  * 
>  ** 
>  *** {color:#de350b}*Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)？*{color}
>  * 
>  ** *{color:#de350b}For jobmanager.scheduler: Default{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)？{color}*
> h3. 3.1.2 Failover scenario test
> h4. 3.1.2.1 Failover scenario test triggered by tasks
>  * Test case code
>  ** Entry-point 
> class：{_}flip370.testing.slotsharinggroup.single.FlinkTestingJob{_}
>  ** Code description: FlinkTestingJob describes a streaming job that contains 
> a default slot sharing group. The job includes a source operator with a 
> parallelism of 10 and a sink operator with a parallelism of 20.
>  ** Job-level startup parameters are as follows: pass 300000 as a parameter 
> to the Flink job entry class, which indicates that the 0th subtask of the 
> source operator will throw a task exception every 5 minutes to trigger a job 
> failover:  _flip370.testing.slotsharinggroup.single.FlinkTestingJob 300000_
>  * Description of the necessary configurations for an application cluster
>  * 
>  ** 
>  *** _taskmanager.load-balance.mode: TASKS_
>  * 
>  ** 
>  *** _taskmanager.numberOfTaskSlots: 2_
>  * 
>  ** 
>  *** _restart-strategy.fixed-delay.attempts: 32_
>  * 
>  ** 
>  *** _restart-strategy.fixed-delay.delay: 10s_
>  * 
>  ** 
>  *** _jobmanager.scheduler: Adaptive/Default_ 
>  * Submit the flink job.
>  * Wait the task's exception for failover
>  * *{color:#de350b}Verification results{color}*
>  ** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
>  ** *{color:#de350b}For jobmanager.scheduler: Default{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
> h4. 3.1.2.2 Failover scenario test triggered by TaskManagers
>  * Test case code
>  * Entry-point 
> class：{_}flip370.testing.slotsharinggroup.single.FlinkTestingJob{_}
>  * Code description: FlinkTestingJob describes a streaming job that contains 
> a default slot sharing group. The job includes a source operator with a 
> parallelism of 10 and a sink operator with a parallelism of 20.
>  * Job-level startup parameters: N.A.
>  * Description of the necessary configurations for an application cluster
>  * 
>  ** 
>  *** _taskmanager.load-balance.mode: TASKS_
>  * 
>  ** 
>  *** _taskmanager.numberOfTaskSlots: 2_
>  * 
>  ** 
>  *** _restart-strategy.fixed-delay.attempts: 32_
>  * 
>  ** 
>  *** _restart-strategy.fixed-delay.delay: 10s_
>  * 
>  ** 
>  *** _jobmanager.scheduler: Adaptive/Default_ 
>  * Submit the flink job.
>  * How to simulate TaskManager-level failures？
>  ** Manually kill any one or more TaskManager instances/containers in the job 
> cluster.
>  * Wait for failover completed.
>  * {color:#de350b}*Verification results*{color}
>  ** {color:#de350b}*For jobmanager.scheduler: Adaptive*{color}
>  *** {color:#de350b}*Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?*{color}
>  ** {color:#de350b}*For jobmanager.scheduler: Default*{color}
>  *** {color:#de350b}*Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?*{color}
> h2. 3.2 Test for a job that contains multiple slot sharing groups
> h3. 3.2.1 Regular job test
>  * Test case code
>  * Entry-point 
> class：{_}flip370.testing.slotsharinggroup.multiple.FlinkTestingJob{_}
>  * Code description: FlinkTestingJob describes a streaming job that contains 
> a default slot sharing group and an ssg2 slot sharing group. Each slot 
> sharing group contains a source operator with a parallelism of 10 and a sink 
> operator with a parallelism of 20.
>  * Job-level startup parameters: N.A.
>  * Description of the necessary configurations for an application cluster
>  * 
>  ** 
>  *** _taskmanager.load-balance.mode: TASKS_
>  * 
>  ** 
>  *** _taskmanager.numberOfTaskSlots: 2_
>  * 
>  ** 
>  *** _restart-strategy.fixed-delay.attempts: 32_
>  * 
>  ** 
>  *** _restart-strategy.fixed-delay.delay: 10s_
>  * 
>  ** 
>  *** _jobmanager.scheduler: Adaptive/Default_ 
>  * Submit the flink job.
>  * *{color:#de350b}Verification results{color}*
>  ** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
>  ** *{color:#de350b}For jobmanager.scheduler: Default{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
> h3. 3.2.2 Failover scenario test
> h4. 3.2.2.1  Failover scenario test triggered by tasks
>  * Test case code
>  * Entry-point 
> class：{_}flip370.testing.slotsharinggroup.multiple.FlinkTestingJob{_}
>  * Code description: FlinkTestingJob describes a streaming job that contains 
> a default slot sharing group and an ssg2 slot sharing group. Each slot 
> sharing group contains a source operator with a parallelism of 10 and a sink 
> operator with a parallelism of 20.
>  * Job-level startup parameters are as follows: pass 300000 as a parameter to 
> the Flink job entry class, which indicates that the 0th subtask of the source 
> operator will throw a task exception every 5 minutes to trigger a job 
> failover:  _flip370.testing.slotsharinggroup.single.FlinkTestingJob 300000_
>  * Description of the necessary configurations for an application cluster
>  * 
>  ** 
>  *** _taskmanager.load-balance.mode: TASKS_
>  * 
>  ** 
>  *** _taskmanager.numberOfTaskSlots: 2_
>  * 
>  ** 
>  *** _restart-strategy.fixed-delay.attempts: 32_
>  * 
>  ** 
>  *** _restart-strategy.fixed-delay.delay: 10s_
>  * 
>  ** 
>  *** _jobmanager.scheduler: Adaptive/Default_ 
>  * Submit the flink job.
>  * Wait the task's exception for failover
>  * {color:#de350b}*Verification results*{color}
>  ** {color:#de350b}*For jobmanager.scheduler: Adaptive*{color}
>  *** {color:#de350b}*Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?*{color}
>  ** {color:#de350b}*For jobmanager.scheduler: Default*{color}
>  *** {color:#de350b}*Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?*{color}
> h4. 3.2.2.2 Failover scenario test triggered by TaskManagers
>  * Test case code
>  * Entry-point 
> class：{_}flip370.testing.slotsharinggroup.multiple.FlinkTestingJob{_}
>  * Code description: FlinkTestingJob describes a streaming job that contains 
> a default slot sharing group and an ssg2 slot sharing group. Each slot 
> sharing group contains a source operator with a parallelism of 10 and a sink 
> operator with a parallelism of 20.
>  * Job-level startup parameters: N.A.
>  * Description of the necessary configurations for an application cluster
>  * 
>  ** 
>  *** _taskmanager.load-balance.mode: TASKS_
>  * 
>  ** 
>  *** _taskmanager.numberOfTaskSlots: 2_
>  * 
>  ** 
>  *** _restart-strategy.fixed-delay.attempts: 32_
>  * 
>  ** 
>  *** _restart-strategy.fixed-delay.delay: 10s_
>  * 
>  ** 
>  *** _jobmanager.scheduler: Adaptive/Default_ 
>  * Submit the flink job.
>  * How to simulate TaskManager-level failures？
>  ** Manually kill any one or more TaskManager instances/containers in the job 
> cluster.
>  * Wait for failover completed.
>  * *{color:#de350b}Verification results{color}*
>  ** *{color:#de350b}For jobmanager.scheduler: Adaptive{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
>  ** *{color:#de350b}For jobmanager.scheduler: Default{color}*
>  *** *{color:#de350b}Does it meet the balanced scheduling result (each task 
> manager contains 3 tasks)?{color}*
> Ping [~RocMarshal] if there’re any issues during the testing 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-38687) Release Testing: Verify FLIP-370: Support Balanced Tasks Scheduling

Reply via email to