Feifan Wang created FLINK-36743:
-----------------------------------

             Summary: Rescale from unaligned checkpoint failed
                 Key: FLINK-36743
                 URL: https://issues.apache.org/jira/browse/FLINK-36743
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
            Reporter: Feifan Wang

We encountered the following exception when scaling down a job from a parallelism of 5600 to 4200:

{code:java}
2024-11-12 19:20:54,308 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: xxxxxx (1358/1400) (80ea0855521cb3249d011e3166823e47_56a38c81905da002db3a9d8f9d395f2b_1357_0) switched from RUNNING to FAILED on container_e33_1725519807238_6894116_01_000825 @ yg-
java.lang.IllegalStateException: Cannot select SubtaskConnectionDescriptor{inputSubtaskIndex=0, outputSubtaskIndex=4071}; known channels are [SubtaskConnectionDescriptor{inputSubtaskIndex=1357, outputSubtaskIndex=0}, SubtaskConnectionDescriptor{inputSubtaskIndex=1357, outputSubtaskIndex=4200}]
    at org.apache.flink.streaming.runtime.io.recovery.DemultiplexingRecordDeserializer.select(DemultiplexingRecordDeserializer.java:121) ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT]
    at org.apache.flink.streaming.runtime.io.recovery.RescalingStreamTaskNetworkInput.processEvent(RescalingStreamTaskNetworkInput.java:181) ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT]
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:118) ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT]
    at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542) ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831) ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780) ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT]
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:937) ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT]
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:916) ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT]
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:730) ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT]
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) ~[flink-dist-1.16.1-mt001-SNAPSHOT.jar:1.16.1-mt001-SNAPSHOT]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_312]
{code}

 * Flink version: 1.16.1
 * unaligned checkpoint: enabled
 * log-based checkpoint: enabled

The exception occurs when restoring from chk-2718336; restoring from chk-2718333 succeeds.

It looks like something goes wrong with the unaligned checkpoint when in-flight data is assigned to subtasks during rescaling.

Could you please take a look? [~arvid], [~pnowojski]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
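
To make the failure mode in the trace easier to discuss, here is a minimal, self-contained sketch of the kind of lookup the exception message implies: a receiver holds a fixed set of known (inputSubtaskIndex, outputSubtaskIndex) pairs and fails when an incoming event names a pair outside that set. This is a simplified model, not Flink's implementation; `ChannelKey` and `VirtualChannelSelector` are hypothetical names introduced only for illustration, and the index values are copied from the log above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for an (inputSubtaskIndex, outputSubtaskIndex) pair. */
final class ChannelKey {
    final int inputSubtaskIndex;
    final int outputSubtaskIndex;

    ChannelKey(int inputSubtaskIndex, int outputSubtaskIndex) {
        this.inputSubtaskIndex = inputSubtaskIndex;
        this.outputSubtaskIndex = outputSubtaskIndex;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof ChannelKey)) {
            return false;
        }
        ChannelKey other = (ChannelKey) o;
        return inputSubtaskIndex == other.inputSubtaskIndex
                && outputSubtaskIndex == other.outputSubtaskIndex;
    }

    @Override
    public int hashCode() {
        return Objects.hash(inputSubtaskIndex, outputSubtaskIndex);
    }

    @Override
    public String toString() {
        return "ChannelKey{inputSubtaskIndex=" + inputSubtaskIndex
                + ", outputSubtaskIndex=" + outputSubtaskIndex + "}";
    }
}

/** Simplified model (an assumption, not Flink code): one buffer per known key. */
final class VirtualChannelSelector {
    private final Map<ChannelKey, StringBuilder> channels = new LinkedHashMap<>();

    VirtualChannelSelector(ChannelKey... knownKeys) {
        for (ChannelKey key : knownKeys) {
            channels.put(key, new StringBuilder());
        }
    }

    /** Returns the buffer for a known key; rejects any key that was never registered. */
    StringBuilder select(ChannelKey key) {
        StringBuilder channel = channels.get(key);
        if (channel == null) {
            throw new IllegalStateException(
                    "Cannot select " + key + "; known channels are " + channels.keySet());
        }
        return channel;
    }
}
```

In this model, a selector built with the two known pairs from the log, (1357, 0) and (1357, 4200), accepts either of them but throws `IllegalStateException` for (0, 4071), which matches neither. In the reported case the requested descriptor differs from the known ones in both indices.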