[ https://issues.apache.org/jira/browse/FLINK-27031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539477#comment-17539477 ]
Anton Kalashnikov commented on FLINK-27031: ------------------------------------------- As I can see we setup Virtual Channels inside `StreamTaskNetworkInputFactory#create` when `InflightDataRescalingDescriptor` is not equal to 'NO_RESCALE'. It happens only if the subtask receives the 'JobManagerTaskRestore'(see 'TaskStateManagerImpl#getInputRescalingDescriptor'). So if the subtask doesn't receive 'JobManagerTaskRestore' but it receives the old state which should be filtered then we have an error described in this ticket. As I understand, It is only possible when the subtask doesn't have any states for restoring, but this task's upstream has output channel states. According to current logic(see last 'for' in 'StateAssignmentOperation#assignStates'), we assign states only based on states of current subtask and ignore its upstream states which actually leads to the described problem. So I think we can expand the condition for assigning the states by including upstream output channel states check.(which I have done in https://github.com/apache/flink/pull/19702). Or we can think of another fix that should somehow notify the subtask about creating the Virtual Channels. > ChangelogRescalingITCase.test failed due to IllegalStateException > ----------------------------------------------------------------- > > Key: FLINK-27031 > URL: https://issues.apache.org/jira/browse/FLINK-27031 > Project: Flink > Issue Type: Bug > Components: Runtime / Network, Runtime / State Backends > Affects Versions: 1.15.0, 1.16.0 > Reporter: Matthias Pohl > Assignee: Anton Kalashnikov > Priority: Critical > Labels: pull-request-available > > [This > build|https://dev.azure.com/mapohl/flink/_build/results?buildId=923&view=logs&j=cc649950-03e9-5fae-8326-2f1ad744b536&t=a9a20597-291c-5240-9913-a731d46d6dd1&l=12961] > failed in {{ChangelogRescalingITCase.test}}: > {code} > Apr 01 20:26:53 Caused by: java.lang.IllegalArgumentException: Key group 94 > is not in KeyGroupRange{startKeyGroup=96, endKeyGroup=127}. Unless you're > directly using low level state access APIs, this is most likely caused by > non-deterministic shuffle key (hashCode and equals implementation). > Apr 01 20:26:53 at > org.apache.flink.runtime.state.KeyGroupRangeOffsets.newIllegalKeyGroupException(KeyGroupRangeOffsets.java:37) > Apr 01 20:26:53 at > org.apache.flink.runtime.state.heap.StateTable.getMapForKeyGroup(StateTable.java:305) > Apr 01 20:26:53 at > org.apache.flink.runtime.state.heap.StateTable.get(StateTable.java:261) > Apr 01 20:26:53 at > org.apache.flink.runtime.state.heap.StateTable.get(StateTable.java:143) > Apr 01 20:26:53 at > org.apache.flink.runtime.state.heap.HeapListState.add(HeapListState.java:94) > Apr 01 20:26:53 at > org.apache.flink.state.changelog.ChangelogListState.add(ChangelogListState.java:78) > Apr 01 20:26:53 at > org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement(WindowOperator.java:404) > Apr 01 20:26:53 at > org.apache.flink.streaming.runtime.tasks.ChainingOutput.pushToOperator(ChainingOutput.java:99) > Apr 01 20:26:53 at > org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:80) > Apr 01 20:26:53 at > org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:39) > Apr 01 20:26:53 at > org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:56) > Apr 01 20:26:53 at > org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:29) > Apr 01 20:26:53 at > org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38) > Apr 01 20:26:53 at > org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233) > Apr 01 20:26:53 at > org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134) > Apr 01 20:26:53 at > org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105) > Apr 01 20:26:53 at > org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) > Apr 01 20:26:53 at > org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:531) > Apr 01 20:26:53 at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:227) > Apr 01 20:26:53 at > org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:841) > Apr 01 20:26:53 at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767) > Apr 01 20:26:53 at > org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948) > Apr 01 20:26:53 at > org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927) > Apr 01 20:26:53 at > org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741) > Apr 01 20:26:53 at > org.apache.flink.runtime.taskmanager.Task.run(Task.java:563) > Apr 01 20:26:53 at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)