[ 
https://issues.apache.org/jira/browse/FLINK-27031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539477#comment-17539477
 ] 

Anton Kalashnikov commented on FLINK-27031:
-------------------------------------------

As far as I can see, we set up Virtual Channels inside 
`StreamTaskNetworkInputFactory#create` when the 
`InflightDataRescalingDescriptor` is not equal to `NO_RESCALE`. This happens 
only if the subtask receives a `JobManagerTaskRestore` (see 
`TaskStateManagerImpl#getInputRescalingDescriptor`). So if the subtask doesn't 
receive a `JobManagerTaskRestore` but still receives old state that should be 
filtered out, we get the error described in this ticket. 
As I understand it, this is only possible when the subtask has no state of its 
own to restore but its upstream task has output channel state. According to the 
current logic (see the last `for` loop in 
`StateAssignmentOperation#assignStates`), we assign state based only on the 
state of the current subtask and ignore its upstream state, which leads to the 
described problem. 
So I think we can expand the condition for assigning state to also check for 
upstream output channel state (which I have done in 
https://github.com/apache/flink/pull/19702). Alternatively, we can think of 
another fix that somehow notifies the subtask that it needs to create the 
Virtual Channels. 
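
To make the proposed condition concrete, here is a minimal, self-contained 
sketch. It is not the code from the PR and does not use Flink's real classes: 
`SubtaskState`, `shouldAssignCurrent` and `shouldAssignProposed` are 
hypothetical names that only model the two checks discussed above. 

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the assignment condition; not Flink's actual API.
final class StateAssignmentSketch {

    /** Hypothetical stand-in for what a subtask could restore from a checkpoint. */
    static final class SubtaskState {
        final boolean hasOwnState;           // the subtask itself has state to restore
        final boolean hasOutputChannelState; // in-flight data persisted on its outputs

        SubtaskState(boolean hasOwnState, boolean hasOutputChannelState) {
            this.hasOwnState = hasOwnState;
            this.hasOutputChannelState = hasOutputChannelState;
        }
    }

    /** Behaviour as described above: only the subtask's own state is considered. */
    static boolean shouldAssignCurrent(SubtaskState self) {
        return self.hasOwnState;
    }

    /** Expanded condition: also assign when any upstream has output channel state. */
    static boolean shouldAssignProposed(SubtaskState self, List<SubtaskState> upstreams) {
        if (self.hasOwnState) {
            return true;
        }
        for (SubtaskState upstream : upstreams) {
            if (upstream.hasOutputChannelState) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The problematic case: a stateless subtask whose upstream has in-flight data.
        SubtaskState self = new SubtaskState(false, false);
        List<SubtaskState> upstreams =
                Arrays.asList(new SubtaskState(true, true), new SubtaskState(false, false));

        System.out.println("current:  " + shouldAssignCurrent(self));
        System.out.println("proposed: " + shouldAssignProposed(self, upstreams));

        // With no upstream channel state, both conditions agree.
        System.out.println(
                "no upstream: "
                        + shouldAssignProposed(self, Collections.<SubtaskState>emptyList()));
    }
}
```

In this model the stateless subtask is skipped by the current check but picked 
up by the expanded one, which is the case where the subtask would otherwise 
never learn that it has to create the Virtual Channels. 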

> ChangelogRescalingITCase.test failed due to IllegalStateException
> -----------------------------------------------------------------
>
>                 Key: FLINK-27031
>                 URL: https://issues.apache.org/jira/browse/FLINK-27031
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network, Runtime / State Backends
>    Affects Versions: 1.15.0, 1.16.0
>            Reporter: Matthias Pohl
>            Assignee: Anton Kalashnikov
>            Priority: Critical
>              Labels: pull-request-available
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=923&view=logs&j=cc649950-03e9-5fae-8326-2f1ad744b536&t=a9a20597-291c-5240-9913-a731d46d6dd1&l=12961]
>  failed in {{ChangelogRescalingITCase.test}}:
> {code}
> Apr 01 20:26:53 Caused by: java.lang.IllegalArgumentException: Key group 94 
> is not in KeyGroupRange{startKeyGroup=96, endKeyGroup=127}. Unless you're 
> directly using low level state access APIs, this is most likely caused by 
> non-deterministic shuffle key (hashCode and equals implementation).
> Apr 01 20:26:53       at 
> org.apache.flink.runtime.state.KeyGroupRangeOffsets.newIllegalKeyGroupException(KeyGroupRangeOffsets.java:37)
> Apr 01 20:26:53       at 
> org.apache.flink.runtime.state.heap.StateTable.getMapForKeyGroup(StateTable.java:305)
> Apr 01 20:26:53       at 
> org.apache.flink.runtime.state.heap.StateTable.get(StateTable.java:261)
> Apr 01 20:26:53       at 
> org.apache.flink.runtime.state.heap.StateTable.get(StateTable.java:143)
> Apr 01 20:26:53       at 
> org.apache.flink.runtime.state.heap.HeapListState.add(HeapListState.java:94)
> Apr 01 20:26:53       at 
> org.apache.flink.state.changelog.ChangelogListState.add(ChangelogListState.java:78)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement(WindowOperator.java:404)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.runtime.tasks.ChainingOutput.pushToOperator(ChainingOutput.java:99)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:80)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:39)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:56)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:29)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:531)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:227)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:841)
> Apr 01 20:26:53       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767)
> Apr 01 20:26:53       at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948)
> Apr 01 20:26:53       at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927)
> Apr 01 20:26:53       at 
> org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741)
> Apr 01 20:26:53       at 
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
> Apr 01 20:26:53       at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
