Igal Shilman created FLINK-21642:
------------------------------------

             Summary: RequestReplyFunction recovery fails with a remote SDK
                 Key: FLINK-21642
                 URL: https://issues.apache.org/jira/browse/FLINK-21642
             Project: Flink
          Issue Type: Bug
          Components: Stateful Functions
            Reporter: Igal Shilman

While extending our smoke e2e test to use the remote SDKs, I've stumbled upon a bug in the RequestReplyFunction: we get an unknown state exception after recovery.

The exact scenario that triggers the bug is:

1. There was a request in flight.
2. A failure occurs that causes the job to restart.
3. On restore, we start with no managed state.
4. But we try to re-send exactly the same ToFunction message to the SDK.
5. That ToFunction contains state definitions from the previous attempt (before the failure).
6. The SDK processes this message normally (it has all the state definitions that it knows).
7. The SDK responds with a state mutation.
8. The PersistedRemoteFunctionValues fails with unknown state.

We need to treat the ToFunction message as a retryBatch, instead of sending it as-is.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
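
The recovery scenario described in this issue can be sketched as a toy model. This is a minimal illustration, not the Stateful Functions code: `ManagedState`, `sdk_handle`, `send`, `resend_as_is`, and `resend_as_retry_batch` are hypothetical names, the `ToFunction` class here is a simplified stand-in for the real message, and the "incomplete context" reply is an assumption about how an SDK asks the runtime to register missing state.

```python
from dataclasses import dataclass, field


class UnknownStateError(Exception):
    """Raised when a mutation targets a state name that was never registered."""


@dataclass
class ManagedState:
    """Toy stand-in for the runtime-side store of registered remote state."""
    values: dict = field(default_factory=dict)

    def register(self, name):
        self.values.setdefault(name, None)

    def apply_mutation(self, name, value):
        if name not in self.values:
            raise UnknownStateError(name)
        self.values[name] = value


@dataclass
class ToFunction:
    """Toy request: the state names the runtime claims to know, plus a payload."""
    known_states: tuple
    payload: str


def sdk_handle(request, required_state="seen_count"):
    """Toy remote SDK: asks for missing state, otherwise returns a mutation."""
    if required_state not in request.known_states:
        return {"incomplete_context": [required_state]}
    return {"mutations": {required_state: 1}}


def send(state, request):
    """Send a request and apply the SDK's response to the managed state."""
    response = sdk_handle(request)
    if "incomplete_context" in response:
        # Register what the SDK asked for, then retry with a rebuilt request.
        for name in response["incomplete_context"]:
            state.register(name)
        return send(state, ToFunction(tuple(state.values), request.payload))
    for name, value in response["mutations"].items():
        state.apply_mutation(name, value)
    return state.values


def resend_as_is(state, request):
    """Buggy path: replay the stale request, including its old state names."""
    return send(state, request)


def resend_as_retry_batch(state, request):
    """Proposed path: rebuild the request from the state the runtime has now."""
    return send(state, ToFunction(tuple(state.values), request.payload))


# The request that was in flight before the failure (steps 1 and 5).
in_flight = ToFunction(known_states=("seen_count",), payload="hello")
```

Calling `resend_as_is(ManagedState(), in_flight)` raises `UnknownStateError` in this model, because the SDK trusts the stale state names and replies with a mutation the restored runtime cannot apply. `resend_as_retry_batch(ManagedState(), in_flight)` instead sends a request with no state names, lets the SDK declare what it needs, and then applies the mutation.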