[ https://issues.apache.org/jira/browse/FLINK-21642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-21642:
-----------------------------------
    Labels: pull-request-available  (was: )

> RequestReplyFunction recovery fails with a remote SDK
> -----------------------------------------------------
>
>                 Key: FLINK-21642
>                 URL: https://issues.apache.org/jira/browse/FLINK-21642
>             Project: Flink
>          Issue Type: Bug
>          Components: Stateful Functions
>            Reporter: Igal Shilman
>            Priority: Major
>              Labels: pull-request-available
>
> While extending our smoke e2e test to use the remote SDKs I've stumbled upon
> a bug in the RequestReplyFunction. We get an unknown state exception after
> recovery.
> The exact scenario that triggers the bug is:
> # There was a request in flight.
> # A failure occurs that causes the job to restart.
> # On restore, we start with no managed state.
> # But we try to re-send to the SDK exactly the same ToFunction message.
> # That ToFunction contains state definitions from the previous attempt
> (before the failure).
> # The SDK processes this message normally (it has all the state definitions
> that it knows).
> # The SDK responds with a state mutation.
> # The PersistedRemoteFunctionValues fails with unknown state.
>
> We need to treat the ToFunction message as a retryBatch, instead of sending
> it as-is.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
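
The failure scenario in the issue can be reproduced with a small toy model. The sketch below is not Stateful Functions code: `PersistedValues`, `ToFunction`, `Reply`, `toy_sdk`, and both `resend_*` helpers are hypothetical stand-ins, and the "report missing states, register them, retry" handshake is an assumption about what treating the message as a retryBatch would involve. It only illustrates why replaying a stale request against an empty state registry fails, while rebuilding the request from the currently registered state does not.

```python
from dataclasses import dataclass, field


class UnknownStateError(Exception):
    """Raised when a mutation targets a state that was never registered."""


@dataclass
class PersistedValues:
    """Toy stand-in for the runtime-side state registry (hypothetical)."""
    values: dict = field(default_factory=dict)

    def register(self, names):
        for name in names:
            self.values.setdefault(name, None)

    def apply(self, mutations):
        for name, value in mutations.items():
            if name not in self.values:
                raise UnknownStateError(name)
            self.values[name] = value


@dataclass(frozen=True)
class ToFunction:
    """A request batch plus the state names attached to it."""
    messages: tuple
    state_names: frozenset


@dataclass(frozen=True)
class Reply:
    """Either a set of state mutations or the state names that were missing."""
    mutations: dict = field(default_factory=dict)
    missing: frozenset = frozenset()


def toy_sdk(request, declared=frozenset({"seen_count"})):
    """Hypothetical remote function that declares one state, `seen_count`."""
    missing = declared - request.state_names
    if missing:
        return Reply(missing=missing)
    return Reply(mutations={"seen_count": len(request.messages)})


def resend_as_is(in_flight, values, sdk=toy_sdk):
    """Replay the stored request unchanged after restore (the buggy path)."""
    reply = sdk(in_flight)
    values.apply(reply.mutations)  # fails: the registry is empty after restore


def resend_as_retry_batch(in_flight, values, sdk=toy_sdk):
    """Rebuild the request from the state the runtime knows right now."""
    request = ToFunction(in_flight.messages, frozenset(values.values))
    reply = sdk(request)
    if reply.missing:
        # The SDK reports states the runtime has not registered yet:
        # register them, then send the same messages again.
        values.register(reply.missing)
        request = ToFunction(in_flight.messages, frozenset(values.values))
        reply = sdk(request)
    values.apply(reply.mutations)


# The in-flight request still carries state definitions from before the failure.
stale = ToFunction(messages=("a", "b"), state_names=frozenset({"seen_count"}))

restored = PersistedValues()  # no managed state after restore
try:
    resend_as_is(stale, restored)
except UnknownStateError as err:
    print("as-is replay failed, unknown state:", err)

restored = PersistedValues()
resend_as_retry_batch(stale, restored)
print("retry-batch replay succeeded:", restored.values)
```

In this model the only difference between the two paths is where the attached state names come from: the stale request in the first case, the live registry in the second.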