We are trying to make sure we handle the right edge cases in our stateful tasks so that our processes can survive failures gracefully.
Given that the changelog will recreate the KV store (state) up to the point in time of the last durable changelog write (Ts), that the checkpoint will restart input from the point in time of the last durable checkpoint write (Ti), and that the output topic will contain messages up to the point in time of the last durable output write (To), our current assumption is that in all recovery cases:

    Ti <= Ts <= To

This means that some input may be "replayed" from the point of view of the KV store, which is handled by normal at-least-once-delivery processing, and that we may duplicate output messages that would have been produced between Ts and To, which is also consistent with at-least-once delivery. However, I cannot find code that backs this assumption, and I'm hoping I've just missed it, because:

If To < Ts, then we may drop output, because the state assumes it was already written but, due to the timing of actual writes to Kafka or durability concerns, the output is not there. This is important, for example, for a job that emits "session started @ X" messages on the first message for any given session key: on replay, the state will see a repeated message as a duplicate and not emit the output. I think this is solvable in the job as long as To >= Ti, but I am not certain that solution is generally applicable to tasks where side effects of previous input exist in the state and affect future output.

If Ts < Ti, then our stateful task will effectively drop input, even though it may have produced some or all of the output for those messages in its previous incarnation, because the state used for all future processing will not include the side effects of processing the messages between Ts and Ti. We see no solution for this at the task level, as it would require collusion between the two backing systems (checkpoints and changelogs) to correct, presumably by rewinding Ti to Ts.
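To make the two failure modes concrete, here is a minimal toy model (my own sketch, not anything from the actual codebase) of the "session started" job. It treats Ti, Ts, and To as independent offsets into the input log, restores state from the changelog prefix, and replays input from the checkpoint; the function name `recover_and_replay` and the offset parameters are all hypothetical.

```python
# Toy model of recovery with three independently durable offsets:
#   ti: checkpoint offset (where input replay resumes)
#   ts: changelog offset (how much state survived)
#   to: output offset (how much output survived)
# This is an illustrative sketch, not the real recovery code.

def recover_and_replay(input_log, ti, ts, to):
    """Return the messages present in the output topic after recovery,
    for a job that emits one "session started" message per new key."""
    # Restore state from the changelog: side effects of input[0:ts].
    state = set(input_log[:ts])
    # The output topic already durably holds one "started" line per
    # new key among input[0:to].
    output, started = [], set()
    for msg in input_log[:to]:
        if msg not in started:
            output.append(f"session started @ {msg}")
            started.add(msg)
    # Resume processing from the checkpoint offset ti.
    for msg in input_log[ti:]:
        if msg not in state:  # first time this key is seen by the state
            output.append(f"session started @ {msg}")
            state.add(msg)
    return output

sessions = ["a", "b", "c"]
# Healthy ordering Ti <= Ts <= To: "c" is duplicated (at-least-once),
# but nothing is lost.
ok = recover_and_replay(sessions, ti=1, ts=2, to=3)
# Bad ordering To < Ts: "c" is already in the restored state, but its
# output write never became durable, so "session started @ c" is
# silently dropped.
bad = recover_and_replay(sessions, ti=1, ts=3, to=2)
```

Under the healthy ordering the replay re-emits "session started @ c", a harmless duplicate; under To < Ts the state suppresses it and the message is gone, which is the edge I'm worried about.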
Perhaps my code search failed because I was expecting some coordinating mechanism that would wait for output to be written before writing changelog entries, and again before checkpoints, and that was too presumptuous. Is there something about the code, the assumption, or my edge-case analysis that I've missed that addresses this?

-Bart