Hi - I have some questions regarding Flink's checkpointing, specifically related to storing state in the backends.
So let's say an operator in a streaming job is building up some state. When it receives barriers from all of its input streams, does it store *all* of its state to the backend? I think that is what the docs [1] and the paper [2] imply, but I want to make sure. In other words, if the operator contains 100MB of state and the backend is HDFS, does the operator copy all 100MB of state to HDFS during the checkpoint?

Following on from this example, say the operator is a global window that is storing some state for each unique key observed in the stream of messages (e.g. userId). Assume that over time the number of observed unique keys grows, so the size of the state also grows (the window state is never purged). Is the entire operator state at the time of each checkpoint stored to the backend, so that over time the size of the state stored for each checkpoint grows? Or is only the state that changed in some way since the last checkpoint stored to the backend?

Are old checkpoint states in the backend ever deleted / cleaned up? That is, if the state for checkpoint n in the backend is all that is needed to restore a failed job, then the state for all checkpoints m < n should not be needed any more, right? Can all of those old checkpoints be deleted from the backend? Does Flink do this?

Thanks,
Zach

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.0/internals/stream_checkpointing.html
[2] http://arxiv.org/abs/1506.08603
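P.S. To make the concern concrete, here is a tiny back-of-the-envelope model in plain Java (not Flink API code). It simply assumes full snapshots, which is exactly the behavior I'm asking about: if each checkpoint re-writes the operator's entire keyed state, and the state grows by some fixed amount between checkpoints, both the per-checkpoint write size and the total retained in the backend (if old checkpoints are never cleaned up) keep growing. The numbers and method names are made up for illustration.

```java
public class CheckpointGrowth {
    // Returns {finalStateMb, totalWrittenMb} after n checkpoints, ASSUMING
    // each checkpoint re-writes the full operator state and the state grows
    // by deltaMb (new unique keys) between checkpoints.
    static long[] simulate(int n, long deltaMb) {
        long stateMb = 0;          // current operator state size
        long totalWrittenMb = 0;   // cumulative bytes shipped to the backend
        for (int cp = 1; cp <= n; cp++) {
            stateMb += deltaMb;         // new keys observed since last checkpoint
            totalWrittenMb += stateMb;  // full snapshot: copy everything again
        }
        return new long[] { stateMb, totalWrittenMb };
    }

    public static void main(String[] args) {
        long[] r = simulate(5, 10);
        // Under full snapshots: state is 50 MB, but 150 MB total has been
        // written (10+20+30+40+50), vs. only 50 MB if deltas were stored.
        System.out.println("state after 5 checkpoints: " + r[0] + " MB");
        System.out.println("total written to backend:  " + r[1] + " MB");
    }
}
```

So the question boils down to which of those two totals Flink's behavior actually resembles, and whether the superseded checkpoints are ever removed.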