Hi,

While restoring from the latest checkpoint starts immediately after the job is restarted, restoring from a savepoint takes more than five minutes before the job makes any progress. During this blackout, I cannot observe any resource usage across the cluster. After that period, I can see from various metrics, including flink_taskmanager_job_task_currentLowWatermark, that Flink is trying to catch up with the progress in the source topic.

FYI, I'm using
- Flink-1.4.2
- FsStateBackend configured with HDFS
- EventTime with BoundedOutOfOrdernessTimestampExtractor

The size of a single checkpoint/savepoint is ~50GB and we have 7 servers for taskmanagers.

Best,

- Dongwon
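
P.S. As a rough sanity check on the numbers above, here is a back-of-envelope sketch of the average read rate each server would need if loading the state alone accounted for the delay. It assumes the ~50GB is spread evenly over the 7 taskmanager servers and takes five minutes as the delay (the real delay is longer, so the real average would be lower); the function name is only for illustration.

```python
def per_server_read_rate_mb_s(state_gb, servers, seconds):
    """Average read rate (MB/s) each server needs to load its share of the state.

    Assumes the state is split evenly across servers and uses decimal units
    (1 GB = 1e9 bytes, 1 MB = 1e6 bytes).
    """
    bytes_per_server = state_gb * 1e9 / servers
    return bytes_per_server / seconds / 1e6


# ~50GB savepoint, 7 taskmanager servers, five-minute blackout
rate = per_server_read_rate_mb_s(state_gb=50, servers=7, seconds=5 * 60)
print(f"{rate:.1f} MB/s per server")
```

Under these assumptions the result is on the order of tens of MB/s per server, which I would expect to show up in the disk or network metrics, so the absence of any visible resource usage during the blackout is what puzzles me.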