Hi,

Yesterday I was creating a savepoint (to S3, around 8 GB of state) using 2 TaskManagers (8 GB each), and it failed because one of the task managers filled up its disk. It probably didn't have enough RAM to stream the state to S3 directly; I don't know the exact disk size, but that task manager reached 100% disk usage and the other one reached 99%.
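For context, this is roughly how the savepoint was triggered; the job id and the S3 path below are placeholders, not our real values:

    # flink-conf.yaml (S3 path is a placeholder)
    state.savepoints.dir: s3://<our-bucket>/savepoints

    # savepoint triggered from the CLI
    ./bin/flink savepoint <jobId> s3://<our-bucket>/savepoints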
After the crash, the task manager that reached 100% deleted the "failed savepoint" from its local disk, but the other one, which reached 99%, kept it. Shouldn't that task manager also clean up the failed state?

After cleaning up that task manager's disk, I increased the parallelism to 6, created a new ~8 GB savepoint, and everything went smoothly, but the new job restored from the previous savepoint took 8 minutes to start processing.

[image: flink_grafana.png]

Here is the network IO from the 6 task managers used, and I have a few questions:

- Isn't 25 Mbps average speed a bit low? What could be the limitation?
- For 8 GB of state, that gives around 7 minutes to download it [ 8000 MB / (25 Mbps / 8 * 6 task managers) / 60 ≈ 7 min ], which matches the flat 7-8 minute section of the graph, after which the job started reading from the Kafka topic.
- Can I mitigate this with task-local recovery [1] (see the config sketch at the end of this mail), or does it only apply to checkpoints?
- We are using *m5.xlarge* instances (4 vCPU, 16 GB RAM) with 2 slots per TM.

Thanks,
David

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#task-local-recovery
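P.S. This is the kind of change I had in mind for [1], just a sketch of the documented keys with a placeholder directory, not something we have tried yet:

    # flink-conf.yaml
    # enable task-local recovery
    state.backend.local-recovery: true
    # optional: where the local state copies are kept (placeholder path; defaults to the TM temp dirs)
    taskmanager.state.local.root-dirs: /data/flink/local-recovery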