Hi, yesterday I was creating a savepoint (to S3, around 8 GB of state)
using 2 TaskManagers (8 GB each), and it failed because one of the task
managers filled up its disk (probably it didn't have enough RAM to upload
the state to S3 directly; I don't know what the disk size was, but it
reached 100% disk usage and the other one reached 99%).

After the crash, the task manager that reached 100% deleted the "failed
savepoint" from its local disk, but the other one that reached 99% kept it.
Shouldn't that task manager also clean up the failed state?

After cleaning up that task manager's disk, I increased the parallelism
to 6 and created a new savepoint (again around 8 GB of state). Everything
went smoothly, but it took 8 minutes for the new job restored from that
savepoint to start processing.

[image: flink_grafana.png]
Here is the network I/O from the 6 task managers used, and I have a few
questions:

- Isn't 25 Mbps of average speed a bit low? What could be the limitation?
- For 8 GB of state, that gives around 7 minutes to download it [ 8000 MB
/ (25 Mbps / 8 * 6 task managers) / 60 seconds ], which should match the
flat 7-8 minute part of the graph, after which the job started reading from
the Kafka topic (rough check in the sketch below).
- Can I mitigate this with task local recovery [1]? Or is this only for
checkpoints? (I put a config sketch of what I mean after this list.)
- We are using *m5.xlarge* (4 vCPUs, 16 GB RAM) with 2 slots per TM.
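
To sanity-check the bandwidth math above, here is the back-of-the-envelope
calculation as a small Python sketch (the 25 Mbps per task manager and the
8000 MB of state are just the values I read off the Grafana graph):

    # Rough estimate of savepoint download time from the observed network I/O.
    state_mb = 8000.0        # ~8 GB of savepoint state
    per_tm_mbps = 25.0       # average network I/O per task manager (from Grafana)
    task_managers = 6

    per_tm_mb_per_s = per_tm_mbps / 8                    # Mbps -> MB/s
    aggregate_mb_per_s = per_tm_mb_per_s * task_managers
    download_minutes = state_mb / aggregate_mb_per_s / 60

    print(f"aggregate throughput: {aggregate_mb_per_s:.1f} MB/s")   # ~18.8 MB/s
    print(f"estimated download time: {download_minutes:.1f} min")   # ~7.1 min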
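
Regarding the task local recovery question: if I understand [1] correctly,
enabling it would be something like the following in flink-conf.yaml on the
TaskManagers (this is just my reading of the docs, and the directory is a
placeholder path I made up):

    # Sketch of flink-conf.yaml changes for task-local recovery (my reading of [1])
    state.backend.local-recovery: true
    # Optional: where local state copies are kept; /data/flink-local-recovery is a made-up path
    taskmanager.state.local.root-dirs: /data/flink-local-recovery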

Thanks,
David

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#task-local-recovery
