Can you provide us with the JobManager logs?
After the first restart the JM should have started deleting older
checkpoints as new ones were created.
After the second restart the JM should have recovered all 10
checkpoints, started from the latest, and started pruning old ones as
new ones were created.
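To make the expected behavior concrete, here is a minimal sketch (my own illustration, not Flink's actual code) of what retention with state.checkpoints.num-retained: 5 should look like: each newly completed checkpoint subsumes the oldest once the retained count exceeds the limit, so only the 5 most recent survive.

```java
import java.util.ArrayDeque;

/**
 * Hypothetical model of checkpoint retention (not Flink's API):
 * keep at most numRetained completed checkpoints, pruning the
 * oldest as new ones complete.
 */
class RetainedCheckpoints {
    private final ArrayDeque<Long> retained = new ArrayDeque<>();
    private final int numRetained;

    RetainedCheckpoints(int numRetained) {
        this.numRetained = numRetained;
    }

    /** Register a newly completed checkpoint; prune the oldest if over the limit. */
    void add(long checkpointId) {
        retained.addLast(checkpointId);
        while (retained.size() > numRetained) {
            // In the real JM this is where the old checkpoint
            // directory should also be deleted from HDFS.
            retained.removeFirst();
        }
    }

    int size() { return retained.size(); }
    long latest() { return retained.getLast(); }
    long oldest() { return retained.getFirst(); }
}
```

After 10 completed checkpoints with num-retained = 5, only checkpoints 6 through 10 should remain, which is the pruning the JM apparently did not do after the first restart.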
So you're running into 2 separate issues here, which is a bit odd.
On 05/06/2019 13:44, Vishal Santoshi wrote:
Any one?
On Tue, Jun 4, 2019, 2:41 PM Vishal Santoshi
<vishal.santo...@gmail.com> wrote:
The above is Flink 1.8.
On Tue, Jun 4, 2019 at 12:32 PM Vishal Santoshi
<vishal.santo...@gmail.com> wrote:
I had a sequence of events that created this issue.
* I started a job with state.checkpoints.num-retained: 5.
* As expected, I have the 5 latest checkpoints retained in my HDFS
backend.
* The JM dies ( K8s limit etc. ) without cleaning the HDFS
directory. The K8s job restores from the latest checkpoint (
I think ), but as it creates new checkpoints it does not delete
the older checkpoints. In the end there are 10 checkpoints: 5
from the old run, which remain static, and the 5 latest representing
the ongoing pipeline.
* The JM dies again and restarts from the latest of the 5
old checkpoints.
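For reference, the retention setting mentioned above lives in flink-conf.yaml:

```yaml
# flink-conf.yaml
state.checkpoints.num-retained: 5
```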
This looks like a bug in the Job Cluster implementation of Flink.
It looks like it is taking the 5th checkpoint from the
beginning, based on the num-retained value. Note that it has the
same job id and does not scope to a new directory.
https://github.com/apache/flink/blob/1dfdaa417ab7cdca9bef1efe6381c7eb67022aaf/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L109
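To spell out what I would expect recovery to do: after the restart, with 10 checkpoint directories still on HDFS, the store should resume from the one with the highest checkpoint ID, not from the 5th counted from the beginning. A minimal sketch of that expectation (hypothetical helper, not Flink's actual code):

```java
import java.util.Collections;
import java.util.List;

/**
 * Sketch of the restore selection I would expect, assuming
 * recovery finds the IDs of all completed checkpoints still
 * present in the HA store / on HDFS.
 */
class RecoverySketch {
    static long selectRestoreCheckpoint(List<Long> recoveredIds) {
        if (recoveredIds.isEmpty()) {
            throw new IllegalStateException("no completed checkpoints recovered");
        }
        // Always the latest checkpoint, regardless of how many
        // stale ones are lying around from before the crash.
        return Collections.max(recoveredIds);
    }
}
```

With IDs 1 through 10 recovered, this picks 10; the behavior I observed looks as if checkpoint 5 was chosen instead.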
Please tell me if this does not make sense.
Vishal