Ok, I will do that. On Wed, Jun 5, 2019, 8:25 AM Chesnay Schepler <ches...@apache.org> wrote:
> Can you provide us the jobmanager logs? > > After the first restart the JM should have started deleting older > checkpoints as new ones were created. > After the second restart the JM should have recovered all 10 checkpoints, > start from the latest, and start pruning old ones as new ones were created. > > So you're running into 2 separate issues here, which is a bit odd. > > On 05/06/2019 13:44, Vishal Santoshi wrote: > > Any one? > > On Tue, Jun 4, 2019, 2:41 PM Vishal Santoshi <vishal.santo...@gmail.com> > wrote: > >> The above is flink 1.8 >> >> On Tue, Jun 4, 2019 at 12:32 PM Vishal Santoshi < >> vishal.santo...@gmail.com> wrote: >> >>> I had a sequence of events that created this issue. >>> >>> * I started a job and I had the state.checkpoints.num-retained: 5 >>> >>> * As expected I have 5 latest checkpoints retained in my hdfs backend. >>> >>> >>> * JM dies ( K8s limit etc ) without cleaning the hdfs directory. The >>> k8s job restores from the latest checkpoint ( I think ) but as it creates >>> new checkpoints it does not delete the older chk point. At the end there >>> are now 10 chkpoints, 5 from the old run which remain static and 5 latest >>> representing the on going pipe. >>> >>> * The JM dies again and restart from the latest from the 5 old >>> checkpoints. >>> >>> This looks a bug in the Job Cluster implementation of flink. It looks >>> like it is taking the 5th checkpoint from the beginning based on >>> num-retained value, Note that it has the same job id and does not scope to >>> a new directory. >>> >>> >>> https://github.com/apache/flink/blob/1dfdaa417ab7cdca9bef1efe6381c7eb67022aaf/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L109 >>> >>> Please tell me if this does not make sense. >>> >>> Vishal >>> >>> >>> >>> >>> >>> >