even though "Max. number of execution retries" shows "Restart with fixed delay (240000 ms). #20 restart attempts".
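For reference, a fixed-delay restart strategy like the one shown above would normally be configured along these lines in flink-conf.yaml (a sketch only; the attempt count and delay come from the UI text above, everything else is assumed):

    # flink-conf.yaml (sketch)
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 20       # 20 restart attempts
    restart-strategy.fixed-delay.delay: 240000 ms   # 4 minutes between attempts

With this in place the JM should retry 20 times before giving up, which is what makes the single failed attempt described in the reply below surprising.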
On Sat, Jun 29, 2019 at 10:44 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

This is strange, the retry strategy was 20 times with a 4 minute delay. This job tried once ( we had a hadoop Name Node hiccup ) but I think it could not even get to the NN and gave up ( as in did not retry the next 19 times ) ....

2019-06-29 00:33:13,680 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Could not restart the job Kafka-to-HDFS (00000000000000000000000000000005) because the restart strategy prevented it.

On Sat, Jun 29, 2019 at 10:03 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

We are investigating that. But is the above theory plausible ( flink gurus ), even though this, as in forcing restartPolicy: Never, pretty much nullifies HA on the JM if it is a Job Cluster ( at least on k8s )?

As for the reason, we are investigating that.

One thing we are looking at is the QoS ( https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/ ), which was Burstable; a limit was not set on the job. The below, for example, would have a QoS of Guaranteed. So one reason could be this.

    resources:
      requests:
        cpu: "8.0"
        memory: "16Gi"
      limits:
        cpu: "8.0"
        memory: "16Gi"

On Sat, Jun 29, 2019 at 9:35 AM Timothy Victor <vict...@gmail.com> wrote:

This is slightly off topic, so I'm changing the subject to not conflate the original issue you brought up. But do we know why the JM crashed in the first place?

We are also thinking of moving to K8s, but to be honest we had tons of stability issues in our first rodeo. That could just be our lack of a deeper understanding of K8s (a lot had to do with DNS). But cases like this make me wonder what happened to cause the crash in the first place.

Thanks

Tim

On Sat, Jun 29, 2019, 8:25 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Another point: the JM had terminated. The policy on K8s for the Job Cluster is

    spec:
      restartPolicy: OnFailure

2019-06-29 00:33:14,308 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1443.
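Tying the two points above together, a job-cluster pod spec that keeps restartPolicy: OnFailure and pins requests equal to limits (so the pod lands in the Guaranteed QoS class) might look roughly like the sketch below. The name and image are assumptions for illustration, not the actual manifest from this thread:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: kafka-to-hdfs-jobmanager      # hypothetical name
    spec:
      template:
        spec:
          restartPolicy: OnFailure        # restart the JM container in place on failure
          containers:
            - name: jobmanager
              image: flink:1.8            # image/tag assumed
              resources:
                requests:
                  cpu: "8.0"
                  memory: "16Gi"
                limits:                   # identical to requests => Guaranteed QoS
                  cpu: "8.0"
                  memory: "16Gi"

Whether Guaranteed QoS actually prevents the kind of JM termination seen here is exactly what is being investigated above.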
On Sat, Jun 29, 2019 at 9:04 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

I have not tried on bare metal. We have no option but k8s.

And this is a job cluster.

On Sat, Jun 29, 2019 at 9:01 AM Timothy Victor <vict...@gmail.com> wrote:

Hi Vishal, can this be reproduced on a bare metal instance as well? Also is this a job or a session cluster?

Thanks

Tim

On Sat, Jun 29, 2019, 7:50 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

OK, this happened again and it is bizarre ( and is definitely not what I think should happen ).

The job failed and I see these logs ( in essence it is keeping the last 5 externalized checkpoints ) but deleting the zk checkpoints directory:

06.28.2019 20:33:13.738 2019-06-29 00:33:13,736 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 5654 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5654' not discarded.
06.28.2019 20:33:13.788 2019-06-29 00:33:13,786 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 5655 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5655' not discarded.
06.28.2019 20:33:13.838 2019-06-29 00:33:13,836 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 5656 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5656' not discarded.
06.28.2019 20:33:13.888 2019-06-29 00:33:13,886 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 5657 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5657' not discarded.
06.28.2019 20:33:13.938 2019-06-29 00:33:13,936 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 5658 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5658' not discarded.
06.28.2019 20:33:13.938 2019-06-29 00:33:13,936 INFO org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore - Removing /kafka-to-hdfs-v2/kafka-to-hdfs-v2/k8s/checkpoints/00000000000000000000000000000005 from ZooKeeper

The job restarts and this is bizarre. It does not find the ZK checkpoint directory, but instead of going to the state.checkpoints.dir to get its last checkpoint, it restarts from the savepoint that we started this job with ( resetting the checkpoint id ) about 15 days ago:

06.28.2019 20:33:20.047 2019-06-29 00:33:20,045 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
06.28.2019 20:33:20.053 2019-06-29 00:33:20,051 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 0 checkpoints from storage.
06.28.2019 20:33:20.053 2019-06-29 00:33:20,051 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 0 checkpoints in ZooKeeper.
06.28.2019 20:33:20.054 2019-06-29 00:33:20,053 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Starting job 00000000000000000000000000000005 from savepoint hdfs://nn-crunchy:8020/flink-savepoints_k8s/prod/kafka-to-hdfs/savepoint-000000-128f419cdc6f ()
06.28.2019 20:33:20.540 2019-06-29 00:33:20,538 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Reset the checkpoint ID of job 00000000000000000000000000000005 to 4203.
06.28.2019 20:33:20.540 2019-06-29 00:33:20,538 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
06.28.2019 20:33:20.550 2019-06-29 00:33:20,549 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 4202.
06.28.2019 20:33:20.550 2019-06-29 00:33:20,548 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 1 checkpoints in ZooKeeper.
06.28.2019 20:33:20.550 2019-06-29 00:33:20,549 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 1 checkpoints from storage.

This just does not make sense....
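One plausible reading of the savepoint fallback above (an assumption on my part, not something the logs confirm): a standalone job cluster is usually launched with the savepoint path baked into the container arguments, so when the entrypoint comes back up and finds nothing under its ZooKeeper checkpoints node, it simply re-applies those original arguments. A sketch of what such container args might look like in the k8s manifest, with the class name as a placeholder and the flag names assumed to match the docker job-cluster entrypoint:

    # container args for the job-cluster entrypoint (hypothetical)
    args:
      - "job-cluster"
      - "--job-classname"
      - "com.example.KafkaToHdfs"        # placeholder class
      - "--job-id"
      - "00000000000000000000000000000005"
      - "--fromSavepoint"
      - "hdfs://nn-crunchy:8020/flink-savepoints_k8s/prod/kafka-to-hdfs/savepoint-000000-128f419cdc6f"

If that is the case, deleting the ZK checkpoints node leaves the savepoint in the args as the only restore point the restarted entrypoint knows about.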
On Wed, Jun 5, 2019 at 9:29 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Ok, I will do that.

On Wed, Jun 5, 2019, 8:25 AM Chesnay Schepler <ches...@apache.org> wrote:

Can you provide us the jobmanager logs?

After the first restart the JM should have started deleting older checkpoints as new ones were created. After the second restart the JM should have recovered all 10 checkpoints, started from the latest, and started pruning old ones as new ones were created.

So you're running into 2 separate issues here, which is a bit odd.

On 05/06/2019 13:44, Vishal Santoshi wrote:

Any one?

On Tue, Jun 4, 2019, 2:41 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

The above is flink 1.8.

On Tue, Jun 4, 2019 at 12:32 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

I had a sequence of events that created this issue.

* I started a job and I had state.checkpoints.num-retained: 5.

* As expected I have the 5 latest checkpoints retained in my hdfs backend.

* The JM dies ( K8s limit etc ) without cleaning the hdfs directory. The k8s job restores from the latest checkpoint ( I think ) but as it creates new checkpoints it does not delete the older checkpoints. At the end there are now 10 checkpoints: 5 from the old run which remain static and 5 latest representing the ongoing pipe.

* The JM dies again and restarts from the latest of the 5 old checkpoints.

This looks like a bug in the Job Cluster implementation of flink. It looks like it is taking the 5th checkpoint from the beginning based on the num-retained value. Note that it has the same job id and does not scope to a new directory.

https://github.com/apache/flink/blob/1dfdaa417ab7cdca9bef1efe6381c7eb67022aaf/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L109

Please tell me if this does not make sense.

Vishal
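For anyone hitting the same thing, these are roughly the configuration keys that determine where a job cluster looks for its state on recovery. Only the retention count and the general key names come from this thread; the quorum address and paths below are placeholders guessed from the ZK path in the removal log above:

    # flink-conf.yaml (sketch; hostnames and paths are placeholders)
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
    high-availability.zookeeper.path.root: /kafka-to-hdfs-v2    # ZK root guessed from the removal log
    high-availability.cluster-id: k8s                           # fixed id, reused across JM restarts
    high-availability.storageDir: hdfs:///flink/ha
    state.checkpoints.dir: hdfs:///analytics_eng/kafka-to-hdfs-states
    state.checkpoints.num-retained: 5

Because the cluster-id and the job id stay fixed across JM restarts, the recovered ZooKeeperCompletedCheckpointStore reads back whatever is still under that ZK path, which is why leftover entries from an earlier run can shadow newer checkpoints, as described in the last message above.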