Hi All,

Any pointers on the checkpoint failure scenario below? I'd appreciate any help.

Thanks
On Sun, Jul 7, 2019 at 9:23 PM Navneeth Krishnan <reachnavnee...@gmail.com> wrote:

> Hi All,
>
> Occasionally I run into checkpoint failures where 2 or 3 consecutive
> checkpoints fail after running for a minute and then it recovers. This is
> causing a delay in processing the incoming data since there is a huge amount
> of data buffered during the failed checkpoints. I don't see any errors in the
> taskmanager logs but here is the error in the jobmanager log. The state
> size is around 100 MB.
>
> *Checkpoint configuration:*
> Option                              Value
> Checkpointing Mode                  Exactly Once
> Interval                            1m 0s
> Timeout                             1m 0s
> Minimum Pause Between Checkpoints   5s
> Maximum Concurrent Checkpoints      1
> Persist Checkpoints Externally      Enabled (retain on cancellation)
>
> *Jobmanager Log:*
>
> 2019-07-05 17:53:54,125 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from 79515b6550d2c223701be0a9c870995f of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,141 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from 630984cdd5e66b4d9ea95a91cb4d23f6 of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,168 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from e12ed2e185a37559f93181905a52ebeb of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,215 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from 1fede192e2ff11e0905d98ff5ff6f9ce of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,223 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from d4e895eb20cc259c95b249cd0252930f of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,310 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from be5c711d7b37ed6d8022224dc447db91 of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,351 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from 1ed52695cc407f2f143d2bb5d23cbdbb of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,398 [flink-akka.actor.default-dispatcher-465901] WARN
> o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
> checkpoint attempt 9867 from 2e43cf968ad399c0b8426239a7dd081c of job
> 00ff93caa4cc9464bd41e1d050fcf65c.
> 2019-07-05 17:53:54,959 [flink-akka.actor.default-dispatcher-465868] INFO
> o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9868 (279307042
> bytes in 50707 ms).
> 2019-07-05 17:54:04,174 [Checkpoint Timer] INFO
> o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9869 @ 1562349244171
> 2019-07-05 17:54:10,709 [flink-akka.actor.default-dispatcher-465905] INFO
> o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9869 (253638470
> bytes in 6430 ms).
> 2019-07-05 17:55:04,174 [Checkpoint Timer] INFO
> o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9870 @ 1562349304171
> 2019-07-05 17:55:09,816 [flink-akka.actor.default-dispatcher-465913] INFO
> o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9870 (138649543
> bytes in 5551 ms).
> 2019-07-05 17:56:04,174 [Checkpoint Timer] INFO
> o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9871 @ 1562349364171
>
> Thanks
>
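
To see how close each checkpoint comes to the 1m timeout, the durations can be pulled out of the jobmanager log. Below is a minimal sketch, assuming the log message format shown in the quoted excerpt. The function name `summarize_checkpoints`, the `warn_ratio` parameter, and its 0.8 default are hypothetical choices for illustration, not part of Flink.

```python
import re

# Assumed format: "... Completed checkpoint 9868 (279307042 bytes in 50707 ms)."
COMPLETED = re.compile(r"Completed checkpoint (\d+) \((\d+) bytes in (\d+) ms\)")
# Assumed format: "... Received late message for now expired checkpoint attempt 9867 ..."
EXPIRED = re.compile(r"expired checkpoint attempt (\d+)")


def summarize_checkpoints(log_text, timeout_ms=60_000, warn_ratio=0.8):
    """Return (expired_ids, completed, slow) extracted from jobmanager log text.

    expired_ids: sorted ids of checkpoints that received late acknowledgements
    completed:   list of (checkpoint_id, size_bytes, duration_ms)
    slow:        completed checkpoints whose duration reached
                 warn_ratio * timeout_ms (hypothetical threshold)
    """
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())

    expired_ids = sorted({int(cid) for cid in EXPIRED.findall(flat)})
    completed = [
        (int(cid), int(size), int(ms)) for cid, size, ms in COMPLETED.findall(flat)
    ]
    slow = [(cid, ms) for cid, _, ms in completed if ms >= warn_ratio * timeout_ms]
    return expired_ids, completed, slow


if __name__ == "__main__":
    import sys

    # Usage: python checkpoint_summary.py < jobmanager.log
    expired, completed, slow = summarize_checkpoints(sys.stdin.read())
    print("expired checkpoints:", expired)
    print("completed checkpoints:", completed)
    print("close to timeout:", slow)
```

Run against the excerpt above, this would list 9867 as expired and flag 9868 (50707 ms) as close to the 60000 ms timeout, while 9869 and 9870 finish in a few seconds.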