Hi Tony,

are your checkpoints typically close to the timeout boundary? From what I see, writing the checkpoint is relatively fast, but the time from the checkpoint trigger to its execution seems very long. This is typically the case if your job has a lot of backpressure: the checkpoint barriers take a long time to travel to the operators because a lot of events are piling up in the buffers. Do you also see long alignment times for your checkpoints?
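To make the settings under discussion concrete, here is a minimal sketch against the Flink 1.3 DataStream API. The 5-minute interval and the limit of 3 concurrent checkpoints are taken from Tony's description further down in the thread; the 10-minute timeout is only inferred from the gap between "Triggering checkpoint 1140" and "Checkpoint 1140 expired" in the quoted JM log, so treat it as an assumption.

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSettingsSketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // 5-minute checkpoint interval, as described later in the thread.
            // Exactly-once mode (the default) aligns checkpoint barriers; under heavy
            // backpressure the alignment phase can dominate checkpoint time.
            // AT_LEAST_ONCE skips alignment, at the cost of weaker guarantees.
            env.enableCheckpointing(5 * 60 * 1000, CheckpointingMode.EXACTLY_ONCE);

            // Checkpoints still running after this long are dropped with
            // "Checkpoint ... expired before completing." The 10-minute value is an
            // assumption inferred from the trigger-to-expiry gap in the quoted JM log.
            env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000);

            // Allow up to 3 checkpoint attempts in flight, as described in the thread.
            env.getCheckpointConfig().setMaxConcurrentCheckpoints(3);

            // ... build and execute the job here ...
        }
    }

The per-subtask alignment duration should also be visible in the checkpoint details of the web UI, which would answer the alignment question directly.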
Best,
Stefan

> On 27.09.2017 at 10:43, Tony Wei <tony19920...@gmail.com> wrote:
>
> Hi Stefan,
>
> It seems that I found something strange in the JM's log.
>
> It had happened more than once before, but all subtasks would finish their checkpoint attempts in the end.
>
> 2017-09-26 01:23:28,690 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1140 @ 1506389008690
> 2017-09-26 01:28:28,690 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1141 @ 1506389308690
> 2017-09-26 01:33:28,690 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1142 @ 1506389608690
> 2017-09-26 01:33:28,691 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1140 expired before completing.
> 2017-09-26 01:38:28,691 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1141 expired before completing.
> 2017-09-26 01:40:38,044 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late message for now expired checkpoint attempt 1140 from c63825d15de0fef55a1d148adcf4467e of job 7c039572b...
> 2017-09-26 01:40:53,743 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late message for now expired checkpoint attempt 1141 from c63825d15de0fef55a1d148adcf4467e of job 7c039572b...
> 2017-09-26 01:41:19,332 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1142 (136733704 bytes in 457413 ms).
>
> For chk #1245 and #1246, there was no late message from the TM. You can refer to the TM log. A fully completed checkpoint attempt generally produces 12 ("... asynchronous part") log lines, but #1245 and #1246 only got 10.
>
> 2017-09-26 10:08:28,690 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1245 @ 1506420508690
> 2017-09-26 10:13:28,690 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1246 @ 1506420808690
> 2017-09-26 10:18:28,691 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1245 expired before completing.
> 2017-09-26 10:23:28,691 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1246 expired before completing.
>
> Moreover, I listed the checkpoint directory on S3 and saw that two states had not been discarded successfully. In general, there are 16 parts for a completed checkpoint state.
>
> 2017-09-26 18:08:33  36919  tony-dev/flink-checkpoints/7c039572b13346f1b17dcc0ace2b72c2/chk-1245/eedd7ca5-ee34-45a5-bf0b-11cc1fc67ab8
> 2017-09-26 18:13:34  37419  tony-dev/flink-checkpoints/7c039572b13346f1b17dcc0ace2b72c2/chk-1246/9aa5c6c4-8c74-465d-8509-5fea4ed25af6
>
> Hope this information is helpful. Thank you.
>
> Best Regards,
> Tony Wei
>
> 2017-09-27 16:14 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:
> Hi,
>
> thanks for the information. Unfortunately, I have no immediate idea what the reason is from the given information. I think a thread dump would be most helpful, but also metrics on the operator level to figure out which part of the pipeline is the culprit.
>
> Best,
> Stefan
>
>> On 26.09.2017 at 17:55, Tony Wei <tony19920...@gmail.com> wrote:
>>
>> Hi Stefan,
>>
>> There is no unknown exception in my full log. The Flink version is 1.3.2. My job is roughly like this.
>>
>> env.addSource(Kafka)
>>    .map(ParseKeyFromRecord)
>>    .keyBy()
>>    .process(CountAndTimeoutWindow)
>>    .asyncIO(UploadToS3)
>>    .addSink(UpdateDatabase)
>>
>> It seemed that all tasks stopped, like in the picture I sent in the last email.
>>
>> I will make sure to take a thread dump from that JVM if this happens again.
>>
>> Best Regards,
>> Tony Wei
>>
>> 2017-09-26 23:46 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:
>> Hi,
>>
>> that is very strange indeed. I had a look at the logs and there is no error or exception reported. I assume there is also no exception in your full logs? Which version of Flink are you using, and what operators were running in the task that stopped? If this happens again, would it be possible to take a thread dump from that JVM?
>>
>> Best,
>> Stefan
>>
>> > On 26.09.2017 at 17:08, Tony Wei <tony19920...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Something weird happened in my streaming job.
>> >
>> > I found that my streaming job seemed to be blocked for a long time, and I saw the situation in the picture below. (chk #1245 and #1246 both finished 7/8 tasks and were then marked as timed out by the JM. Other checkpoints, like #1247, failed in the same state until I restarted the TM.)
>> >
>> > <snapshot.png>
>> >
>> > I'm not sure what happened, but the consumer stopped fetching records, buffer usage was 100%, and the following task did not seem to fetch data anymore. It was as if the whole TM had stopped.
>> >
>> > However, after I restarted the TM and forced the job to restart from the latest completed checkpoint, everything worked again. I don't know how to reproduce it.
>> >
>> > The attachment is my TM log. Because there are many user logs and sensitive information, I only kept the log lines from `org.apache.flink...`.
>> >
>> > My cluster setup is one JM and one TM with 4 available slots.
>> >
>> > The streaming job uses all slots; the checkpoint interval is 5 minutes and the max concurrent checkpoints is 3.
>> >
>> > Please let me know if you need more information to find out what happened in my streaming job. Thanks for your help.
>> >
>> > Best Regards,
>> > Tony Wei
>> > <flink-root-taskmanager-0-partial.log>
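For completeness, here is a rough sketch of a pipeline with the same shape as the job outlined above (Kafka source → map → keyBy → process → async I/O → sink), written against the Flink 1.3 DataStream API mentioned in the thread. All class names, topic names, operator bodies, and the async timeout/capacity values are placeholders, not Tony's actual code; `asyncIO(UploadToS3)` in the rough outline corresponds to wiring an AsyncFunction in via AsyncDataStream, and the AsyncCollector signature shown is the 1.3-era async I/O API.

    import java.util.Collections;
    import java.util.Properties;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.streaming.api.functions.async.AsyncFunction;
    import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
    import org.apache.flink.util.Collector;

    public class PipelineShapeSketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Checkpointing would be configured as in the sketch near the top of this thread.
            env.enableCheckpointing(5 * 60 * 1000);

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.setProperty("group.id", "pipeline-sketch");         // placeholder consumer group

            DataStream<String> records =
                env.addSource(new FlinkKafkaConsumer010<>("input-topic", new SimpleStringSchema(), props));

            // Stand-ins for ParseKeyFromRecord and CountAndTimeoutWindow: the map keeps the
            // record as-is and the ProcessFunction simply forwards it. The real job would
            // extract a key and keep state/timers here.
            DataStream<String> processed = records
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String record) {
                        return record;
                    }
                })
                .keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String record) {
                        return record;
                    }
                })
                .process(new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String value, Context ctx, Collector<String> out) {
                        out.collect(value);
                    }
                });

            // Stand-in for UploadToS3: Flink's async I/O operator. Timeout and capacity
            // are guesses; no real S3 call is made here.
            DataStream<String> uploaded = AsyncDataStream.unorderedWait(
                processed,
                new AsyncFunction<String, String>() {
                    @Override
                    public void asyncInvoke(String input, AsyncCollector<String> collector) {
                        collector.collect(Collections.singletonList(input));
                    }
                },
                60, TimeUnit.SECONDS, 100);

            // Stand-in for UpdateDatabase.
            uploaded.addSink(new SinkFunction<String>() {
                @Override
                public void invoke(String value) {
                    // write to the database here
                }
            });

            env.execute("pipeline shape sketch");
        }
    }

One design note relevant to the backpressure discussion above: the async operator's capacity (100 in this sketch) bounds the number of in-flight requests, so a slow external service at that stage can itself become a source of backpressure for the upstream operators.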