Hi,

thanks for the information. Unfortunately, I have no immediate idea what the cause is from the given information. I think a thread dump would be most helpful, but also metrics at the operator level to figure out which part of the pipeline is the culprit.
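In case it helps: a thread dump can usually be taken with the standard JDK tools against the TaskManager process, for example

    jps -l                                      # find the TaskManager PID
    jstack <TM pid> > taskmanager-threads.txt   # write the dump to a file
    # or: kill -3 <TM pid>                      # dump goes to the TaskManager's .out file

Please treat the exact commands as a rough pointer; they depend on your installation. For the operator-level view, the back pressure tab in the web dashboard is a good first stop to see which operator stops consuming.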
Best,
Stefan

> Am 26.09.2017 um 17:55 schrieb Tony Wei <tony19920...@gmail.com>:
>
> Hi Stefan,
>
> There is no unknown exception in my full log. The Flink version is 1.3.2.
> My job is roughly like this.
>
> env.addSource(Kafka)
>    .map(ParseKeyFromRecord)
>    .keyBy()
>    .process(CountAndTimeoutWindow)
>    .asyncIO(UploadToS3)
>    .addSink(UpdateDatabase)
>
> It seemed all tasks stopped, like the picture I sent in the last email.
>
> I will keep my eye on taking a thread dump from that JVM if this happens again.
>
> Best Regards,
> Tony Wei
>
> 2017-09-26 23:46 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com <mailto:s.rich...@data-artisans.com>>:
> Hi,
>
> that is very strange indeed. I had a look at the logs and there is no error or exception reported. I assume there is also no exception in your full logs? Which version of Flink are you using, and what operators were running in the task that stopped? If this happens again, would it be possible to take a thread dump from that JVM?
>
> Best,
> Stefan
>
> > Am 26.09.2017 um 17:08 schrieb Tony Wei <tony19920...@gmail.com <mailto:tony19920...@gmail.com>>:
> >
> > Hi,
> >
> > Something weird happened on my streaming job.
> >
> > I found my streaming job seems to be blocked for a long time, and I saw the situation like the picture below. (chk #1245 and #1246 both finished 7/8 tasks and were then marked as timed out by the JM. Other checkpoints, like #1247, failed in the same state until I restarted the TM.)
> >
> > <snapshot.png>
> >
> > I'm not sure what happened, but the consumer stopped fetching records, buffer usage was 100%, and the following task did not seem to fetch data anymore. It was as if the whole TM had stopped.
> >
> > However, after I restarted the TM and forced the job to restart from the latest completed checkpoint, everything worked again. I don't know how to reproduce it.
> >
> > The attachment is my TM log. Because there are many user logs and sensitive information, I only kept the log entries from `org.apache.flink...`.
> >
> > My cluster setting is one JM and one TM with 4 available slots.
> >
> > The streaming job uses all slots, the checkpoint interval is 5 minutes, and the max concurrent checkpoint number is 3.
> >
> > Please let me know if you need more information to find out what happened to my streaming job. Thanks for your help.
> >
> > Best Regards,
> > Tony Wei
> > <flink-root-taskmanager-0-partial.log>
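For reference, a minimal sketch of what the pipeline and checkpoint settings described above could look like with the DataStream API in Flink 1.3.x. Only the operator names, the 5-minute interval and the limit of 3 concurrent checkpoints come from this thread; the types, the Kafka connector version, the async timeout and the capacity are assumptions for illustration.

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // checkpoint every 5 minutes and allow up to 3 checkpoints in flight, as described above
    env.enableCheckpointing(5 * 60 * 1000);
    env.getCheckpointConfig().setMaxConcurrentCheckpoints(3);

    // source -> parse -> keyed count/timeout window (a ProcessFunction using timers)
    DataStream<Record> windowed = env
        .addSource(new FlinkKafkaConsumer010<>(topic, schema, kafkaProps))
        .map(new ParseKeyFromRecord())
        .keyBy(record -> record.getKey())
        .process(new CountAndTimeoutWindow());

    // "asyncIO(UploadToS3)": async I/O is applied via AsyncDataStream rather than a fluent call;
    // the timeout and the capacity (max in-flight requests) are made up for the example
    DataStream<UploadResult> uploaded = AsyncDataStream.unorderedWait(
        windowed, new UploadToS3(), 60, TimeUnit.SECONDS, 100);

    uploaded.addSink(new UpdateDatabase());
    env.execute();

Here, topic, schema and kafkaProps are placeholders for the actual Kafka consumer configuration, and Record/UploadResult stand in for whatever types the job really uses.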