Hi,

thanks for the information. Unfortunately, I have no immediate idea what the cause is from the given information. I think a thread dump would be most helpful, but also metrics at the operator level to figure out which part of the pipeline is the culprit.
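In case it helps: a thread dump can usually be taken with the standard JDK tools against the TaskManager process, for example

    jps -l                                      # find the TaskManager PID
    jstack <TM pid> > taskmanager-threads.txt   # write the dump to a file
    # or: kill -3 <TM pid>                      # dump goes to the TaskManager's .out file

Please treat the exact commands as a rough pointer; they depend on your installation. For the operator-level view, the back pressure tab in the web dashboard is a good first stop to see which operator stops consuming.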
Best,
Stefan

> Am 26.09.2017 um 17:55 schrieb Tony Wei <tony19920...@gmail.com>:
>
> Hi Stefan,
>
> There is no unknown exception in my full log. The Flink version is 1.3.2.
> My job is roughly like this.
>
> env.addSource(Kafka)
>    .map(ParseKeyFromRecord)
>    .keyBy()
>    .process(CountAndTimeoutWindow)
>    .asyncIO(UploadToS3)
>    .addSink(UpdateDatabase)
>
> It seemed all tasks stopped, like the picture I sent in the last email.
>
> I will keep my eye on taking a thread dump from that JVM if this happens again.
>
> Best Regards,
> Tony Wei
>
> 2017-09-26 23:46 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com <mailto:s.rich...@data-artisans.com>>:
> Hi,
>
> that is very strange indeed. I had a look at the logs and there is no error or exception reported. I assume there is also no exception in your full logs? Which version of Flink are you using, and what operators were running in the task that stopped? If this happens again, would it be possible to take a thread dump from that JVM?
>
> Best,
> Stefan
>
> > Am 26.09.2017 um 17:08 schrieb Tony Wei <tony19920...@gmail.com <mailto:tony19920...@gmail.com>>:
> >
> > Hi,
> >
> > Something weird happened on my streaming job.
> >
> > I found my streaming job seems to be blocked for a long time, and I saw the situation like the picture below. (chk #1245 and #1246 both finished 7/8 tasks and were then marked as timed out by the JM. Other checkpoints, like #1247, failed in the same state until I restarted the TM.)
> >
> > <snapshot.png>
> >
> > I'm not sure what happened, but the consumer stopped fetching records, buffer usage was 100%, and the following task did not seem to fetch data anymore. It was as if the whole TM had stopped.
> >
> > However, after I restarted the TM and forced the job to restart from the latest completed checkpoint, everything worked again. I don't know how to reproduce it.
> >
> > The attachment is my TM log. Because there are many user logs and sensitive information, I only kept the log entries from `org.apache.flink...`.
> >
> > My cluster setting is one JM and one TM with 4 available slots.
> >
> > The streaming job uses all slots, the checkpoint interval is 5 minutes, and the max concurrent checkpoint number is 3.
> >
> > Please let me know if you need more information to find out what happened to my streaming job. Thanks for your help.
> >
> > Best Regards,
> > Tony Wei
> > <flink-root-taskmanager-0-partial.log>
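For reference, a minimal sketch of what the pipeline and checkpoint settings described above could look like with the DataStream API in Flink 1.3.x. Only the operator names, the 5-minute interval and the limit of 3 concurrent checkpoints come from this thread; the types, the Kafka connector version, the async timeout and the capacity are assumptions for illustration.

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // checkpoint every 5 minutes and allow up to 3 checkpoints in flight, as described above
    env.enableCheckpointing(5 * 60 * 1000);
    env.getCheckpointConfig().setMaxConcurrentCheckpoints(3);

    // source -> parse -> keyed count/timeout window (a ProcessFunction using timers)
    DataStream<Record> windowed = env
        .addSource(new FlinkKafkaConsumer010<>(topic, schema, kafkaProps))
        .map(new ParseKeyFromRecord())
        .keyBy(record -> record.getKey())
        .process(new CountAndTimeoutWindow());

    // "asyncIO(UploadToS3)": async I/O is applied via AsyncDataStream rather than a fluent call;
    // the timeout and the capacity (max in-flight requests) are made up for the example
    DataStream<UploadResult> uploaded = AsyncDataStream.unorderedWait(
        windowed, new UploadToS3(), 60, TimeUnit.SECONDS, 100);

    uploaded.addSink(new UpdateDatabase());
    env.execute();

Here, topic, schema and kafkaProps are placeholders for the actual Kafka consumer configuration, and Record/UploadResult stand in for whatever types the job really uses.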