Sweet, thanks Aljoscha for the quick help.

Gyula
Aljoscha Krettek <aljos...@apache.org> wrote (on 10 Aug 2017, Thu, 15:33):

> Don't worry! :-) I found that you can configure this via
> "high-availability.job.delay" (in HighAvailabilityOptions).
>
> Best,
> Aljoscha
>
> On 10. Aug 2017, at 15:13, Gyula Fóra <gyula.f...@gmail.com> wrote:
>
> Here is actually the whole log, for the relevant parts at least:
> https://gist.github.com/gyfora/b70dd18c048b862751b194f613514300
>
> Sorry for not pasting it earlier.
>
> Gyula
>
> Gyula Fóra <gyula.f...@gmail.com> wrote (on 10 Aug 2017, Thu, 15:04):
>
>> Oh, I found this in the log that seems to explain it:
>> 2017-08-10 13:13:56,795 INFO org.apache.flink.yarn.YarnJobManager - Delaying recovery of all jobs by 120000 milliseconds.
>>
>> I wonder why this is...
>>
>> Aljoscha Krettek <aljos...@apache.org> wrote (on 10 Aug 2017, Thu, 14:41):
>>
>>> Hi,
>>>
>>> Let me also investigate that. Did you observe this in 1.3.2 and not in
>>> 1.3.0 and/or 1.3.1, or did you go directly from 1.2.x to 1.3.2?
>>>
>>> Best,
>>> Aljoscha
>>>
>>> On 10. Aug 2017, at 13:31, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>
>>> Hi!
>>> In some cases it seems to take a long time for the job to start the
>>> ZooKeeper-based job recovery after recovering from a JM failure.
>>> Looking at the logs, there is a 2-minute gap between the last recovered
>>> TM being started successfully and the job recovery:
>>>
>>> 2017-08-10 13:14:06,369 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at splat31 (akka.tcp://fl...@splat31.sto.midasplayer.com:41821/user/taskmanager) as 2de2376edd4670b48600f29df9e821a1. Current number of registered hosts is 15. Current number of alive task slots is 120.
>>> 2017-08-10 13:14:08,198 INFO org.apache.flink.yarn.YarnFlinkResourceManager - TaskManager container_e22_1495032222473_0939_02_000002 has started.
>>> 2017-08-10 13:14:08,198 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at splat19 (akka.tcp://fl...@splat19.sto.midasplayer.com:36377/user/taskmanager) as 06ed7098060f4d539150c1c26a32ac07. Current number of registered hosts is 16. Current number of alive task slots is 128.
>>> 2017-08-10 13:14:56,710 INFO kafka.client.ClientUtils$ - Fetching metadata from broker id:0,host:kafka10.sto.midasplayer.com,port:9092 with correlation id 0 for 1 topic(s) Set(meter.rbea.log)
>>> 2017-08-10 13:14:56,714 INFO kafka.producer.SyncProducer - Connected to kafka10.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:14:56,734 INFO kafka.producer.SyncProducer - Disconnecting from kafka10.sto.midasplayer.com:9092
>>> 2017-08-10 13:14:56,780 INFO kafka.producer.SyncProducer - Connected to kafka10.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:14:56,791 INFO kafka.producer.SyncProducer - Connected to kafka21.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:14:56,796 INFO kafka.producer.SyncProducer - Connected to kafka11.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:14:56,800 INFO kafka.producer.SyncProducer - Connected to kafka09.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:14:56,806 INFO kafka.producer.SyncProducer - Connected to kafka23.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:14:56,826 INFO kafka.producer.SyncProducer - Connected to kafka22.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:14:56,830 INFO kafka.producer.SyncProducer - Connected to kafka08.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:14:56,845 INFO kafka.producer.SyncProducer - Connected to kafka20.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:14:56,852 INFO kafka.producer.SyncProducer - Connected to kafka15.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:14:56,857 INFO kafka.producer.SyncProducer - Connected to kafka17.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:14:56,861 INFO kafka.producer.SyncProducer - Connected to kafka18.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:15:56,814 INFO org.apache.flink.yarn.YarnJobManager - Attempting to recover all jobs.
>>> 2017-08-10 13:15:56,820 INFO org.apache.flink.yarn.YarnJobManager - There are 1 jobs to recover. Starting the job recovery.
>>> 2017-08-10 13:15:56,821 INFO org.apache.flink.yarn.YarnJobManager - Attempting to recover job fea2eb82ef98849be2a8ac1731379193.
>>> 2017-08-10 13:15:56,886 INFO kafka.producer.SyncProducer - Connected to kafka13.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:15:56,889 INFO kafka.producer.SyncProducer - Connected to kafka14.sto.midasplayer.com:9092 for producing
>>> 2017-08-10 13:15:56,943 INFO kafka.producer.SyncProducer - Connected to kafka12.sto.midasplayer.com:9092 for producing
>>>
>>> Disregard the Kafka logs; they belong to a metrics reporter.
>>>
>>> Any idea what might cause the delay here? What is happening in the
>>> background?
>>>
>>> Thanks,
>>> Gyula
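For readers who hit the same delay: the knob Aljoscha mentions is set in flink-conf.yaml. A minimal sketch, assuming a Flink 1.3.x YARN deployment with ZooKeeper HA already configured elsewhere; the 10 s value and the duration format are illustrative only, not a recommendation:

    # flink-conf.yaml (excerpt, illustrative values)
    high-availability: zookeeper
    # Delay before the new JobManager leader starts recovering jobs after a
    # fail-over; the 120000 ms in the log above corresponds to this setting.
    high-availability.job.delay: 10 s

Lowering the delay makes recovery start sooner, but the delay exists to give TaskManagers time to re-register with the new JobManager, so setting it too low can leave the job without enough slots when it is resubmitted.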