Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

Gerard Garcia Wed, 18 Jul 2018 01:17:55 -0700

Hi vino,

It seems that job ids stay in /jobgraphs when we cancel jobs manually. For
example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09
the entry is still in zookeeper's path /flink/default/jobgraphs, but the
job has disappeared from /home/nas/flink/ha/default/blob/.

This is the client log:

09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.503 [main] INFO
org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly
available BLOB storage directory at file:///home/nas/flink/ha//default/blob
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  -
Enforcing default ACL for ZK connections
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  -
Using '/flink-eur/default' as Zookeeper namespace.
09:20:58.539 [main] INFO
o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
09:20:58.543 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:zookeeper.version=
3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13
GMT
09:20:58.543 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:host.name=flink-eur-production1
09:20:58.543 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.version=1.8.0_131
09:20:58.544 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.vendor=Oracle Corporation
09:20:58.546 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.home=/opt/jdk/jdk1.8.0_131/jre
09:20:58.546 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
09:20:58.546 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
09:20:58.546 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.io.tmpdir=/tmp
09:20:58.546 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:java.compiler=<NA>
09:20:58.547 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:os.name=Linux
09:20:58.547 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:os.arch=amd64
09:20:58.547 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:os.version=4.9.87-xxxx-std-ipv6-64
09:20:58.547 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:user.name=root
09:20:58.547 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:user.home=/root
09:20:58.547 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
environment:user.dir=/opt/flink/flink-1.5.0/bin
09:20:58.548 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating
client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181
sessionTimeout=60000
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
configuration failed: javax.security.auth.login.LoginException: No JAAS
configuration section named 'Client' was found in specified JAAS
configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue
connection to Zookeeper server without SASL authentication, if Zookeeper
server allows it.
09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening
socket connection to server 10.1.1.5/10.1.1.5:2181
09:20:58.556 [main-EventThread] ERROR
o.a.flink.shaded.curator.org.apache.curator.ConnectionState  -
Authentication failed
09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
connection established to 10.1.1.5/10.1.1.5:2181, initiating session
09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session
establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid =
0x100571bda1903b7, negotiated timeout = 40000
09:20:58.593 [main-EventThread] INFO
o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State
change: CONNECTED
09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest
client endpoint started.
09:20:58.713 [main] INFO
o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting
ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.755 [main] INFO
o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting
ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  -
Shutting down rest endpoint.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest
endpoint shutdown complete.
09:20:58.947 [main] INFO
o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.948 [main] INFO
o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.949 [Curator-Framework-0] INFO
o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  -
backgroundOperationsLoop exiting
09:20:58.968 [main] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session:
0x100571bda1903b7 closed
09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.969 [main-EventThread] INFO
o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread
shut down for session: 0x100571bda1903b7

I'm assuming that /jobgraphs should only contain the ids of jobs that are
currently running (at least it seemed that when the jobmanager restarted it
tried to restart the job ids stored there). Is that correct?
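
In case it helps to reproduce the check, here is a minimal sketch of how the
comparison could be scripted. It assumes you paste in the output of listing
/flink/default/jobgraphs (e.g. `ls` in zkCli.sh) and the output of
`bin/flink list -r`; the function names and the sample strings are
hypothetical, and only the first job id comes from this thread. It only
reports ids, it does not delete anything.

```python
import re

# Flink job ids are 32 lowercase hex characters,
# e.g. 75e16686cb4fe0d33ead8e29af131d09.
JOB_ID = re.compile(r"\b[0-9a-f]{32}\b")


def extract_job_ids(text):
    """Return the set of job ids found anywhere in a blob of text."""
    return set(JOB_ID.findall(text))


def stale_jobgraphs(zk_listing, running_listing):
    """Ids listed under /jobgraphs that are not among the running jobs."""
    return sorted(extract_job_ids(zk_listing) - extract_job_ids(running_listing))


if __name__ == "__main__":
    # Hypothetical pasted output (assumption): the second id and the job
    # name are made up for illustration.
    zk_listing = (
        "[75e16686cb4fe0d33ead8e29af131d09, 0123456789abcdef0123456789abcdef]"
    )
    running_listing = (
        "18.07.2018 09:00:00 : 0123456789abcdef0123456789abcdef"
        " : example-job (RUNNING)"
    )
    for job_id in stale_jobgraphs(zk_listing, running_listing):
        print("in /jobgraphs but not running:", job_id)
```

Matching on the id pattern rather than on the exact output layout keeps the
sketch independent of how either tool formats its listing.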

Gerard

On Wed, Jul 18, 2018 at 9:17 AM vino yang <[email protected]> wrote:

> Hi Gerard,
>
> From the information you provided, do you mean that the Zookeeper path
> "/jobgraphs" contains more jobs than you submitted, and that they cannot
> be restarted because the blob files cannot be found?
>
> Can you provide more details, such as the stack trace, the log, and which
> version of Flink you are running? Normally, a jobgraph is only added to
> Zookeeper when a job is submitted manually.
>
> Thanks, vino.
>
> 2018-07-16 21:19 GMT+08:00 gerardg <[email protected]>:
>
>> Hi,
>>
>> Our deployment consists of a standalone HA cluster of 8 machines with an
>> external Zookeeper cluster. We have observed several times that when a
>> jobmanager fails and a new one is elected, the new one tries to restart
>> more jobs than the ones that were running and, since it can't find some
>> files, it fails and gets stuck in a restart loop. This is the error that
>> we see in the logs:
>>
>>
>>
>> These are the contents of /home/nas/flink/ha/default/blob/:
>>
>>
>>
>> We've checked zookeeper and there are actually a lot of jobgraphs in
>> /flink/default/jobgraphs
>>
>>
>>
>> There were only three jobs running so neither zookeeper nor the flink 'ha'
>> folder seems to have the correct number of jobgraphs stored.
>>
>> The only way we have to solve this is to remove everything at path /flink
>> in zookeeper and the 'ha' flink folder and restart the jobs manually.
>>
>> I'll try to monitor whether some action (e.g. we have been canceling and
>> restoring jobs from savepoints quite often lately) leaves an entry in
>> zookeeper's path /flink/default/jobgraphs for a job that is not running,
>> but maybe someone can point us to some configuration problem that could
>> cause this behavior.
>>
>> Thanks,
>>
>> Gerard
>>
