Hey all,

Just wanted to report this for posterity in case someone else sees something 
similar. We were running Flink 1.2.1 in Kubernetes, in an HA setup with an 
external ZooKeeper ensemble and S3 for checkpointing. We recently noticed a job 
that appeared to have deadlocked on JobManager leader election. We think the 
issue unfolded roughly as follows:

  1.  Job was up and running normally
  2.  Some cluster event caused the JobManager Pod (process) to get restarted.
  3.  JobManager came up again but got stuck in leader election. At this point 
the JobManager UI responded with "Service temporarily unavailable due to an 
ongoing leader election. Please refresh."
  4.  JobManager never exited this state.
  5.  This state persisted across Pod restarts/deletes.

To try to pin down the problem we brought the job up in another Flink cluster 
and started debugging. As a first step I raised the logging level on the 
JobManager and applied the change (full logs attached to this email).
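
For reference, the logging change was roughly along these lines. This is a 
hypothetical excerpt of conf/log4j.properties on the JobManager image; the 
exact logger names may differ depending on your build:

  # Hypothetical log4j.properties excerpt: turn up Flink HA, Curator and
  # ZooKeeper client logging on the JobManager.
  log4j.logger.org.apache.flink.runtime.leaderretrieval=DEBUG
  log4j.logger.org.apache.flink.runtime.leaderelection=DEBUG
  log4j.logger.org.apache.flink.runtime.jobmanager=DEBUG
  log4j.logger.org.apache.flink.shaded.org.apache.curator=TRACE
  log4j.logger.org.apache.zookeeper=DEBUG

The resulting log: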


2017-09-05 18:50:55,072 TRACE 
org.apache.flink.shaded.org.apache.curator.utils.DefaultTracerDriver  - Trace: 
GetDataBuilderImpl-Background - 4 ms

2017-09-05 18:50:55,075 DEBUG 
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Received 
CHILD_ADDED event (path: /4e4cdc8fb8c1437c620cd4063bd265e1)

2017-09-05 18:50:55,088 DEBUG 
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Received 
CHILD_ADDED event notification for job 4e4cdc8fb8c1437c620cd4063bd265e1

2017-09-05 18:50:56,072 DEBUG org.apache.zookeeper.ClientCnxn                   
            - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: 
clientPath:null serverPath:null finished:false header:: 1,3  replyHeader:: 
1,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com,F  
response:: 
s{4305881524,4305881524,1497456679255,1497456679255,0,3,0,0,0,3,21475572278}

2017-09-05 18:50:56,078 DEBUG org.apache.zookeeper.ClientCnxn                   
            - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: 
clientPath:null serverPath:null finished:false header:: 2,3  replyHeader:: 
2,21476396435,0  request:: 
'/traveler-profile.us-west-2c.test.expedia.com/krazyglue,F  response:: 
s{4305881525,4305881525,1497456679260,1497456679260,0,1,0,0,0,1,4305881526}

2017-09-05 18:50:56,079 DEBUG org.apache.zookeeper.ClientCnxn                   
            - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: 
clientPath:null serverPath:null finished:false header:: 3,3  replyHeader:: 
3,21476396435,0  request:: 
'/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip,F  response:: 
s{4305881526,4305881526,1497456679263,1497456679263,0,1,0,0,0,1,4305881527}

2017-09-05 18:50:56,080 DEBUG org.apache.zookeeper.ClientCnxn                   
            - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: 
clientPath:null serverPath:null finished:false header:: 4,3  replyHeader:: 
4,21476396435,0  request:: 
'/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink,F  
response:: 
s{4305881527,4305881527,1497456679267,1497456679267,0,1,0,0,0,1,4305881528}

2017-09-05 18:50:56,081 DEBUG org.apache.zookeeper.ClientCnxn                   
            - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: 
clientPath:null serverPath:null finished:false header:: 5,3  replyHeader:: 
5,21476396435,0  request:: 
'/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default,F  
response:: 
s{4305881528,4305881528,1497456679270,1497456679270,0,27,0,0,0,5,21475449005}

2017-09-05 18:50:56,085 INFO  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
Starting ZooKeeperLeaderRetrievalService.

2017-09-05 18:50:56,087 DEBUG org.apache.zookeeper.ClientCnxn                   
            - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: 
clientPath:null serverPath:null finished:false header:: 6,3  replyHeader:: 
6,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com,F  
response:: 
s{4305881524,4305881524,1497456679255,1497456679255,0,3,0,0,0,3,21475572278}

2017-09-05 18:50:56,087 DEBUG org.apache.zookeeper.ClientCnxn                   
            - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: 
clientPath:null serverPath:null finished:false header:: 7,3  replyHeader:: 
7,21476396435,0  request:: 
'/traveler-profile.us-west-2c.test.expedia.com/krazyglue,F  response:: 
s{4305881525,4305881525,1497456679260,1497456679260,0,1,0,0,0,1,4305881526}

2017-09-05 18:50:56,088 DEBUG org.apache.zookeeper.ClientCnxn                   
            - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: 
clientPath:null serverPath:null finished:false header:: 8,3  replyHeader:: 
8,21476396435,0  request:: 
'/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip,F  response:: 
s{4305881526,4305881526,1497456679263,1497456679263,0,1,0,0,0,1,4305881527}

2017-09-05 18:50:56,090 DEBUG org.apache.zookeeper.ClientCnxn                   
            - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: 
clientPath:null serverPath:null finished:false header:: 9,3  replyHeader:: 
9,21476396435,0  request:: 
'/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink,F  
response:: 
s{4305881527,4305881527,1497456679267,1497456679267,0,1,0,0,0,1,4305881528}

2017-09-05 18:50:56,091 DEBUG org.apache.zookeeper.ClientCnxn                   
            - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: 
clientPath:null serverPath:null finished:false header:: 10,3  replyHeader:: 
10,21476396435,0  request:: 
'/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default,F  
response:: 
s{4305881528,4305881528,1497456679270,1497456679270,0,27,0,0,0,5,21475449005}

2017-09-05 18:50:56,092 DEBUG org.apache.zookeeper.ClientCnxn                   
            - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: 
clientPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader
 
serverPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader
 finished:false header:: 11,3  replyHeader:: 11,21476396435,0  request:: 
'/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader,T
  response:: 
s{21475449005,21475449005,1503519415753,1503519415753,0,1,0,0,0,1,21475449006}

2017-09-05 18:50:56,092 TRACE 
org.apache.flink.shaded.org.apache.curator.utils.DefaultTracerDriver  - Trace: 
ExistsBuilderImpl-Background - 1 ms

2017-09-05 18:50:56,093 DEBUG org.apache.zookeeper.ClientCnxn                   
            - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: 
clientPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader
 
serverPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader
 finished:false header:: 12,4  replyHeader:: 12,21476396435,0  request:: 
'/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader,T
  response:: 
,s{21475449005,21475449005,1503519415753,1503519415753,0,1,0,0,0,1,21475449006}

2017-09-05 18:50:56,093 TRACE 
org.apache.flink.shaded.org.apache.curator.utils.DefaultTracerDriver  - Trace: 
GetDataBuilderImpl-Background - 1 ms

2017-09-05 18:50:56,093 DEBUG 
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
Leader node has changed.

At this point I dug into the logs to try to see what was going on. It quickly 
became apparent that this probably had something to do with Apache Curator, 
which Flink relies on for leader election. I then grabbed a dump of ZooKeeper 
using https://github.com/ctapmex/zkTreeUtil (a rough Curator-based equivalent 
is sketched after the dump):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root path="/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink">
  <zknode name="default">
    <zknode name="leaderlatch">
      <zknode name="00000000000000000000000000000000">
        <zknode name="job_manager_lock"/>
      </zknode>
      <zknode ephemeral="true" 
name="_c_9cd3f122-65de-40e7-8e2b-aa34e6a85b83-latch-0000000020"/>
    </zknode>
    <zknode name="checkpoint-counter">
      <zknode name="4e4cdc8fb8c1437c620cd4063bd265e1" value="&#0;&#0;&#0;&#1;"/>
    </zknode>
    <zknode name="checkpoints">
      <zknode name="4e4cdc8fb8c1437c620cd4063bd265e1"/>
    </zknode>
    <zknode name="jobgraphs">
      <zknode name="4e4cdc8fb8c1437c620cd4063bd265e1" 
value="��&#0;&#5;sr&#0;;org.apache.flink.runtime.state.RetrievableStreamStateHandle&#0;&#1;&#30;&#24;�U�+&#2;&#0;&#1;L&#0;&#24;wrappedStreamStateHandlet&#0;2Lorg/apache/flink/runtime/state/StreamStateHandle;xpsr&#0;9org.apache.flink.runtime.state.filesystem.FileStateHandle&#4;�u�b�&#27;�&#2;&#0;&#2;J&#0;&#9;stateSizeL&#0;&#8;filePatht&#0;&#31;Lorg/apache/flink/core/fs/Path;xp&#0;&#0;&#0;&#0;&#0;&#15;h�sr&#0;&#29;org.apache.flink.core.fs.Path&#0;&#0;&#0;&#0;&#0;&#0;&#0;&#1;&#2;&#0;&#1;L&#0;&#3;urit&#0;&#14;Ljava/net/URI;xpsr&#0;&#12;java.net.URI�&#1;x.C�I�&#3;&#0;&#1;L&#0;&#6;stringt&#0;&#18;Ljava/lang/String;xpt&#0;�s3a://ewetest-traveler-profile/client-state/krazyglue/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/zookeeperCheckpoints/submittedJobGraph6507f982fe76x"/>
    </zknode>
    <zknode name="leader">
      <zknode name="00000000000000000000000000000000"/>
    </zknode>
  </zknode>
</root>
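
For anyone without zkTreeUtil handy, the same kind of dump can be approximated 
with a small Curator tree walk. This is just a sketch: the connection string 
and root path below are placeholders for your own ensemble and 
high-availability.zookeeper.path.root, and it prints paths only, not node data:

  import org.apache.curator.framework.CuratorFramework;
  import org.apache.curator.framework.CuratorFrameworkFactory;
  import org.apache.curator.retry.ExponentialBackoffRetry;

  public class ZkTreeDump {
      public static void main(String[] args) throws Exception {
          // Placeholder connection string; point this at the same ensemble Flink uses.
          CuratorFramework client = CuratorFrameworkFactory.newClient(
                  "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
          client.start();
          try {
              // Placeholder root; in our case this was .../krazyglue/mip/flink.
              dump(client, "/flink");
          } finally {
              client.close();
          }
      }

      private static void dump(CuratorFramework client, String path) throws Exception {
          System.out.println(path);
          for (String child : client.getChildren().forPath(path)) {
              dump(client, ("/".equals(path) ? "" : path) + "/" + child);
          }
      }
  }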

In a healthy cluster there is generally a node under “job_manager_lock” which 
represents the elected leader, so this pointed to a problem with the Apache 
Curator framework. I opened up both Flink 1.2.1 and 1.3.2 to see which version 
of Curator each uses: Flink 1.3.2 uses Curator 2.12.0 whereas Flink 1.2.1 uses 
2.8.0. Looking at the Curator release notes, there are quite a few bug fixes 
that might explain this (https://issues.apache.org/jira/browse/CURATOR-264, for 
example). I was curious to see whether leader election would become unstuck if 
the cluster was upgraded to Flink 1.3.2. I upgraded the JobManager to 1.3.2 
and, sure enough, leader election proceeded correctly. After upgrading the 
TaskManagers the job restored from its last checkpoint correctly and continued.
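
For context on what Curator is doing here: as far as I can tell from the code, 
Flink's leader election is built on Curator's LeaderLatch recipe, which creates 
an ephemeral sequential node under the latch path for each contender and grants 
leadership to the contender with the lowest sequence number. A stripped-down 
sketch of that pattern (not Flink's actual wiring; paths and connection string 
are placeholders):

  import org.apache.curator.framework.CuratorFramework;
  import org.apache.curator.framework.CuratorFrameworkFactory;
  import org.apache.curator.framework.recipes.leader.LeaderLatch;
  import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
  import org.apache.curator.retry.ExponentialBackoffRetry;

  public class LeaderLatchSketch {
      public static void main(String[] args) throws Exception {
          // Placeholder ensemble address; Flink takes this from high-availability.zookeeper.quorum.
          CuratorFramework client = CuratorFrameworkFactory.newClient(
                  "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
          client.start();

          // Each contender gets an ephemeral sequential child under this path;
          // the lowest sequence number wins the election.
          LeaderLatch latch = new LeaderLatch(client, "/flink/default/leaderlatch");
          latch.addListener(new LeaderLatchListener() {
              @Override
              public void isLeader() {
                  // Roughly the point at which Flink would publish the new
                  // leader's address under the .../leader node.
                  System.out.println("Granted leadership");
              }

              @Override
              public void notLeader() {
                  System.out.println("Leadership revoked");
              }
          });
          latch.start();

          Thread.sleep(Long.MAX_VALUE); // keep the session (and ephemeral latch node) alive
      }
  }

In our stuck cluster the ephemeral latch node for the contender was present, 
but leadership never seemed to be granted.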

Other interesting things of note:

  1.  Restarting/killing the JobManager didn’t resolve the issue. I suspect 
this is due to bad state in ZooKeeper.
  2.  ZooKeeper was running throughout this time. We didn’t see issues in our 
Kafka cluster, ZooKeeper had a leader elected correctly, and the ZooKeeper 
command-line utility worked without issue.
  3.  Restarts of the JobManager appeared to correctly create ephemeral nodes 
in ZooKeeper to represent possible leaders, and these also appeared to get 
cleaned up correctly (a quick way to verify this is sketched after this list).
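
A quick way to check point 3 is to look at the ephemeralOwner field of each 
latch child's Stat: it is 0 for persistent nodes and the owning session id for 
ephemeral ones. A hypothetical sketch with the plain ZooKeeper client 
(connection string and latch path are placeholders):

  import java.util.List;
  import org.apache.zookeeper.ZooKeeper;
  import org.apache.zookeeper.data.Stat;

  public class CheckLatchNodes {
      public static void main(String[] args) throws Exception {
          // Placeholders for the ensemble and the cluster's leaderlatch path.
          ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> { });
          String latchPath = "/flink/default/leaderlatch";

          List<String> children = zk.getChildren(latchPath, false);
          for (String child : children) {
              Stat stat = zk.exists(latchPath + "/" + child, false);
              // ephemeralOwner == 0 means the node is persistent; a latch child
              // that is not ephemeral (or whose session is gone) is suspicious.
              System.out.printf("%s ephemeralOwner=0x%x%n", child, stat.getEphemeralOwner());
          }
          zk.close();
      }
  }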

As far as I can tell this issue is resolved by the newer Apache Curator version 
in 1.3.2. It may be useful to keep an eye out for it, however.

Thanks,
James Bucher
