[jira] [Created] (NIFI-8204) When Cluster Coordinator dies suddenly, is possible for Component Revisions to be inconsistent across nodes in cluster

Mark Payne (Jira) Fri, 05 Feb 2021 07:51:04 -0800

Mark Payne created NIFI-8204:
--------------------------------

             Summary: When Cluster Coordinator dies suddenly, is possible for 
Component Revisions to be inconsistent across nodes in cluster
                 Key: NIFI-8204
                 URL: https://issues.apache.org/jira/browse/NIFI-8204
             Project: Apache NiFi
          Issue Type: Bug
          Components: Core Framework
            Reporter: Mark Payne
            Assignee: Mark Payne
             Fix For: 1.13.0



I encountered a scenario in a 2-node cluster where Node 0 was the Cluster 
Coordinator. It died suddenly and was restarted by the RunNiFi process. The 
restart occurred before the ZooKeeper session timed out. Once the node had 
rejoined the cluster, attempts to modify a component began failing with the 
error "Node xyz is unable to fulfill this request due to  [0, null, 
<uuid>] is not the most up-to-date revision. This component appears to have 
been modified."

Refreshing the browser did not help. This indicates that nodes in the cluster 
have different component revisions.

After looking through logs, here is the series of events that led to this 
situation:

 
 1. Node 0 restarts but is still Cluster Coordinator. Has topology showing all 
    nodes disconnected, all revisions empty.
 2. Node 1 heartbeats to Node 0. Node 0 responds saying: Your cluster topology 
    is wrong. node-1 should be DISCONNECTED due to Has Not Yet Connected.
 3. Node 1 updates topology as directed.
 4. Node 1 becomes cluster coordinator because Node 0 hasn't yet connected and 
    its ZooKeeper session times out.
 5. Node 1 receives heartbeat from itself.
 6. Node 1 determines that it hasn't yet connected (based on topology received 
    from Node 0) so issues reconnection request.
 7. Node 1 changes state of Node 1 from DISCONNECTED to CONNECTING. Notifies 
    Node 0 of the topology update.
 8. Node 1 relinquishes role as cluster coordinator.
 9. Node 1 requests (to itself) to join cluster.
10. Node 1 receives ConnectionResponse (from itself) that includes a collection 
    of 79 revisions.
11. Node 0 finishes startup. Has set of empty revisions.
12. Node 0 becomes cluster coordinator.
13. Node 1 sends heartbeat to Node 0.
14. Node 0 marks Node 1 as Connected to Cluster.
 
We should address this by keeping track of the number of updates to the 
Revision Manager and sending this in Heartbeat messages. When the Cluster 
Coordinator receives a heartbeat, it should compare the update count to its own 
internal update count. If the heartbeat's update count is higher, it should 
request that the sending node reconnect to the cluster. This will ensure that 
if this situation were to arise again, the node would reconnect and get the 
most up-to-date set of revisions.
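
A minimal sketch of the proposed check, to make the comparison concrete. This 
is not NiFi's actual code: the class, method, and field names below 
(CountingRevisionManager, Heartbeat, onHeartbeat, HeartbeatAction) are 
hypothetical stand-ins, and the update count of 5 is purely illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;

public class RevisionUpdateCountSketch {

    /** Hypothetical stand-in for a Revision Manager that counts its own updates. */
    static final class CountingRevisionManager {
        private final AtomicLong updateCount = new AtomicLong();

        /** Called whenever any component revision changes on this node. */
        void recordUpdate() {
            updateCount.incrementAndGet();
        }

        long getUpdateCount() {
            return updateCount.get();
        }
    }

    /** Hypothetical heartbeat payload carrying the sender's update count. */
    static final class Heartbeat {
        final String nodeId;
        final long revisionUpdateCount;

        Heartbeat(String nodeId, long revisionUpdateCount) {
            this.nodeId = nodeId;
            this.revisionUpdateCount = revisionUpdateCount;
        }
    }

    enum HeartbeatAction { ACCEPT, REQUEST_RECONNECT }

    /**
     * Coordinator-side check: if the sending node has seen more revision
     * updates than the coordinator has, ask that node to reconnect.
     */
    static HeartbeatAction onHeartbeat(CountingRevisionManager coordinatorRevisions,
                                       Heartbeat heartbeat) {
        if (heartbeat.revisionUpdateCount > coordinatorRevisions.getUpdateCount()) {
            return HeartbeatAction.REQUEST_RECONNECT;
        }
        return HeartbeatAction.ACCEPT;
    }

    public static void main(String[] args) {
        // Node 0 has just restarted, so its update count is zero.
        CountingRevisionManager node0 = new CountingRevisionManager();

        // Node 1 kept running and has recorded some updates (illustrative count).
        CountingRevisionManager node1 = new CountingRevisionManager();
        for (int i = 0; i < 5; i++) {
            node1.recordUpdate();
        }

        Heartbeat heartbeat = new Heartbeat("node-1", node1.getUpdateCount());
        System.out.println(onHeartbeat(node0, heartbeat));
    }
}
```

In the scenario above, Node 0 comes back with an update count of zero while 
Node 1 reports a higher count, so the check yields REQUEST_RECONNECT instead of 
silently marking Node 1 as connected with mismatched revisions.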



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
