[jira] [Commented] (NIFI-8196) When a node is disconnected due to failing to service a request, upon cluster reconnection it may not participate in leader election

ASF subversion and git services (Jira) Fri, 05 Feb 2021 12:21:10 -0800


    [ 
https://issues.apache.org/jira/browse/NIFI-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279972#comment-17279972
 ]


ASF subversion and git services commented on NIFI-8196:
-------------------------------------------------------

Commit 03fd59eb2fa21fdd693a37da0f7fd402bbc74933 in nifi's branch 
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=03fd59e ]

NIFI-8196: When node is reconnected to cluster, ensure that it re-registers for 
election of cluster coordinator / primary node. On startup, if cluster 
coordinator is already registered and is 'this node' then register silently as 
coordinator and do not join the cluster until there is no Cluster Coordinator 
or another node is elected. This allows the ZooKeeper session timeout to elapse.

Signed-off-by: Bryan Bende <[email protected]>


> When a node is disconnected due to failing to service a request, upon cluster 
> reconnection it may not participate in leader election
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-8196
>                 URL: https://issues.apache.org/jira/browse/NIFI-8196
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Blocker
>             Fix For: 1.13.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> NIFI-7920 fixed a bug that could result in nodes getting the wrong Revision for 
> some components. The fix for that, however, appears to have caused a 
> regression. When a node is disconnected because it failed to service a 
> replicated API request, such as a component being stopped/started/moved, it 
> now unregisters from leader election for Primary Node / Cluster 
> Coordinator. However, if it then reconnects, it does not re-register for the 
> roles. As a result, a node that disconnects and reconnects is never able to 
> become Cluster Coordinator. If this happens to all nodes in a cluster, no 
> node remains eligible to become Cluster Coordinator. This results in logs 
> such as:
> {code:java}
> 2021-02-03 20:14:55,167 WARN [Clustering Tasks Thread-3] 
> o.apache.nifi.controller.FlowController Failed to send heartbeat due to: 
> java.lang.IllegalArgumentException: Cannot send heartbeat to address []. 
> Address must be in <hostname>:<port> format {code}
> And errors in the UI stating:
> {code:java}
> Action cannot be performed because there is currently no Cluster Coordinator 
> elected. The request should be tried again after a moment, after a Cluster 
> Coordinator has been automatically elected.. Returning Service Unavailable 
> response. {code}
> At this point, no Cluster Coordinator will be elected until the nodes are 
> restarted.
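
The failure mode described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not NiFi's implementation: `ElectionRegistry`, `Node`, `simulate`, and the node addresses are hypothetical names invented for illustration, and the real election is backed by ZooKeeper rather than a local set. The model assumes only what the description says: a node unregisters from the election when it is disconnected, and only registered nodes can be elected.

```java
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

// Toy model of the regression; none of these names are NiFi classes.
class ElectionSketch {

    /** Hypothetical stand-in for the election service: only registered nodes can be elected. */
    static class ElectionRegistry {
        private final Set<String> candidates = new LinkedHashSet<>();

        void register(String nodeId) { candidates.add(nodeId); }

        void unregister(String nodeId) { candidates.remove(nodeId); }

        /** Picks the first registered candidate in the set's current order, or nobody if none are registered. */
        Optional<String> electCoordinator() {
            return candidates.stream().findFirst();
        }
    }

    /** Hypothetical cluster node that registers for the election when it is created. */
    static class Node {
        private final String id;
        private final ElectionRegistry registry;
        private final boolean reRegisterOnReconnect;
        private boolean connected = true;

        Node(String id, ElectionRegistry registry, boolean reRegisterOnReconnect) {
            this.id = id;
            this.registry = registry;
            this.reRegisterOnReconnect = reRegisterOnReconnect;
            registry.register(id);
        }

        /** Disconnection removes the node from the election, as in the description. */
        void disconnect() {
            connected = false;
            registry.unregister(id);
        }

        /** The regression corresponds to reRegisterOnReconnect == false. */
        void reconnect() {
            connected = true;
            if (reRegisterOnReconnect) {
                registry.register(id);
            }
        }

        boolean isConnected() { return connected; }
    }

    /** Disconnects and reconnects every node, then asks who can be elected. */
    static Optional<String> simulate(boolean reRegisterOnReconnect) {
        ElectionRegistry registry = new ElectionRegistry();
        Node[] nodes = {
            new Node("node-1:11443", registry, reRegisterOnReconnect),
            new Node("node-2:11443", registry, reRegisterOnReconnect)
        };
        for (Node node : nodes) {
            node.disconnect();
            node.reconnect();
        }
        return registry.electCoordinator();
    }

    public static void main(String[] args) {
        System.out.println("without re-registration: " + simulate(false));
        System.out.println("with re-registration:    " + simulate(true));
    }
}
```

In this model, every node is connected again after the loop, yet without re-registration the candidate set is empty, so no coordinator can be elected until the nodes are recreated (the analogue of a restart). Re-registering on reconnect restores eligibility.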



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
