[jira] [Commented] (NIFI-7866) When cluster coordinator dies, other nodes may have trouble rejoining cluster

ASF subversion and git services (Jira) Fri, 05 Feb 2021 12:21:38 -0800


    [ 
https://issues.apache.org/jira/browse/NIFI-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279975#comment-17279975
 ]


ASF subversion and git services commented on NIFI-7866:
-------------------------------------------------------

Commit 749d05840ba88efc8b42f5434d9223104edfab68 in nifi's branch 
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=749d058 ]

NIFI-8204, NIFI-7866: Send revision update count in heartbeats. If update count 
in heartbeat is greater than that of cluster coordinator, request that node 
reconnect to get most up-to-date revisions. Cannot check exact equality, as the 
values may change between the time a heartbeat is created and the time the 
cluster coordinator receives it. However, it should be safe to assume that the 
revision won't be greater than that of the cluster coordinator. There is a tiny 
window in which it could be, as the sending node may update its revision, 
create the heartbeat, send it, and cluster coordinator process it before 
updating its own revision. However, this window is incredibly small and would 
only result in the sending node reconnecting, which will resolve itself. Also, 
when testing this fix, encountered NIFI-7866 and addressed that 
NullPointerException.

This closes #4806.

Signed-off-by: Bryan Bende <[email protected]>
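
The comparison described in the commit message can be sketched roughly as
follows. This is a minimal illustration, not the code from the commit: the
class name RevisionUpdateCountCheck, the method shouldRequestReconnect, and
both parameter names are hypothetical.

```java
// Hypothetical sketch; these names are illustrative and are not NiFi's API.
final class RevisionUpdateCountCheck {

    private RevisionUpdateCountCheck() {
    }

    /**
     * Decides whether the coordinator should ask a heartbeating node to reconnect.
     *
     * Exact equality is deliberately not checked: the counts may change between
     * the time the heartbeat is created and the time it is received. Only a
     * heartbeat count strictly greater than the coordinator's count is treated
     * as a sign that the revisions have diverged.
     */
    static boolean shouldRequestReconnect(final long heartbeatUpdateCount,
                                          final long coordinatorUpdateCount) {
        return heartbeatUpdateCount > coordinatorUpdateCount;
    }
}
```

Under this sketch, a node that hits the small race window described above
would simply be asked to reconnect once, which the commit message describes as
self-resolving.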


> When cluster coordinator dies, other nodes may have trouble rejoining cluster
> -----------------------------------------------------------------------------
>
>                 Key: NIFI-7866
>                 URL: https://issues.apache.org/jira/browse/NIFI-7866
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.13.0
>
>
> When the cluster coordinator is lost, the nodes must now begin communicating 
> with a newly elected Cluster Coordinator. This is handled through the 
> StandardFlowService.
> When the `handleReconnectionRequest` method is called and the request 
> provided does not contain the dataflow, the node is to connect to the cluster 
> coordinator and request the dataflow:
> {code:java}
> private void handleReconnectionRequest(final ReconnectionRequestMessage request) {
>     try {
>         logger.info("Processing reconnection request from cluster coordinator.");
>         // reconnect
>         ConnectionResponse connectionResponse = new ConnectionResponse(getNodeId(), request.getDataFlow(),
>                 request.getInstanceId(), request.getNodeConnectionStatuses(), request.getComponentRevisions());
>         if (connectionResponse.getDataFlow() == null) {
>             logger.info("Received a Reconnection Request that contained no DataFlow. Will attempt to connect to cluster using local flow.");
>             connectionResponse = connect(false, false, createDataFlowFromController());
>         }
>         loadFromConnectionResponse(connectionResponse);
> ... {code}
> However, if the call above to `connect(false, false, 
> createDataFlowFromController())` returns null (which is a valid case), that 
> null value is passed along to `loadFromConnectionResponse`. This method 
> expects a non-null connectionResponse and throws a NullPointerException, 
> resulting in the following stack trace (based on NiFi 1.11.4):
> {code:java}
> 2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
> org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
>     at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
>     at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
>     at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
>     at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException: null
>     at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
>     ... 4 common frames omitted
> {code}
> This results in the node not reconnecting to the cluster.
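
One way to avoid passing a null response to the loader is a guard like the one
below. This is a minimal sketch under stated assumptions, not the fix that was
committed: ReconnectGuard and connectAndLoad are hypothetical names, and the
Supplier and Consumer stand in for the connect(...) and
loadFromConnectionResponse(...) calls quoted in the description.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch; ReconnectGuard is not a NiFi class.
final class ReconnectGuard {

    private ReconnectGuard() {
    }

    /**
     * Runs the connect step and hands the result to the load step only when a
     * response was actually obtained.
     *
     * @param connect stand-in for the connect(...) call, which may return null
     * @param load    stand-in for loadFromConnectionResponse(...)
     * @return true if a response was loaded, false if connect returned null
     */
    static <R> boolean connectAndLoad(final Supplier<R> connect, final Consumer<R> load) {
        final R response = connect.get();
        if (response == null) {
            // No response: skip loading instead of triggering a NullPointerException.
            return false;
        }
        load.accept(response);
        return true;
    }
}
```

A caller could use the boolean result to log the failed attempt and decide
whether to retry; what the node should do in that case is left open here.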



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
