[
https://issues.apache.org/jira/browse/NIFI-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279974#comment-17279974
]
ASF subversion and git services commented on NIFI-7866:
-------------------------------------------------------
Commit 749d05840ba88efc8b42f5434d9223104edfab68 in nifi's branch
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=749d058 ]
NIFI-8204, NIFI-7866: Send revision update count in heartbeats. If update count
in heartbeat is greater than that of cluster coordinator, request that node
reconnect to get most up-to-date revisions. Cannot check exact equality, as the
values may change between the time a heartbeat is created and the time the
cluster coordinator receives it. However, it should be safe to assume that the
revision won't be greater than that of the cluster coordinator. There is a tiny
window in which it could be, as the sending node may update its revision,
create the heartbeat, send it, and cluster coordinator process it before
updating its own revision. However, this window is incredibly small and would
only result in the sending node reconnecting, which will resolve itself. Also,
when testing this fix, encountered NIFI-7866 and addressed that
NullPointerException.
This closes #4806.
Signed-off-by: Bryan Bende
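
The comparison described in the commit message can be sketched roughly as follows. This is only an illustration of the "greater than, not unequal" rule; the class and method names are hypothetical and are not taken from the NiFi code base.

```java
// Hypothetical sketch: names are illustrative and are NOT NiFi's actual API.
final class RevisionHeartbeatCheck {

    /**
     * Decides whether the coordinator should ask the heartbeating node to reconnect.
     *
     * @param heartbeatUpdateCount   revision update count carried in the heartbeat
     * @param coordinatorUpdateCount revision update count currently held by the coordinator
     */
    static boolean shouldRequestReconnect(final long heartbeatUpdateCount,
                                          final long coordinatorUpdateCount) {
        // Strictly greater, not "not equal": the coordinator's count may legitimately
        // advance between the time the heartbeat is created and the time it is received,
        // so a smaller or equal value in the heartbeat is not treated as a problem.
        return heartbeatUpdateCount > coordinatorUpdateCount;
    }

    public static void main(final String[] args) {
        System.out.println(shouldRequestReconnect(12L, 10L)); // node ahead of coordinator: true
        System.out.println(shouldRequestReconnect(10L, 10L)); // in sync: false
        System.out.println(shouldRequestReconnect(8L, 10L));  // stale heartbeat: false
    }
}
```

A false positive in the small window mentioned above would only cause one extra reconnect, which is why a simple comparison is tolerable here.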
> When cluster coordinator dies, other nodes may have trouble rejoining cluster
> -----------------------------------------------------------------------------
>
> Key: NIFI-7866
> URL: https://issues.apache.org/jira/browse/NIFI-7866
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Fix For: 1.13.0
>
>
> When the cluster coordinator is lost, the nodes must now begin communicating
> with a newly elected Cluster Coordinator. This is handled through the
> StandardFlowService.
> When the `handleReconnectionRequest` method is called and the request
> provided does not contain the dataflow, the node is to connect to the cluster
> coordinator and request the dataflow:
> {code:java}
> private void handleReconnectionRequest(final ReconnectionRequestMessage request) {
>     try {
>         logger.info("Processing reconnection request from cluster coordinator.");
>         // reconnect
>         ConnectionResponse connectionResponse = new ConnectionResponse(getNodeId(), request.getDataFlow(),
>             request.getInstanceId(), request.getNodeConnectionStatuses(), request.getComponentRevisions());
>         if (connectionResponse.getDataFlow() == null) {
>             logger.info("Received a Reconnection Request that contained no DataFlow. Will attempt to connect to cluster using local flow.");
>             connectionResponse = connect(false, false, createDataFlowFromController());
>         }
>         loadFromConnectionResponse(connectionResponse);
>         ... {code}
> However, if the call above to `connect(false, false,
> createDataFlowFromController())` returns null (which is a valid case), that
> null value is passed along to `loadFromConnectionResponse`. This method
> expects a non-null connectionResponse and throws a NullPointerException,
> resulting in the following stack trace (stack trace based on NiFi 1.11.4):
> {code:java}
> 2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
> org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
>     at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
>     at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
>     at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
>     at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException: null
>     at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
>     ... 4 common frames omitted
> {code}
> This results in the node not reconnecting to the cluster.
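
One way to avoid the failure described above is to check for a null response before loading it. The following is a minimal, self-contained sketch of that idea, not the change actually made in the commit; `Response`, `connect`, and `handleReconnection` are hypothetical stand-ins for the NiFi types and methods quoted in the issue.

```java
// Hypothetical sketch: simplified stand-ins, NOT the real StandardFlowService code.
final class ReconnectNullGuardSketch {

    /** Stand-in for ConnectionResponse; only carries the (possibly null) dataflow. */
    static final class Response {
        final String dataFlow;

        Response(final String dataFlow) {
            this.dataFlow = dataFlow;
        }
    }

    /** Stand-in for connect(...): returns null when no connection could be established. */
    static Response connect(final boolean succeed) {
        return succeed ? new Response("local-flow") : null;
    }

    /** Mirrors the shape of handleReconnectionRequest, with an explicit null check added. */
    static String handleReconnection(final Response fromRequest, final boolean connectSucceeds) {
        Response response = fromRequest;
        if (response == null || response.dataFlow == null) {
            // The request carried no dataflow, so try to connect using the local flow.
            response = connect(connectSucceeds);
        }
        if (response == null) {
            // Guard: report the failed connection instead of dereferencing null.
            return "not connected: no response from cluster coordinator";
        }
        return "loaded flow: " + response.dataFlow;
    }

    public static void main(final String[] args) {
        System.out.println(handleReconnection(new Response(null), false));
        System.out.println(handleReconnection(new Response(null), true));
        System.out.println(handleReconnection(new Response("flow-from-request"), false));
    }
}
```

With a guard like this, a null result from the connect call produces a clear failure message (or a retry, depending on the caller) rather than a NullPointerException deep inside the load step.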