[ 
https://issues.apache.org/jira/browse/NIFI-13885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17954774#comment-17954774
 ] 

Shubham Sharma commented on NIFI-13885:
---------------------------------------

Yes, this could be an issue with Xodus as well. While Xodus generally offers 
better performance, it may lead to reconnection issues under similar use cases 
or scenarios. In some cases, Xodus might throw exceptions—such as the one 
reported in NIFI-12468. However, to understand the root cause more thoroughly, 
the community needs additional data, including diagnostic outputs like 
bin/nifi.sh dump <SOMEFILE> and relevant event logs. This information is 
crucial to further investigate what’s going wrong.

> NiFi revision manager and heartbeat validation conflicts leading to 
> reconnection request
> ----------------------------------------------------------------------------------------
>
>                 Key: NIFI-13885
>                 URL: https://issues.apache.org/jira/browse/NIFI-13885
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.23.0, 1.24.0, 1.23.2
>            Reporter: Shubham Sharma
>            Priority: Critical
>         Attachments: hkl25073070_jstack.tar.gz, hkl25073177_jstack.tar.gz
>
>
> With a 5 node NiFi cluster deployment we observed a critical issue where 
> reconnection requests are getting generated frequently due to difference in 
> local coordinator's and given node's revisionUpdateCounter, affecting 
> workflow lifecycle.
> *User Log summary*
> *User request to update controller service on random node(test01)*
> {code:java}
> 2024-10-14 18:44:55,927 INFO [NiFi Web Server-151009] 
> o.a.n.w.s.NiFiAuthenticationFilter Authentication Success [test-app] 
> 10.10.21.9 PUT 
> https://test01.dvl:9443/nifi-api/controller-services/7f2fe434-8d01-3d6e-83f5-87d47e6e9598{code}
>  
> *User request forwarded immediately to coordinator(test02)*
>  
> {code:java}
>   2024-10-14 18:44:55,929 INFO [NiFi Web Server-1414436] 
> o.a.n.w.s.NiFiAuthenticationFilter Authentication Success [test-app] 
> 10.10.21.9 PUT 
> https://test02:9443/nifi-api/controller-services/7f2fe434-8d01-3d6e-83f5-87d47e6e9598
>     {code}
>  
> *Delay to complete request as shown here in coordinator (test02)*
> {code:java}
> 2024-10-14 18:46:17,952 INFO [NiFi Web Server-1414436] 
> o.a.n.w.s.NiFiAuthenticationFilter Authentication Success [test-app] 
> 10.10.21.9 GET 
> https://test02:9443/nifi-api/controller-services/7f2fe434-8d01-3d6e-83f5-87d47e6e9598{code}
>  
> *App Log analysis*
> Added a log statement at the end of updateRevision function in 
> NaiveRevisionManager for printing updated value of revisionUpdateCounter.
>  
> *Coordinator app log summary*
> On main node test01 it took around 48 seconds to process above request for 
> updating disabled controller service via restAPI and update 
> revisionUpdateCounter to *3081375*
>  
> *2024-10-14 18:45:43,455* INFO [NiFi Web Server-151009] 
> o.a.n.web.revision.NaiveRevisionManager ---- Updated revision using [[61, 
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionList [[61, 
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionUpdateCounter 
> *3081375*
>  
> *Coordinator (test02)received its own heartbeat*
>  
> *2024-10-14 18:45:47,143* INFO [Process Cluster Protocol Request-62] 
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor ----- Received new heartbeat from 
> test02:9443, revisionId {*}3081374{*}, activethreadCount 79 , 
> connectionStatus CONNECTED -----
>  
>  *Received test01 node's heartbeat request to coordinator BUT with updated 
> revisionID*
> *2024-10-14 18:45:57,724* INFO [Process Cluster Protocol Request-91] 
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor ----- Received new heartbeat from 
> test01:9443, revisionId {*}3081375{*}, activethreadCount 73 , 
> connectionStatus CONNECTED -----
>  
> *Scheduled Heartbeat validation gets triggered but due to above mismatch it 
> requested reconnection for test01*
> *2024-10-14 18:46:07,654* WARN [Heartbeat Monitor Thread-1] 
> o.a.n.c.c.node.NodeClusterCoordinator ----- Requesting that {*}test01{*}:9443 
> reconnect to the cluster due to: Node has a Revision Update Count of 
> *3081375* but local value is only {*}3081374{*}. Node appears not to have the 
> appropriate set of Component Revisions -----
>  
> *During reconnection request on test01 node, ie. going under reconnection and 
> synchronisation process, finally coordinator received updated revision 
> counter*
> *2024-10-14 18:46:11,238* INFO [NiFi Web Server-1414436] 
> o.a.n.web.revision.NaiveRevisionManager ---- Updated revision using [[61, 
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionList [[61, 
> ${clientid}, 7f2fe434-8d01-3d6e-83f5-87d47e6e9598]], revisionUpdateCounter 
> *3081375*
>  
> *And next heartbeat* *sent by coordinator (test02) to itself* *containing 
> updated revision counter* 
> *2024-10-14 18:46:17,213* INFO [Process Cluster Protocol Request-2] 
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor ----- Received new heartbeat from 
> {*}test02{*}:9443, revisionId {*}3081375{*}, activethreadCount 111 , 
> connectionStatus CONNECTED -----
>  
> This reconnection attempts is happening frequently and affecting workflow 
> request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to