Kris20030907 opened a new issue, #9290:
URL: https://github.com/apache/rocketmq/issues/9290

   ### Before Creating the Enhancement Request
   
   - [x] I have confirmed that this should be classified as an enhancement 
rather than a bug/feature.
   
   
   ### Summary
   
   Solution: Add a configurable switch to the Client that enables health 
detection for Namesrv and records the number of consecutive failures. When the 
count reaches the configured threshold, the Client actively calls 
NettyRemotingClient to close the channel of the currently selected Namesrv, so 
that a hung channel is not kept for a long time and the connection is 
refreshed. If the network of one Namesrv is abnormal, the Client can then 
reconnect to another available Namesrv in the Namesrv list and continue to work 
normally.
   
   ### Motivation
   
   Problem scenario: After the Client (Producer or Consumer) establishes a 
connection with Namesrv, it periodically pulls TopicRouteInfo from Namesrv. 
However, if the machine hosting Namesrv loses network connectivity (at the TCP 
level), the network is unstable for a long time, or a firewall jitters, the 
application layer cannot detect that the channel is abnormal, so the channel is 
not closed. Every request sent over this connection then times out and fails 
to obtain data.
   
   ### Describe the Solution You'd Like
   
   1. Add a client-level switch (such as enableNamesrvCheck) to turn on or off 
the Namesrv health check function.
   2. After the scheduled task calls updateTopicRouteInfoFromNameServer(), 
add the following logic:
       - If the call succeeds, reset the consecutive failure count 
(namesrvHealthCheckFailCount) to 0.
       - If the call throws an exception, increment the failure count. When the 
count reaches the threshold returned by 
clientConfig.getMaxClientNamesrvCheckFailedCnt(), actively call 
NettyRemotingClient.closeUnHealthyNamesrvChannel() to close the currently 
selected Namesrv channel and reset the failure count to 0.
   3. Only consecutive failures trigger the disconnection. The failure count 
is reset to zero as soon as Namesrv returns to normal, so occasional failures 
do not trigger the disconnection logic by mistake.
   4. In addition, once the unhealthy channel is closed, the Client can 
reconnect to another available node in the Namesrv list and refresh the routing 
information from it, which improves the availability of the overall system.
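
The counting logic in steps 1 to 3 can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the class NamesrvHealthCheck and its members are hypothetical, and the two Runnable fields only stand in for updateTopicRouteInfoFromNameServer() and NettyRemotingClient.closeUnHealthyNamesrvChannel(), whose real signatures are not shown in this issue.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of the consecutive-failure counter; not RocketMQ code. */
final class NamesrvHealthCheck {
    private final boolean enabled;       // stands in for the enableNamesrvCheck switch
    private final int maxFailedCount;    // stands in for getMaxClientNamesrvCheckFailedCnt()
    private final Runnable routeUpdate;  // stands in for updateTopicRouteInfoFromNameServer()
    private final Runnable closeChannel; // stands in for closeUnHealthyNamesrvChannel()
    private final AtomicInteger failCount = new AtomicInteger();

    NamesrvHealthCheck(boolean enabled, int maxFailedCount,
                       Runnable routeUpdate, Runnable closeChannel) {
        this.enabled = enabled;
        this.maxFailedCount = maxFailedCount;
        this.routeUpdate = routeUpdate;
        this.closeChannel = closeChannel;
    }

    /** One run of the scheduled task. Returns true if the channel was closed in this run. */
    boolean runOnce() {
        try {
            routeUpdate.run();
            failCount.set(0); // any success resets the consecutive-failure count
            return false;
        } catch (RuntimeException e) {
            if (!enabled) {
                return false; // switch is off: keep the existing behaviour
            }
            if (failCount.incrementAndGet() >= maxFailedCount) {
                closeChannel.run(); // drop the possibly hung channel
                failCount.set(0);   // start counting again for the next connection
                return true;
            }
            return false;
        }
    }

    int failCount() {
        return failCount.get();
    }
}
```

With a threshold of 3, two failures followed by one success leave the count at 0, and only three failures in a row close the channel.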
   
   ### Describe Alternatives You've Considered
   
   1. Rely only on the existing IdleStateHandler and TCP-layer SO_KEEPALIVE: 
**the default OS keepalive interval is too long**, and IdleStateHandler cannot 
directly detect a TCP disconnection, so these mechanisms alone cannot detect 
the failure in time.
   2. Force the use of short-timeout RPC calls to detect anomalies: this can 
detect anomalies to some extent, but it may increase the misjudgment rate 
under normal network fluctuations and lower the success rate of business 
requests.
   3. Use TCP_USER_TIMEOUT: this parameter can detect a network disconnection 
faster, but it is not available or is inconvenient to modify in the current 
environment, and it may introduce cross-platform compatibility issues.

   Therefore, adding a health detection switch, recording the number of 
consecutive failures, and then actively disconnecting unhealthy Namesrv 
connections is a more reliable and flexible solution.
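
To put a rough number on the keepalive point in alternative 1, the sketch below estimates how long a dead peer can go unnoticed with SO_KEEPALIVE alone. The helper class is hypothetical, and the values used in the note after it are the commonly documented Linux defaults (tcp_keepalive_time = 7200 s, tcp_keepalive_intvl = 75 s, tcp_keepalive_probes = 9), which is an assumption about the deployment environment.

```java
/** Hypothetical helper for a back-of-the-envelope estimate; not part of RocketMQ. */
final class KeepaliveEstimate {
    /**
     * Approximate time before the OS declares an idle connection dead:
     * the idle period before the first probe, plus one interval per unanswered probe.
     */
    static long worstCaseDetectionSeconds(long idleSeconds, long intervalSeconds, int probes) {
        return idleSeconds + intervalSeconds * probes;
    }
}
```

With those assumed defaults, worstCaseDetectionSeconds(7200, 75, 9) gives 7875 seconds, a little over two hours, which is far longer than a route-refresh task can reasonably wait.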
   
   ### Additional Context
   
   This improvement has been verified through local testing. While Namesrv is 
abnormal, the Client accurately records the number of consecutive failures and 
actively disconnects the unhealthy channel once the configured threshold is 
reached. When Namesrv recovers, the scheduled task resets the failure count and 
continues to update the routing information as usual.

