zhuyuemufeng opened a new issue, #7927:
URL: https://github.com/apache/rocketmq/issues/7927

   ### Before Creating the Bug Report
   
   - [X] I found a bug, not just asking a question, which should be created in 
[GitHub Discussions](https://github.com/apache/rocketmq/discussions).
   
   - [X] I have searched the [GitHub 
Issues](https://github.com/apache/rocketmq/issues) and [GitHub 
Discussions](https://github.com/apache/rocketmq/discussions)  of this 
repository and believe that this is not a duplicate.
   
   - [X] I have confirmed that this bug belongs to the current repository, not 
other repositories of RocketMQ.
   
   
   ### Runtime platform environment
   
   linux
   
   ### RocketMQ version
   
   5.1.x
   
   ### JDK Version
   
   jdk 1.8
   
   ### Describe the Bug
   
   When the enableSlaveActingMaster switch is turned on and a master node goes 
down, the slave node attempts to deliver scheduled messages to other master 
nodes with a maximum of four retries. I find this retry mechanism somewhat 
unreasonable. For instance, if there's a temporary network interruption causing 
the remote master node to be temporarily unreachable, it may take up to eight 
retries for messages to select another available master node. During this 
process, some messages may be lost.
   Code:
   
![image](https://github.com/apache/rocketmq/assets/51144340/d2c8eaf9-35a7-4beb-9d5c-fcbc0c7f65d0)
   
![image](https://github.com/apache/rocketmq/assets/51144340/59e213be-4080-4614-afec-496aeb14fa55)
   My approach is to keep looping until a remote delivery succeeds. This 
ensures that no messages are lost, as I believe the severity of message loss 
outweighs the inconvenience of temporarily blocked delivery.
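
A minimal sketch of the looping approach, to make the proposal concrete. This is not RocketMQ's actual code: the class `RemoteDeliveryRetry`, the method `deliverUntilSuccess`, and the `tryDeliver` callback (a stand-in for the remote put, returning true on success) are all hypothetical names, and the backoff between rounds is an assumption added so that the loop does not spin while every master is unreachable.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; names and structure are assumptions, not RocketMQ code.
final class RemoteDeliveryRetry {

    /**
     * Cycles through the candidate masters until one accepts the message.
     * Returns the name of the master that accepted it.
     */
    static String deliverUntilSuccess(List<String> masters,
                                      Predicate<String> tryDeliver,
                                      long initialBackoffMs,
                                      long maxBackoffMs) {
        if (masters.isEmpty()) {
            throw new IllegalArgumentException("no candidate masters");
        }
        long backoff = initialBackoffMs;
        while (true) {
            for (String master : masters) {
                if (tryDeliver.test(master)) {
                    return master; // delivered, stop retrying
                }
            }
            // Every candidate failed in this round: wait, then try again.
            try {
                Thread.sleep(backoff);
            } catch (InterruptedException e) {
                // Restore the flag so a shutdown request is not swallowed.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while retrying", e);
            }
            // Double the wait, capped at maxBackoffMs.
            backoff = Math.min(maxBackoffMs, Math.max(1, backoff * 2));
        }
    }
}
```

The trade-off is the one described above: the delivery thread blocks for as long as no master is reachable, but the message is never dropped. Interrupting the thread is the only way out of the loop in this sketch.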
   
   ### Steps to Reproduce
   
   1. Set up a cluster with 3 masters and 3 slaves, and enable the 
enableSlaveActingMaster feature.
   2. Send 100 scheduled messages to the cluster, with delivery times between 
1 and 3 minutes.
   3. Start consumption, and during the consumption process, shut down one of 
the master nodes.
   4. While a slave is delivering scheduled messages, disconnect the network 
connection to a specific master for a period of time, then restore it.
   Comparing the sent messages with the consumed messages may then show 
message loss and Broker errors.
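
The comparison in the last step can be done as a set difference over message keys. The sketch below assumes each message carries a unique key that the producer records on send and the consumer records on receipt; `DeliveryAudit` and `missingKeys` are hypothetical helper names, not part of RocketMQ.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for the reproduction; not part of RocketMQ.
final class DeliveryAudit {

    /**
     * Returns the keys that were sent but never consumed, in send order.
     * Duplicate consumption is ignored; only absence counts as loss.
     */
    static Set<String> missingKeys(Collection<String> sentKeys,
                                   Collection<String> consumedKeys) {
        Set<String> missing = new LinkedHashSet<>(sentKeys);
        missing.removeAll(consumedKeys);
        return missing;
    }
}
```

A non-empty result after all delivery times have passed indicates lost messages.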
   
   ### What Did You Expect to See?
   
   If remote delivery fails, continue looping until a viable node is found.
   
   ### What Did You See Instead?
   
   PUT_NEED_RETRY is returned.
   
   ### Additional Context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@rocketmq.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
