[I] [Bug] The retry num for wheelTimer message escape may result in message loss. [rocketmq]

via GitHub Thu, 14 Mar 2024 23:43:11 -0700


zhuyuemufeng opened a new issue, #7927:
URL: https://github.com/apache/rocketmq/issues/7927

### Before Creating the Bug Report

- [X] I found a bug, not just asking a question, which should be created in
[GitHub Discussions](https://github.com/apache/rocketmq/discussions).

- [X] I have searched the [GitHub
Issues](https://github.com/apache/rocketmq/issues) and [GitHub
Discussions](https://github.com/apache/rocketmq/discussions) of this
repository and believe that this is not a duplicate.

- [X] I have confirmed that this bug belongs to the current repository, not
other repositories of RocketMQ.

### Runtime platform environment

linux

### RocketMQ version

5.1.x

### JDK Version

jdk 1.8

### Describe the Bug

When the enableSlaveActingMaster switch is turned on and a master node goes
down, the slave node attempts to deliver scheduled messages to other master
nodes, with a maximum of four retries. I find this retry mechanism somewhat
unreasonable. For instance, if a temporary network interruption makes the
remote master node unreachable, it may take up to eight retries before a
message is routed to another available master node. During this process, some
messages may be lost.

Code:

![image](https://github.com/apache/rocketmq/assets/51144340/d2c8eaf9-35a7-4beb-9d5c-fcbc0c7f65d0)

![image](https://github.com/apache/rocketmq/assets/51144340/59e213be-4080-4614-afec-496aeb14fa55)
My approach is to keep looping until a remote delivery succeeds. This ensures
that no messages are lost, as I believe the severity of message loss outweighs
the inconvenience of temporarily blocked delivery.
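
To make the proposal concrete, here is a minimal sketch of a loop that keeps trying until some master accepts the message. It is not the RocketMQ implementation: the class name, the `deliverUntilSuccess` method, the `send` callback, and the fixed back-off are all hypothetical and only illustrate the idea.

```java
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch, not RocketMQ code: retry remote delivery until it succeeds.
class EscapeRetrySketch {

    /**
     * Cycles through the candidate masters until one accepts the message.
     * `send` is an assumed callback that returns true when delivery succeeds.
     * Returns the number of delivery attempts that were made.
     */
    static <M> int deliverUntilSuccess(M message,
                                       List<String> masters,
                                       BiPredicate<String, M> send,
                                       long backoffMillis) {
        if (masters.isEmpty()) {
            throw new IllegalArgumentException("no candidate masters");
        }
        int attempts = 0;
        while (true) {
            for (String master : masters) {
                attempts++;
                if (send.test(master, message)) {
                    return attempts; // delivered, stop looping
                }
            }
            try {
                // Pause between rounds so an unreachable cluster is not hammered.
                Thread.sleep(backoffMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while retrying delivery", e);
            }
        }
    }
}
```

The trade-off is the one described above: delivery of later messages is blocked while the loop runs, but a message is never dropped because a retry budget ran out.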

### Steps to Reproduce

1. Set up a cluster with 3 masters and 3 slaves, and enable the
enableSlaveActingMaster feature.
2. Send 100 scheduled messages to the cluster, with delay times ranging from
1 to 3 minutes.
3. Start consumption, and during the consumption process, shut down one of
the master nodes.
4. While a slave is delivering scheduled messages, disconnect the network
connection to a specific master for a period of time, then restore it.

By comparing the sent messages with the consumed messages, you may encounter
message loss and Broker errors.
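
The comparison in the last step can be done with a simple set difference over message IDs. The sketch below assumes the test harness records the IDs of sent and consumed messages; the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for the reproduction: find messages that were sent but never consumed.
class MessageLossCheck {

    /** Returns the IDs present in sentIds but absent from consumedIds, in sorted order. */
    static Set<String> missing(Set<String> sentIds, Set<String> consumedIds) {
        Set<String> lost = new TreeSet<>(sentIds); // copy so the input set is not modified
        lost.removeAll(consumedIds);
        return lost;
    }
}
```

A non-empty result after all delay times have elapsed indicates lost messages.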

### What Did You Expect to See?

If remote delivery fails, continue looping until a viable node is found.

### What Did You See Instead?

return PUT_NEED_RETRY

### Additional Context

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@rocketmq.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
