[
https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913204#comment-16913204
]
Sergey Chugunov commented on IGNITE-10808:
------------------------------------------
[~dmekhanikov],
I reviewed your change one more time, and it looks good to me now.
I've triggered TC once again to make sure the latest refactoring didn't
introduce any problems. If the run is green, we are good to merge the change.
Thank you for your efforts!
> Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
> --------------------------------------------------------------------------
>
> Key: IGNITE-10808
> URL: https://issues.apache.org/jira/browse/IGNITE-10808
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.7
> Reporter: Stanislav Lukyanov
> Assignee: Denis Mekhanikov
> Priority: Major
> Labels: discovery
> Fix For: 2.8
>
> Attachments: IgniteMetricsOverflowTest.java
>
>
> A node receives a new metrics update message every `metricsUpdateFrequency`
> milliseconds, and the message will be put at the top of the queue (because it
> is a high priority message).
> If processing one message takes longer than `metricsUpdateFrequency`, then
> multiple `TcpDiscoveryMetricsUpdateMessage` instances will be in the queue.
> A long enough delay (e.g. caused by a network glitch or GC) may lead to the
> queue accumulating tens of metrics update messages that are essentially
> useless to process. Finally, if processing a message on average takes a little more
> than `metricsUpdateFrequency` (even for a relatively short period of time,
> say, for a minute due to network issues) then the message worker will end up
> processing only the metrics updates and the cluster will essentially hang.
> A reproducer is attached. In the test, the queue first builds up and is then
> torn down very slowly, causing "Failed to wait for PME" messages.
> ServerImpl's SocketReader needs to be changed so that it does not put another
> metrics update message at the top of the queue if the queue already has one
> (or so that it replaces the one at the top with the new one).
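
The second option in the description (replacing the metrics update at the top of the queue with the newer one) could be sketched roughly as follows. This is only an illustration under stated assumptions: `SketchMessage` and `MessageQueueSketch` are hypothetical names invented for this example. They are not Ignite classes, and the code does not reflect how ServerImpl or the actual fix is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a discovery message; not an Ignite class. */
class SketchMessage {
    final long id;
    /** True for a metrics update, which this sketch treats as high priority. */
    final boolean metricsUpdate;

    SketchMessage(long id, boolean metricsUpdate) {
        this.id = id;
        this.metricsUpdate = metricsUpdate;
    }
}

/** Hypothetical queue that keeps at most one metrics update at its head. */
class MessageQueueSketch {
    private final Deque<SketchMessage> queue = new ArrayDeque<>();

    /** Adds a message; a metrics update replaces a stale one at the head. */
    synchronized void add(SketchMessage msg) {
        if (!msg.metricsUpdate) {
            // Regular messages are appended to the tail as usual.
            queue.addLast(msg);
            return;
        }

        SketchMessage head = queue.peekFirst();

        // Drop the older metrics update instead of stacking a new one on top.
        if (head != null && head.metricsUpdate)
            queue.pollFirst();

        queue.addFirst(msg);
    }

    /** Returns and removes the next message to process, or null if empty. */
    synchronized SketchMessage poll() {
        return queue.pollFirst();
    }

    synchronized int size() {
        return queue.size();
    }
}
```

With this policy, however long the worker stalls, the queue holds at most one metrics update, so the backlog of regular messages is not starved by stale updates.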
--
This message was sent by Atlassian Jira
(v8.3.2#803003)