Roman Puchkovskiy created IGNITE-20914:
------------------------------------------

             Summary: Make ScaleCube's metadataTimeout configurable
                 Key: IGNITE-20914
                 URL: https://issues.apache.org/jira/browse/IGNITE-20914
             Project: Ignite
          Issue Type: Improvement
            Reporter: Roman Puchkovskiy
             Fix For: 3.0.0-beta2


ScaleCube's MembershipProtocolImpl periodically fetches each node's metadata 
(using GetMetaDataRequest). If it does not get a response before metadataTimeout 
expires, it seems to treat the node as no longer alive and generates a 
REMOVED event:

[2023-11-17T00:20:22,153][WARN ][sc-cluster-3345-2][MembershipProtocol] 
[default:sqllogic1:1ca7b2f5308489d@10.233.107.205:3345][updateMembership][MEMBERSHIP_GOSSIP]
 Skipping to add/update member: {m: 
default:sqllogic0:6a78c57fcd0a496d@10.233.107.205:3344, s: ALIVE, inc: 9}, due 
to failed fetchMetadata call (cause: java.util.concurrent.TimeoutException: Did 
not observe any item or terminal signal within 1000ms in 'source(MonoDefer)' 
(and no fallback has been configured))

[2023-11-17T00:20:29,189][INFO ][sc-cluster-3345-2][MembershipProtocol] 
[default:sqllogic1:1ca7b2f5308489d@10.233.107.205:3345] Member left without 
notification: default:sqllogic0:6a78c57fcd0a496d@10.233.107.205:3344
[2023-11-17T00:20:29,190][INFO ][sc-cluster-3345-2][MembershipProtocol] 
[default:sqllogic1:1ca7b2f5308489d@10.233.107.205:3345][publishEvent] 
MembershipEvent[type=REMOVED, 
member=default:sqllogic0:6a78c57fcd0a496d@10.233.107.205:3344, 
oldMetadata=1e61c6c8-154, newMetadata=null, timestamp=2023-11-17T00:20:29.189Z]

We should avoid this. It seems that the 1 second timeout seen in the log above 
(1000ms) might be too short for a node under load.
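
To illustrate the failure mode, here is a minimal, self-contained Java sketch. 
It is not ScaleCube code: the log above suggests ScaleCube applies the timeout 
through a Reactor Mono, while this sketch swaps in plain CompletableFuture to 
stay dependency-free. The class and method names are hypothetical, and the 
delays are scaled down so it runs quickly.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustration only; all names here are hypothetical, not ScaleCube's. */
final class MetadataFetchSketch {
    /** Simulates a metadata response that takes responseDelayMs to arrive. */
    static CompletableFuture<String> fetchMetadata(long responseDelayMs) {
        return CompletableFuture.supplyAsync(
                () -> "metadata",
                CompletableFuture.delayedExecutor(responseDelayMs, TimeUnit.MILLISECONDS));
    }

    /** Returns true if the response arrives before timeoutMs elapses. */
    static boolean respondsInTime(long responseDelayMs, long timeoutMs) {
        try {
            fetchMetadata(responseDelayMs)
                    .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
                    .join();
            return true;
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                // At this point a membership protocol could decide the member is gone.
                return false;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        // A loaded node answering in 300 ms, checked against two timeouts.
        System.out.println(respondsInTime(300, 100));   // timeout shorter than the delay
        System.out.println(respondsInTime(300, 2_000)); // timeout longer than the delay
    }
}
```

The same slow response is treated as a failure or a success depending only on 
the timeout, which is why a value tuned for an idle node can misfire under load.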

We should make this configurable via Ignite configuration.

Also, it probably makes sense to set a higher default (like 10 seconds). The 
reason is that, if the timeout expires, the node is removed from the physical 
topology and cannot rejoin without a restart (this is what our connection 
establishment protocol requires), so this timeout is critical for the 
stability of Ignite (while it is probably not critical for an average 
ScaleCube-based application).
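
A rough Java sketch of what "configurable with a higher default" could look 
like. Everything here is an assumption made for illustration: the class name, 
the constant and the resolve method are hypothetical and do not reflect the 
actual Ignite configuration schema. The resolved value would presumably be 
handed to ScaleCube when the cluster is built; that wiring is omitted.

```java
/** Hypothetical holder for the timeout; not the real Ignite configuration schema. */
final class MetadataTimeoutSetting {
    /** Default proposed in this issue: 10 seconds. */
    static final int DEFAULT_METADATA_TIMEOUT_MS = 10_000;

    /** Returns the configured value, or the default when nothing is configured. */
    static int resolve(Integer configuredMs) {
        if (configuredMs == null) {
            return DEFAULT_METADATA_TIMEOUT_MS;
        }
        if (configuredMs <= 0) {
            throw new IllegalArgumentException(
                    "metadataTimeout must be positive, got " + configuredMs);
        }
        return configuredMs;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));   // no user value: falls back to the default
        System.out.println(resolve(30_000)); // user value wins
    }
}
```

Rejecting non-positive values up front is a design choice in this sketch: a 
zero or negative timeout would make every fetch look like a failed node.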



