Mikhail Petrov created IGNITE-14301:
---------------------------------------

             Summary: Authentication processor can hang all user management 
operation after server node reconnect
                 Key: IGNITE-14301
                 URL: https://issues.apache.org/jira/browse/IGNITE-14301
             Project: Ignite
          Issue Type: Bug
            Reporter: Mikhail Petrov


First of all, look at the test
AuthenticationProcessorNodeRestartTest#testConcurrentAddUpdateRemoveNodeRestartServer
 - [TC history|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-8873434544416175780&tab=testDetails]

The first problem with this test is that user management operations
(add/update/remove) create too many discovery messages. As a result, the
discovery custom message history is too small to properly skip duplicated custom
messages that can be sent across the ring during a server node reconnect. This
leads to test failures caused by duplicated user management operations (see
GridDiscoveryManager#discoCacheHist, the IGNITE_DISCOVERY_HISTORY_SIZE system
property, and ServerImpl.RingMessageWorker#sendMessageAcrossRing).
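
The effect of a history that is too small can be illustrated with a minimal sketch. This is not Ignite code: BoundedMessageHistory, its capacity parameter, and the numeric message ids are hypothetical names chosen for illustration, and the eviction policy (drop the oldest id) is an assumption. It only shows that once an id has been evicted from a bounded history, a replay of that message is no longer recognized as a duplicate.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of a bounded "already seen" message history; not Ignite code. */
class BoundedMessageHistory {
    /** Maximum number of message ids remembered (assumed oldest-first eviction). */
    private final int capacity;

    /** Ids in arrival order, used to find the oldest entry to evict. */
    private final Deque<Long> order = new ArrayDeque<>();

    /** Ids currently remembered. */
    private final Set<Long> seen = new HashSet<>();

    BoundedMessageHistory(int capacity) {
        this.capacity = capacity;
    }

    /** Returns true if the message looks new and would be processed, false if it is a known duplicate. */
    boolean register(long msgId) {
        if (seen.contains(msgId))
            return false;

        seen.add(msgId);
        order.addLast(msgId);

        // Forget the oldest id once the history is over capacity.
        if (order.size() > capacity)
            seen.remove(order.removeFirst());

        return true;
    }

    /** Registers ids 1..total, then replays id 1 and reports whether the replay is processed again. */
    static boolean replayOfFirstMessageIsProcessed(int capacity, int total) {
        BoundedMessageHistory hist = new BoundedMessageHistory(capacity);

        for (long id = 1; id <= total; id++)
            hist.register(id);

        return hist.register(1L);
    }

    public static void main(String[] args) {
        // History smaller than the message burst: the replayed message is handled a second time.
        System.out.println(replayOfFirstMessageIsProcessed(4, 10));

        // History larger than the message burst: the replayed message is skipped.
        System.out.println(replayOfFirstMessageIsProcessed(16, 10));
    }
}
```

In this model the first call reports that the duplicate is processed again, which corresponds to the duplicated user management operations seen in the test; the second call reports that it is skipped.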

If the discovery history size is increased significantly, the test stops
failing and starts hanging. The steps that lead to the hang:
 1. A client node sends UserProposedMessage across the ring while one node is
offline due to a reconnect.
 2. Alive server nodes update their local user lists and finish the operation. 
 3. Reconnected node joins the ring and receives an updated user list from the 
coordinator.
 4. The reconnected node receives a duplicated UserProposedMessage that has
already been handled by all nodes, handles it, sends
UserManagementOperationFinishedMessage to the coordinator, and starts waiting
for the UserAcceptedMessage from it. But the coordinator has already finished
this operation, so the thread responsible for user management operations on the
reconnected node becomes blocked (see
IgniteAuthenticationProcessor.UserOperationWorker#body).
 5. The client node starts the next operation, which needs all alive nodes to
respond with UserManagementOperationFinishedMessage. But the authentication
thread of the reconnected node is blocked, so this operation can never complete.
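
The blocking in steps 4 and 5 can be modelled with a small sketch. It is a simplified, single-process illustration and not the Ignite implementation: DuplicateProposalHang, its methods, and the string operation ids are hypothetical, and the assumption that the coordinator silently drops a "finished" report for an operation it no longer tracks is inferred from the description above. In the scenario above the wait never ends; the timeout is used here only so the sketch terminates.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the hang; class and method names are illustrative, not Ignite's. */
class DuplicateProposalHang {
    /** Coordinator-side set of operations that are still in progress. */
    private final Set<String> activeOps = ConcurrentHashMap.newKeySet();

    /** Coordinator starts tracking an operation. */
    void startOperation(String opId) {
        activeOps.add(opId);
    }

    /** Coordinator completes an operation and forgets it. */
    void finishOperation(String opId) {
        activeOps.remove(opId);
    }

    /** Coordinator handles a "finished" report; an accept is sent only for an operation it still tracks. */
    private void onFinishedReport(String opId, CountDownLatch accepted) {
        if (activeOps.contains(opId))
            accepted.countDown();
        // Otherwise the report is dropped (assumption) and no accept is ever sent.
    }

    /**
     * Node-side handling of a proposal: report "finished", then wait for the accept.
     * Returns true if the accept arrived within the timeout, false if the wait expired.
     */
    boolean handleProposal(String opId, long timeoutMs) {
        CountDownLatch accepted = new CountDownLatch(1);

        onFinishedReport(opId, accepted);

        try {
            return accepted.await(timeoutMs, TimeUnit.MILLISECONDS);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();

            return false;
        }
    }

    public static void main(String[] args) {
        DuplicateProposalHang cluster = new DuplicateProposalHang();

        cluster.startOperation("op-1");
        System.out.println(cluster.handleProposal("op-1", 200)); // Accept arrives for a live operation.

        cluster.finishOperation("op-1");
        System.out.println(cluster.handleProposal("op-1", 200)); // Duplicate proposal: the wait expires.
    }
}
```

Under these assumptions, the second call is the analogue of the reconnected node handling the duplicated UserProposedMessage: without the timeout, the worker thread would stay blocked and could not take part in any later operation.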



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
