[ https://issues.apache.org/jira/browse/GEODE-10017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486856#comment-17486856 ]
Mario Salazar de Torres edited comment on GEODE-10017 at 2/4/22, 8:46 AM:
--------------------------------------------------------------------------
After some digging, I noticed that for the executions on which one of these TCs gets stuck, the GFSH process running the "start server ..." command never returns. So I checked why GFSH was not returning, and I saw that it was stuck checking the server state, which was always reported as not responding, even when there was a server instance running (as I could verify by connecting to the cluster and running "list members", by listing the processes in the test container, and by looking at the logs).

So, going through all the possible scenarios that could cause ServerState to show up as "Not responding", I noticed that there was no PID file inside the recently started server's working directory, and this was causing the whole issue. Now, my theory for why that happens is the following:
# The first server instance is told to stop.
# The first server's cache is closed, and the GFSH process stopping the server exits, returning control to the test process.
# Since, from the point of view of the test, the server has already been stopped, it runs the startup for the new server.
# The new server instance writes the PID file with its PID.
# The old server instance, which was still running, deletes its PID file along with some other files, and only then actually terminates its process.

The time gap between steps 4 and 5 is normally really tight, which would explain why this issue only reproduces sometimes, and mostly on overloaded systems.
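To make the suspected interleaving concrete, here is a minimal, self-contained C++ sketch of the race. It is *not* Geode's shutdown code: the directory layout, the PID value and the 100 ms delay are illustrative assumptions; only the PID file name (vf.gf.server.pid) mirrors the one a Geode server writes.

{code:c++}
#include <chrono>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <thread>

namespace fs = std::filesystem;

int main() {
  // Both server instances share the same working directory, hence the
  // same PID file path.
  const fs::path dir{"server"};
  const fs::path pid_file = dir / "vf.gf.server.pid";
  fs::create_directories(dir);

  // Old instance: gfsh already returned (step 2), but the process is
  // still running its shutdown cleanup in the background.
  std::thread old_server([&pid_file] {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    fs::remove(pid_file);  // step 5: late cleanup deletes the PID file
                           // the *new* instance just wrote
  });

  // New instance: the test starts it right away (steps 3-4) and it
  // writes its PID into the very same path.
  std::ofstream{pid_file} << 12345 << '\n';

  old_server.join();

  // The new server is alive, but the file its status checks rely on is
  // gone, so the state keeps showing up as "Not responding".
  std::cout << "PID file present after restart: " << std::boolalpha
            << fs::exists(pid_file) << '\n';  // prints: false
}
{code}

The deterministic 100 ms delay just forces the bad ordering; in the real runs it is the scheduler on an overloaded machine that occasionally produces it.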
> Fix new ITs instability for TCs that involve members restart
> -------------------------------------------------------------
>
>                 Key: GEODE-10017
>                 URL: https://issues.apache.org/jira/browse/GEODE-10017
>             Project: Geode
>          Issue Type: Bug
>          Components: native client
>            Reporter: Mario Salazar de Torres
>            Priority: Major
>              Labels: needsTriage
>
> *GIVEN* an integration TC on which a server needs to be restarted
> *WHEN* the server is starting up again
> *THEN* it might happen that the TC gets stuck
> ----
> *Additional information.* This issue does not always happen, and I've seen it happening more frequently with the latest version of the Geode server (1.15.0).
> Some examples of affected TCs are:
> * RegisterKeysTest.RegisterKeySetAndClusterRestart
> * PartitionRegionWithRedundancyTest.putgetWithSingleHop
> * ···
> Also, this is normally the exec flow for the TCs that get stuck:
> # Setup cluster
> # Do TC-specific ops
> # Stop server(s)/cluster shutdown
> # Start server(s)
> In all cases, the server gets stuck at step 4; a possible guard for that step is sketched below.
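If the theory in the comment above holds, one possible hardening on the test-harness side is to wait, after step 3, until the stopped server's PID file is actually gone before running step 4. Below is a minimal sketch; waitForPidFileRemoval is a hypothetical helper (it is not part of the Geode native client test framework) and it assumes the tests can see the server's working directory.

{code:c++}
#include <chrono>
#include <filesystem>
#include <thread>

namespace fs = std::filesystem;

// Hypothetical helper, not part of the Geode test framework: poll until
// the old instance's PID file disappears, giving up after `timeout`.
// Once the file is gone, the old instance's cleanup can no longer clobber
// the PID file the new instance is about to write.
bool waitForPidFileRemoval(const fs::path& pidFile,
                           std::chrono::seconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (fs::exists(pidFile)) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // old instance still shutting down; unsafe to restart
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
  }
  return true;
}
{code}

Calling such a guard between steps 3 and 4 would close the window in which the old instance's late cleanup deletes the file the new instance just wrote; even if the old process lingers a little longer, the dangerous action (the file deletion) has already happened by the time the helper returns true.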