[ 
https://issues.apache.org/jira/browse/GEODE-10017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486856#comment-17486856
 ] 

Mario Salazar de Torres edited comment on GEODE-10017 at 2/4/22, 8:46 AM:
--------------------------------------------------------------------------

After some digging, I noticed that for the executions on which one of these TCs
gets stuck, the GFSH tool running the "start server ..." command never returns.
So I checked why GFSH was not returning and saw that it was stuck checking the
server state, which was always reported as not responding, even though there
was a server instance running, as I could verify by connecting to the cluster
and running "list members", by listing the processes in the test container, and
by looking at the logs.

So, looking at all of the possible scenarios that could cause ServerState to
show up as "Not responding", I noticed that there was no PID file inside the
recently started server's working directory, and that was causing the whole issue.
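
To illustrate the symptom, here is a minimal, hypothetical sketch (not Geode's
actual implementation) of a PID-file based status check like the one GFSH
performs; the class, method and directory names are made up for the example.
When the PID file is gone there is no process to attach to, so the only thing
the check can report is "Not responding":

{code:java}
import java.nio.file.Files;
import java.nio.file.Path;

public class PidFileStatusCheck {

    enum Status { ONLINE, NOT_RESPONDING }

    // Hypothetical helper: looks for the PID file inside the server's working
    // directory (file name assumed to be the usual "vf.gf.server.pid").
    static Status checkServer(Path serverDir) {
        Path pidFile = serverDir.resolve("vf.gf.server.pid");
        if (!Files.exists(pidFile)) {
            // No PID file -> no way to locate the member process, so the state
            // is reported as "Not responding" even if a server JVM is running.
            return Status.NOT_RESPONDING;
        }
        // The real tool would read the PID and contact the process; here we
        // only illustrate the "missing file" branch that makes GFSH keep polling.
        return Status.ONLINE;
    }

    public static void main(String[] args) {
        System.out.println(checkServer(Path.of("server0")));
    }
}
{code}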

Now, my theory for why that happens is the following (see the sketch after the
timing note below):
 # The first instance of the server is notified to stop.
 # The cache for the first server is closed, and the GFSH process stopping the
server exits, returning control to the test process.
 # Given that, from the test's point of view, the server has already been
stopped, it runs the startup of the new server.
 # The newer server instance writes the PID file with its own PID.
 # The older server instance, which was still running, deletes its PID file
along with some other files and actually terminates its process.

The time gap between steps 4 and 5 is normally very small, which would explain
why this issue only reproduces occasionally and mostly on overloaded systems.
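
Below is a rough, self-contained sketch of the suspected race (hypothetical
code, not taken from Geode); it simply replays steps 4 and 5 against the same
working directory to show how the new server's PID file ends up deleted by the
old instance's cleanup:

{code:java}
import java.nio.file.Files;
import java.nio.file.Path;

public class PidFileRace {

    public static void main(String[] args) throws Exception {
        // Shared working directory reused by the restarted server (name is arbitrary).
        Path workDir = Path.of("server0");
        Files.createDirectories(workDir);
        Path pidFile = workDir.resolve("vf.gf.server.pid");

        // Step 4: the newly started server instance writes its own PID.
        Files.writeString(pidFile, Long.toString(ProcessHandle.current().pid()));

        // Step 5: the old instance, still finishing its shutdown in the same
        // directory, deletes "its" PID file as part of cleanup, which by now
        // is the new server's file.
        Files.deleteIfExists(pidFile);

        // Any later status check finds no PID file and reports "Not responding".
        System.out.println("PID file present after restart: " + Files.exists(pidFile));
    }
}
{code}

With that file gone, any subsequent status check falls into the "Not responding"
branch shown above, which matches what I observed in the stuck executions.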



> Fix new ITs unstability for TCs that involve members restart
> ------------------------------------------------------------
>
>                 Key: GEODE-10017
>                 URL: https://issues.apache.org/jira/browse/GEODE-10017
>             Project: Geode
>          Issue Type: Bug
>          Components: native client
>            Reporter: Mario Salazar de Torres
>            Priority: Major
>              Labels: needsTriage
>
> *GIVEN* an integration TC on which a server needs to be restarted
> *WHEN* the server is starting up again
> *THEN* it might happen that the TC gets stuck
> ----
> *Additional information.* This issue does not always happen, and I've seen
> it happening more frequently with the latest version of the Geode server (1.15.0)
> Some examples of this TC are:
>  * RegisterKeysTest.RegisterKeySetAndClusterRestart
>  * PartitionRegionWithRedundancyTest.putgetWithSingleHop
>  * ···
> Also, this is normally the exec flow for the TCs that get stuck:
>  # Setup cluster
>  # Do TC specific ops
>  # Stop server(s)/Cluster shutdown
>  # Start server(s)
> In all cases, the server gets stuck at step 4



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
