[
https://issues.apache.org/jira/browse/GEODE-10017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486856#comment-17486856
]
Mario Salazar de Torres edited comment on GEODE-10017 at 2/4/22, 8:46 AM:
--------------------------------------------------------------------------
After some digging I noticed that, in the executions where one of these TCs
gets stuck, the GFSH tool running the "start server ..." command never returns.
So I checked why GFSH was not returning and saw that it was stuck checking
the server state, which was always reported as not responding, even though
a server instance was running (as I could confirm by connecting to the
cluster and running list member, listing processes in the test container,
and looking at the logs).
So, looking at all the possible scenarios that could cause ServerState to
show up as "Not responding", I noticed that there was no PID file for the
recently started server, and that this was causing the whole issue.
Now, my theory for why that's happening is the following:
# The first instance of the server is notified to stop.
# The cache for the first server is closed, and the GFSH process stopping the
server exits, returning control to the test process.
# Given that, from the point of view of the test, the server has already been
stopped, it runs the startup for the new server.
# The newer server instance writes the PID file with its PID.
# The older server instance, which was still running, deletes its PID file
along with some other files and actually terminates its process.
The time gap between steps 4 and 5 is normally really tight, which would
explain why this issue only reproduces sometimes, and mostly on overloaded
systems.
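
To make the suspected interleaving concrete, here is a minimal sketch that
replays steps 4 and 5 sequentially against a temporary directory. It is not
Geode's actual code: the file name server.pid, the PIDs, and all function names
are hypothetical and chosen only for illustration. It also shows one possible
mitigation (an assumption, not a confirmed fix), where the stopping instance
deletes the PID file only if the file still contains its own PID.

```python
import os
import tempfile

PID_FILE = "server.pid"  # hypothetical file name, for illustration only


def write_pid_file(workdir, pid):
    """Record the PID of the server instance that owns `workdir`."""
    with open(os.path.join(workdir, PID_FILE), "w") as f:
        f.write(str(pid))


def read_pid_file(workdir):
    """Return the recorded PID, or None if the PID file is missing."""
    try:
        with open(os.path.join(workdir, PID_FILE)) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return None


def cleanup_unconditional(workdir, my_pid):
    """Step 5 as theorized: delete whatever PID file exists (my_pid is ignored)."""
    try:
        os.remove(os.path.join(workdir, PID_FILE))
    except FileNotFoundError:
        pass


def cleanup_if_owner(workdir, my_pid):
    """Possible mitigation: delete the PID file only if it still holds our PID."""
    if read_pid_file(workdir) == my_pid:
        os.remove(os.path.join(workdir, PID_FILE))


def simulate_restart(cleanup):
    """Replay the restart; return the PID a later status check would find."""
    old_pid, new_pid = 1001, 1002  # made-up PIDs
    with tempfile.TemporaryDirectory() as workdir:
        write_pid_file(workdir, old_pid)  # old server is running
        # Steps 1-3: stop requested, the stop command returns, test restarts.
        write_pid_file(workdir, new_pid)  # step 4: new server writes its PID
        cleanup(workdir, old_pid)         # step 5: late cleanup by old server
        return read_pid_file(workdir)


if __name__ == "__main__":
    print("unconditional cleanup:", simulate_restart(cleanup_unconditional))
    print("owner-checked cleanup:", simulate_restart(cleanup_if_owner))
```

With the unconditional cleanup the new server ends up without a PID file,
which matches the observed symptom. The owner check still leaves a small
window between reading and deleting the file, so having the test wait for the
old process to exit before starting the new server may be a more robust
option.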
> Fix new ITs unstability for TCs that involve members restart
> ------------------------------------------------------------
>
> Key: GEODE-10017
> URL: https://issues.apache.org/jira/browse/GEODE-10017
> Project: Geode
> Issue Type: Bug
> Components: native client
> Reporter: Mario Salazar de Torres
> Priority: Major
> Labels: needsTriage
>
> *GIVEN* an integration TC in which a server needs to be restarted
> *WHEN* the server is starting up again
> *THEN* it might happen that the TC gets stuck
> ----
> *Additional information.* This issue does not always happen, and I've seen
> it happening more frequently with the latest version of Geode server (1.15.0).
> Some examples of this TC are:
> * RegisterKeysTest.RegisterKeySetAndClusterRestart
> * PartitionRegionWithRedundancyTest.putgetWithSingleHop
> * ···
> Also, this is normally the exec flow for the TCs that get stuck:
> # Setup cluster
> # Do TC specific ops
> # Stop server(s)/Cluster shutdown
> # Start server(s)
> In all cases, the server gets stuck at step 4
--
This message was sent by Atlassian Jira
(v8.20.1#820001)