Juan Ramos created GEODE-9512:
---------------------------------
Summary: java.lang.IllegalStateException: Detected old version
(pre 5.0.1) of GemFire or non-GemFire during handshake due to initial byte
being 1
Key: GEODE-9512
URL: https://issues.apache.org/jira/browse/GEODE-9512
Project: Geode
Issue Type: Bug
Components: membership
Affects Versions: 1.15.0
Reporter: Juan Ramos
I've hit this issue while executing some chaos testing over a GemFire cluster
using 2 locators and 3 servers; {{SSL}} is enabled, and a dummy
{{SecurityManager}} is configured which authenticates any user and always
returns {{true}} within the {{authorize}} method. There are 3
{{PARTITION_REDUNDANT}} regions configured, one per client, each with 1
redundant copy. Once the cluster is up and running, 3 clients continuously
execute {{Region.put}} operations on a well known set of keys for its own
{{Region}} (created with {{PROXY}} type), and another process executes the
following logic in parallel (pseudocode):
{noformat}
for server in ${servers}
do
# Pause the JVM for 30 seconds to simulate a stop the world GC
kill -STOP server
sleep 30
# Unpause the JVM, wait for member to reconnect and regions to recover
redundancy configured
kill -CONT "${SERVER_PID}"
waitForReconnectcompletedInServerLog
waitForNumBucketsWithoutRedundancyToBeZeroInGfshShowRegionMetrics
done
{noformat}
The test works fine most of the time, but randomly fails due to an unexpected
exception logged within the logs for at least one locator. The exception is
always reported from a {{P2P message reader}} thread on the locator for a
server member that has just returned to life, as an example:
{noformat}
#### LOCATOR-0
[warn 2021/08/17 05:20:45.166 GMT system-test-gemfire-locator-0 <P2P message
reader for
system-test-gemfire-server-2(system-test-gemfire-server-2:1)<v4>:41000 unshared
ordered sender uid=61 dom #1 local port=48141 remote port=46174> tid=0x6f] P2P
message reader@354fac47 timed out during a membership check.
[fatal 2021/08/17 05:20:45.166 GMT system-test-gemfire-locator-0 <P2P message
reader for
system-test-gemfire-server-2(system-test-gemfire-server-2:1)<v4>:41000 unshared
ordered sender uid=61 dom #1 local port=48141 remote port=46174> tid=0x6f]
Error deserializing P2P handshake message
java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire or
non-GemFire during handshake due to initial byte being 1
at
org.apache.geode.internal.tcp.Connection.readHandshakeForReceiver(Connection.java:2875)
at
org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2825)
at
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1649)
at org.apache.geode.internal.tcp.Connection.run(Connection.java:1489)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
#### SERVER-2
[warn 2021/08/17 05:20:44.012 GMT system-test-gemfire-server-2 <StatSampler>
tid=0x35] Statistics sampling thread detected a wakeup delay of 29070 ms,
indicating a possible resource issue. Check the GC, memory, and CPU statistics.
[warn 2021/08/17 05:20:44.015 GMT system-test-gemfire-server-2 <Geode Heartbeat
Sender> tid=0x23] Failure detection heartbeat-generation thread overslept by
more than a full period. Asleep time: 31,175,291,931 nanoseconds. Period:
2,500,000,000 nanoseconds.
[info 2021/08/17 05:20:44.143 GMT system-test-gemfire-server-2 <unicast
receiver,system-test-gemfire-server-2-25596> tid=0x1e] saving cache server
configuration for use with the cluster-configuration service on reconnect
[info 2021/08/17 05:20:44.143 GMT system-test-gemfire-server-2 <unicast
receiver,system-test-gemfire-server-2-25596> tid=0x1e] cache server
configuration saved
[info 2021/08/17 05:20:44.233 GMT system-test-gemfire-server-2
<DisconnectThread> tid=0xe5] Stopping membership services
[info 2021/08/17 05:20:44.455 GMT system-test-gemfire-server-2
<ReconnectThread> tid=0xe5] Disconnecting old DistributedSystem to prepare for
a reconnect attempt
[info 2021/08/17 05:20:44.463 GMT system-test-gemfire-server-2
<ReconnectThread> tid=0xe5] GemFireCache[id = 252990056; isClosing = true;
isShutDownAll = false; created = Tue Aug 17 05:11:50 GMT 2021; server = true;
copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.
[info 2021/08/17 05:20:44.544 GMT system-test-gemfire-server-2
<ReconnectThread> tid=0xe5] Cache server on port 40404 is shutting down.
[info 2021/08/17 05:20:44.565 GMT system-test-gemfire-server-2 <Queue Removal
Thread> tid=0x5e] The QueueRemovalThread is done.
{noformat}
Contrary to what the exception message states, it's worth noticing that *_all
members within the cluster (including clients) are using the same Geode
version._*
—
Below are some extra logs from when I was able to reproduce the issue with
{{log-level=fine}} configured on all members:
{noformat}
#### LOCATOR-0
[debug 2021/08/16 20:40:22.858 GMT system-test-gemfire-locator-0 <P2P message
reader for
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared
ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] P2P
handshake remoteAddr is
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 (GEODE
1.15.0)
[warn 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message
reader for
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared
ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] P2P
message reader@6673b475 timed out during a membership check.
[fatal 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message
reader for
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared
ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87]
Error deserializing P2P handshake message
java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire or
non-GemFire during handshake due to initial byte being 1
at
org.apache.geode.internal.tcp.Connection.readHandshakeForReceiver(Connection.java:2875)
at
org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2825)
at
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1649)
at org.apache.geode.internal.tcp.Connection.run(Connection.java:1489)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
[debug 2021/08/16 20:40:23.863 GMT system-test-gemfire-locator-0 <P2P message
reader for
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared
ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87]
readMessages terminated id=/10.12.0.9:58373 from
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000
isHandshakeReader=false
[debug 2021/08/16 20:40:23.863 GMT system-test-gemfire-locator-0 <P2P message
reader for
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared
ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87]
Stopping P2P message reader@6673b475 for
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000
#### SERVER-1
[warn 2021/08/16 20:40:22.698 GMT system-test-gemfire-server-1 <StatSampler>
tid=0x35] Statistics sampling thread detected a wakeup delay of 29219 ms,
indicating a possible resource issue. Check the GC, memory, and CPU statistics.
...
[debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1
<ClientMembership Event Invoker1> tid=0x58] Starting thread for P2P handshake
reader@6161c139-47
[debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P handshake
reader@6161c139-47> tid=0x33] Starting P2P handshake reader@6161c139-47 on
Socket[addr=/10.12.0.9,port=58373,localport=57818]
[debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P handshake
reader@6161c139-47> tid=0x33] readMessages terminated id=/10.12.1.6:44219 from
system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000
isHandshakeReader=false
[debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P handshake
reader@6161c139-47> tid=0x33] Stopping P2P handshake reader@6161c139-47 for
system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000
[debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1
<ClientMembership Event Invoker1> tid=0x58] ConnectionTable: created an ordered
connection:
system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000(uid=47)
[debug 2021/08/16 20:40:23.059 GMT system-test-gemfire-server-1
<ClientMembership Event Invoker1> tid=0x58] Sending on these 1 connections:
[system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000(uid=47)]
[debug 2021/08/16 20:40:23.065 GMT system-test-gemfire-server-1
<ClientMembership Event Invoker1> tid=0x58] distribution of message aborted by
shutdown:
UpdateOperation(EntryEventImpl[op=CREATE;region=/_notificationRegion_10.12.1.6<v3>41000;key=org.apache.geode.management.internal.NotificationKey@9b5d432b;callbackArg=null;originRemote=false;originMember=system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000;callbacksInvoked;id=EventID[system-test-gemfire-server-1(system-test-gemfire-server-1)<v3>:41000;threadID=2;sequenceID=5]])
{noformat}
I'm not familiar with this part of the code but, from the above logs, it looks
like the same reader ({{reader@6161c139-47}}) is being used by two different
threads at the same time?, is this expected?, might that be the cause of the
exception we're seeing?.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)