[
https://issues.apache.org/jira/browse/ZOOKEEPER-4394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17872050#comment-17872050
]
Benoit Sigoure edited comment on ZOOKEEPER-4394 at 8/8/24 3:07 PM:
-------------------------------------------------------------------
One of our dev clusters ran into this bug.
Quorum: zookeeper-0, zookeeper-1(leader), zookeeper-2
Client app holding an ephemeral znode exits:
{code:java}
# zookeeper-0 log
2024-08-07 15:22:58,929 [myid:] - INFO
[NIOWorkerThread-4:o.a.z.s.NIOServerCnxn@337] - Unable to read additional data
from client, it probably closed the socket: address = /10.128.31.11:42544,
session = 0x30019eb942d0028
{code}
shortly thereafter:
{code:java}
# zookeeper-2 log
2024-08-07 15:23:00,009 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@922]
- Peer state changed: following - synchronization
2024-08-07 15:23:00,010 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@490]
- Configuring CommitProcessor with readBatchSize -1 commitBatchSize 1
2024-08-07 15:23:00,010 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@451]
- Configuring CommitProcessor with 8 worker threads.
2024-08-07 15:23:00,010 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FollowerRequestProcessor@59]
- Initialized FollowerRequestProcessor with
zookeeper.follower.skipLearnerRequestToNextProcessor as false
2024-08-07 15:23:00,011 [myid:] - WARN
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.j.MBeanRegistry@110]
- Failed to register MBean InMemoryDataTree
2024-08-07 15:23:00,011 [myid:] - WARN
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.LearnerZooKeeperServer@104]
- Failed to register with JMX
javax.management.InstanceAlreadyExistsException:
org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Follower,name3=InMemoryDataTree
at java.management/com.sun.jmx.mbeanserver.Repository.addMBean(Unknown
Source)
at
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(Unknown
Source)
at
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(Unknown
Source)
at
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(Unknown
Source)
at
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(Unknown
Source)
at
java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(Unknown
Source)
at
org.apache.zookeeper.jmx.MBeanRegistry.register(MBeanRegistry.java:106)
at
org.apache.zookeeper.server.quorum.LearnerZooKeeperServer.registerJMX(LearnerZooKeeperServer.java:102)
at
org.apache.zookeeper.server.ZooKeeperServer.startupWithServerState(ZooKeeperServer.java:730)
at
org.apache.zookeeper.server.ZooKeeperServer.startupWithoutServing(ZooKeeperServer.java:713)
at
org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:760)
at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:108)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1539)
2024-08-07 15:23:00,012 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Learner@721]
- Learner received UPTODATE message
2024-08-07 15:23:00,013 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@917]
- Peer state changed: following - broadcast
2024-08-07 15:23:00,015 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49bc
2024-08-07 15:23:00,015 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49bd
2024-08-07 15:23:00,016 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49be
2024-08-07 15:23:00,103 [myid:] - WARN
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@169]
- Got zxid 0x632029b34c expected 0x1
2024-08-07 15:23:00,604 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49bf
2024-08-07 15:23:02,202 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49c0
2024-08-07 15:23:05,426 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x200001b71c42ddb
2024-08-07 15:23:05,671 [myid:] - INFO
[NIOWorkerThread-1:o.a.z.s.NIOServerCnxn@522] - Processing ruok command from
/127.0.0.1:48192
2024-08-07 15:23:05,671 [myid:] - INFO
[NIOWorkerThread-8:o.a.z.s.NIOServerCnxn@522] - Processing ruok command from
/127.0.0.1:48184
2024-08-07 15:23:07,087 [myid:] - WARN [SyncThread:3:o.a.z.s.p.FileTxnLog@394]
- fsync-ing the write ahead log in SyncThread:3 took 7075ms which will
adversely effect operation latency.File size is 67108880 bytes. See the
ZooKeeper troubleshooting guide
2024-08-07 15:23:10,107 [myid:] - WARN
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@128]
- Exception when following the leader
java.io.EOFException: null
at java.base/java.io.DataInputStream.readInt(Unknown Source)
at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
at
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:228)
at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:124)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1539)
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@142]
- Disconnected from leader (with address:
zookeeper-1-zookeeper.zookeeper.svc.cluster.local./172.29.203.237:2888). Was
connected for 17136ms. Sync state: true
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@286]
- shutdown Follower
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.ZooKeeperServer@853]
- shutting down
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.RequestThrottler@256]
- Shutting down
2024-08-07 15:23:10,108 [myid:] - INFO
[RequestThrottler:o.a.z.s.RequestThrottler@217] - Draining request throttler
queue
2024-08-07 15:23:10,108 [myid:] - INFO
[RequestThrottler:o.a.z.s.RequestThrottler@195] - RequestThrottler shutdown.
Dropped 0 requests
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FollowerRequestProcessor@172]
- Shutting down
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@631]
- Shutting down
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.FinalRequestProcessor@684]
- shutdown of request processor complete
2024-08-07 15:23:10,108 [myid:] - INFO
[FollowerRequestProcessor:3:o.a.z.s.q.FollowerRequestProcessor@128] -
FollowerRequestProcessor exited loop!
2024-08-07 15:23:10,108 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.CommitProcessor@419] - CommitProcessor exited loop!
2024-08-07 15:23:10,191 [myid:] - ERROR
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.p.Util@166] -
Last transaction was partial.
{code}
Resulting state: ephemeral znode held by 10.128.31.11:42544 was correctly
removed on zookeeper-0 and zookeeper-1 but lingers on zookeeper-2.
Env:
{code}
2024-08-07 16:18:46,793 [myid:3] - INFO [main:o.a.z.Environment@98] - Server
environment:zookeeper.version=3.8.2-139d619b58292d7734b4fc83a0f44be4e7b0c986,
built on 2023-07-05 19:24 UTC
2024-08-07 16:18:46,793 [myid:3] - INFO [main:o.a.z.Environment@98] - Server
environment:host.name=zookeeper-2.zookeeper.zookeeper.svc.cluster.local
2024-08-07 16:18:46,794 [myid:3] - INFO [main:o.a.z.Environment@98] - Server
environment:java.version=11.0.20.1
2024-08-07 16:18:46,794 [myid:3] - INFO [main:o.a.z.Environment@98] - Server
environment:java.vendor=Eclipse Adoptium
2024-08-07 16:18:46,794 [myid:3] - INFO [main:o.a.z.Environment@98] - Server
environment:java.home=/opt/java/openjdk
{code}
was (Author: tsuna):
One of our dev clusters ran into this bug.
Quorum: zookeeper-0, zookeeper-1(leader), zookeeper-2
Client app holding an ephemeral znode exits:
{code:java}
# zookeeper-0 log
2024-08-07 15:22:58,929 [myid:] - INFO
[NIOWorkerThread-4:o.a.z.s.NIOServerCnxn@337] - Unable to read additional data
from client, it probably closed the socket: address = /10.128.31.11:42544,
session = 0x30019eb942d0028
{code}
shortly thereafter:
{code:java}
# zookeeper-2 log
2024-08-07 15:23:00,009 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@922]
- Peer state changed: following - synchronization
2024-08-07 15:23:00,010 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@490]
- Configuring CommitProcessor with readBatchSize -1 commitBatchSize 1
2024-08-07 15:23:00,010 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@451]
- Configuring CommitProcessor with 8 worker threads.
2024-08-07 15:23:00,010 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FollowerRequestProcessor@59]
- Initialized FollowerRequestProcessor with
zookeeper.follower.skipLearnerRequestToNextProcessor as false
2024-08-07 15:23:00,011 [myid:] - WARN
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.j.MBeanRegistry@110]
- Failed to register MBean InMemoryDataTree
2024-08-07 15:23:00,011 [myid:] - WARN
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.LearnerZooKeeperServer@104]
- Failed to register with JMX
javax.management.InstanceAlreadyExistsException:
org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Follower,name3=InMemoryDataTree
at java.management/com.sun.jmx.mbeanserver.Repository.addMBean(Unknown
Source)
at
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(Unknown
Source)
at
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(Unknown
Source)
at
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(Unknown
Source)
at
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(Unknown
Source)
at
java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(Unknown
Source)
at
org.apache.zookeeper.jmx.MBeanRegistry.register(MBeanRegistry.java:106)
at
org.apache.zookeeper.server.quorum.LearnerZooKeeperServer.registerJMX(LearnerZooKeeperServer.java:102)
at
org.apache.zookeeper.server.ZooKeeperServer.startupWithServerState(ZooKeeperServer.java:730)
at
org.apache.zookeeper.server.ZooKeeperServer.startupWithoutServing(ZooKeeperServer.java:713)
at
org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:760)
at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:108)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1539)
2024-08-07 15:23:00,012 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Learner@721]
- Learner received UPTODATE message
2024-08-07 15:23:00,013 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@917]
- Peer state changed: following - broadcast
2024-08-07 15:23:00,015 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49bc
2024-08-07 15:23:00,015 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49bd
2024-08-07 15:23:00,016 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49be
2024-08-07 15:23:00,103 [myid:] - WARN
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@169]
- Got zxid 0x632029b34c expected 0x1
2024-08-07 15:23:00,604 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49bf
2024-08-07 15:23:02,202 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49c0
2024-08-07 15:23:05,426 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x200001b71c42ddb
2024-08-07 15:23:05,671 [myid:] - INFO
[NIOWorkerThread-1:o.a.z.s.NIOServerCnxn@522] - Processing ruok command from
/127.0.0.1:48192
2024-08-07 15:23:05,671 [myid:] - INFO
[NIOWorkerThread-8:o.a.z.s.NIOServerCnxn@522] - Processing ruok command from
/127.0.0.1:48184
2024-08-07 15:23:07,087 [myid:] - WARN [SyncThread:3:o.a.z.s.p.FileTxnLog@394]
- fsync-ing the write ahead log in SyncThread:3 took 7075ms which will
adversely effect operation latency.File size is 67108880 bytes. See the
ZooKeeper troubleshooting guide
2024-08-07 15:23:10,107 [myid:] - WARN
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@128]
- Exception when following the leader
java.io.EOFException: null
at java.base/java.io.DataInputStream.readInt(Unknown Source)
at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
at
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:228)
at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:124)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1539)
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@142]
- Disconnected from leader (with address:
zookeeper-1-zookeeper.zookeeper.svc.cluster.local./172.29.203.237:2888). Was
connected for 17136ms. Sync state: true
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@286]
- shutdown Follower
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.ZooKeeperServer@853]
- shutting down
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.RequestThrottler@256]
- Shutting down
2024-08-07 15:23:10,108 [myid:] - INFO
[RequestThrottler:o.a.z.s.RequestThrottler@217] - Draining request throttler
queue
2024-08-07 15:23:10,108 [myid:] - INFO
[RequestThrottler:o.a.z.s.RequestThrottler@195] - RequestThrottler shutdown.
Dropped 0 requests
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FollowerRequestProcessor@172]
- Shutting down
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@631]
- Shutting down
2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.FinalRequestProcessor@684]
- shutdown of request processor complete
2024-08-07 15:23:10,108 [myid:] - INFO
[FollowerRequestProcessor:3:o.a.z.s.q.FollowerRequestProcessor@128] -
FollowerRequestProcessor exited loop!
2024-08-07 15:23:10,108 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.CommitProcessor@419] - CommitProcessor exited loop!
2024-08-07 15:23:10,191 [myid:] - ERROR
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.p.Util@166] -
Last transaction was partial.
{code}
Resulting state: ephemeral znode held by 10.128.31.11:42544 was correctly
removed on zookeeper-0 and zookeeper-1 but lingers on zookeeper-2.
> Learner.syncWithLeader got NullPointerException
> -----------------------------------------------
>
> Key: ZOOKEEPER-4394
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4394
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.7.0
> Environment: ZooKeeper 3.7.0
> Reporter: Liu Haifeng
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> ZooKeeper follower node encountered NullPointerException during
> syncWithLeader.
> Logs indicate that the follower has received NEWLEADER packet between a
> PROPOSAL packet and it's corresponding COMMIT packet. The NEWLEADER packet
> leads to packetsNotCommitted.clear(), yet the COMMIT packet still wants to do
> packetsNotCommitted.peekFirst() to get the former PROPOSAL packet, and the
> later if-statement raised NPE.
> {code:java}
> case Leader.COMMIT:
> case Leader.COMMITANDACTIVATE:
> pif = packetsNotCommitted.peekFirst();
> if (pif.hdr.getZxid() == qp.getZxid() && qp.getType() ==
> Leader.COMMITANDACTIVATE) {
> // ...
> }{code}
> After look into the Leader side, I found:
> # LearnerHandler.syncFollower queues packets with zxid <= maxCommittedLog
> (PROPOSAL/COMMIT pairs);
> # Leader.startForwarding queues toBeApplied packets(PROPOSAL/COMMIT pairs);
> # Leader.startForwarding queues outstandingProposals packets(PROSOAL only);
> # LeanerHandler.run sends NEWLEADER message.
> Seams if the outstandingProposals is not empty at the certain moment, the
> follower could then receive PROPOSAL/NEWLEADER/COMMIT packets in order.
> The follower will retry from LOOKING again and is expected to be succeed at
> last, however, under heavy load it may be too many retries. Further more, I
> my case the follower has to sync data from leader's disk, and start over
> again after the NPE(prior sync not flushed?), which may harm the leader.
> I don't know if it is designed so or not, but consider the performance, can
> we at least avoid wasting of network/disk IO?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)