Thanks for reporting this. Based on the information you provided, I was able to create https://github.com/apache/accumulo/pull/4838. It appears that the Manager, Monitor, and SimpleGarbageCollector create a new instance of ServiceLock on each pass through the loop that waits to acquire the lock (i.e., while they are the standby node). The ServiceLock constructor creates a Watcher in the ZooKeeper client, so each retry adds another Watcher, which is likely what is causing the problem you are seeing. The Manager and Monitor operate a little differently and thus do not exhibit the same OOME problem.
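For anyone following along, here is a minimal sketch of the mechanism using the raw ZooKeeper API (this is not Accumulo's actual ServiceLock code; the class, method names, lock path, and retry interval are made up for illustration). A retry loop that builds a fresh Watcher on every pass keeps adding distinct watcher objects to the client's watch set, whereas constructing the watcher (or, in our case, the ServiceLock) once outside the loop does not:

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class WatcherLeakSketch {

  // Leaky pattern: every retry constructs a fresh Watcher and registers it
  // with the client via exists(). Watchers are retained by the ZooKeeper
  // client until they fire, are removed, or the session closes, so a standby
  // that loops for hours accumulates them.
  static void retryLoopLeaky(ZooKeeper zk, String lockPath)
      throws KeeperException, InterruptedException {
    while (true) {
      Watcher w = new Watcher() {            // new Watcher object each iteration
        @Override
        public void process(WatchedEvent event) {
          // react to changes on the lock node
        }
      };
      if (zk.exists(lockPath, w) == null) {  // registers w on lockPath
        return;                              // prior holder's node is gone
      }
      Thread.sleep(1000);                    // lock still held; poll again
    }
  }

  // Reuse pattern: build the watcher once, outside the loop, and register the
  // same object on every retry; the client's watch set does not grow.
  static void retryLoopReused(ZooKeeper zk, String lockPath)
      throws KeeperException, InterruptedException {
    Watcher w = event -> { /* react to changes on the lock node */ };
    while (true) {
      if (zk.exists(lockPath, w) == null) {
        return;
      }
      Thread.sleep(1000);
    }
  }
}

The second variant reflects the general idea behind the fix: create the lock object once and reuse it across retries, rather than constructing a new one every time the standby fails to get the lock.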
On 2024/08/26 12:13:50 Craig Portoghese wrote:
> Wasn't sure if this was bug territory or an issue with cluster
> configuration.
>
> In my dev environment, I have a 5-server AWS EMR cluster using Accumulo
> 2.1.2, Hadoop 3.3.6, and Zookeeper 3.5.10. The cluster is in high
> availability mode so there are 3 primary nodes with Zookeeper running. On
> the primary nodes I run the manager, monitor, and gc processes. On the 2
> core nodes (with DataNode on them) I run just tablet servers.
>
> The manager and monitor processes on the 2nd and 3rd servers are fine, no
> problems about not being the leader for their process. However, the 2nd and
> 3rd GC processes will repeatedly complain in a DEBUG "Failed to acquire
> lock". It will complain that there is already a gc lock, and then create an
> ephemeral node #0000000001, then #0000000002, etc. After about 8 hours of
> this complaint loop, it will turn into an error "Called
> determineLockOwnership() when ephemeralNodeName == null", which it spams
> forever, filling up the server and eventually killing the server.
>
> This has happened in multiple environments. Is it an issue with GC's
> ability to hold elections? Should I be putting the standby GC processes on
> a different node than the one running one of the zookeepers? Below are
> samples of the two log types:
>
> 2024-08-24T15:28:03,292 [gc.SimpleGarbageCollector] INFO : Trying to acquire ZooKeeper lock for garbage collector
> 2024-08-24T15:28:03,330 [metrics.MetricsUtil] INFO : Metric producer ThriftMetrics initialize
> 2024-08-24T15:28:03,335 [rpc.TServerUtils] DEBUG: Instantiating unsecure custom half-async Thrift server
> 2024-08-24T15:28:03,348 [gc.SimpleGarbageCollector] DEBUG: Starting garbage collector listening on coreNode1.example.domain:9998
> 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Ephemeral node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057 created
> 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Setting watcher on /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Lock held by another process with ephemeral node: zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Establishing watch on prior node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Failed to acquire lock in tryLock(), deleting all at path: /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> 2024-08-24T15:28:59,697 [gc.SimpleGarbageCollector] DEBUG: Failed to get GC ZooKeeper lock, will retry
>
> 2024-08-25T21:48:31,418 [zookeeper.ClientCnxn] ERROR: Error while calling watcher
> java.lang.IllegalStateException: Called determineLockOwnership() when ephemeralNodeName == null
>         at org.apache.accumulo.core.fate.zookeeper.ServiceLock.determineLockOwnership(ServiceLock.java:274) ~[accumulo-core-2.1.2.jar:2.1.2]
>         at org.apache.accumulo.core.fate.zookeeper.ServiceLock$1.process(ServiceLock.java:354) ~[accumulo-core-2.1.2.jar:2.1.2]
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:532) ~[zookeeper-3.5.10.jar:3.5.10--1]
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) ~[zookeeper-3.5.10.jar:3.5.10--1]