Thanks for reporting this. Based on the information you provided, I was able
to create https://github.com/apache/accumulo/pull/4838. It appears that the
Manager, Monitor, and SimpleGarbageCollector each create a new instance of
ServiceLock on every pass through the loop that waits to acquire the lock
(i.e., while they are the standby node). The ServiceLock constructor registers
a Watcher in the ZooKeeper client, which is likely the cause of the problem
you are seeing. The Manager and Monitor operate a little differently and thus
do not exhibit the same OOME problem.
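
To make the pattern concrete, here is a minimal sketch against the raw
ZooKeeper API (the class name, connect string, and lock path are all made up
for illustration; this is not the actual ServiceLock code). Each pass through
the retry loop registers a distinct Watcher instance on the same path, and
the client retains every one of them until its event fires, which on a
long-lived standby may be never:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Hypothetical demo of the leak, not Accumulo code.
    public class WatcherLeakSketch {

      public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});
        String lockPath = "/demo/lock"; // illustrative path

        // Anti-pattern: a fresh Watcher object per attempt.
        while (true) {
          CountDownLatch deleted = new CountDownLatch(1);
          Watcher perAttempt = event -> deleted.countDown(); // new instance every loop
          if (zk.exists(lockPath, perAttempt) == null) {
            break; // lock node is gone; an acquire attempt could happen here
          }
          // If the current holder never releases, this times out and the loop
          // repeats, leaking one more registered watcher each iteration.
          deleted.await(5, TimeUnit.SECONDS);
        }
        zk.close();
      }
    }

Constructing the watcher once and re-registering the same instance across
retries avoids the unbounded growth, since the client tracks watchers per
path in a set and holds a given instance only once.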

On 2024/08/26 12:13:50 Craig Portoghese wrote:
> Wasn't sure if this was bug territory or an issue with cluster
> configuration.
> 
> In my dev environment, I have a 5-server AWS EMR cluster using Accumulo
> 2.1.2, Hadoop 3.3.6, and ZooKeeper 3.5.10. The cluster is in high
> availability mode, so there are 3 primary nodes with ZooKeeper running. On
> the primary nodes I run the manager, monitor, and gc processes. On the 2
> core nodes (which also run DataNode) I run just tablet servers.
> 
> The manager and monitor processes on the 2nd and 3rd servers are fine, with
> no complaints about not being the leader for their process. However, the
> 2nd and 3rd GC processes repeatedly log a DEBUG message, "Failed to acquire
> lock". Each attempt complains that there is already a gc lock and then
> creates an ephemeral node #0000000001, then #0000000002, etc. After about 8
> hours of this retry loop, it turns into the error "Called
> determineLockOwnership() when ephemeralNodeName == null", which it spams
> forever, filling up the server and eventually killing it.
> 
> This has happened in multiple environments. Is it an issue with GC's
> ability to hold elections? Should I be putting the standby GC processes on
> a different node than the one running one of the zookeepers? Below are
> samples of the two log types:
> 
> 2024-08-24T15:28:03,292 [gc.SimpleGarbageCollector] INFO : Trying to
> acquire ZooKeeper lock for garbage collector
> 2024-08-24T15:28:03,330 [metrics.MetricsUtil] INFO : Metric producer
> ThriftMetrics initialize
> 2024-08-24T15:28:03,335 [rpc.TServerUtils] DEBUG: Instantiating unsecure
> custom half-async Thrift server
> 2024-08-24T15:28:03,348 [gc.SimpleGarbageCollector] DEBUG: Starting garbage
> collector listening on coreNode1.example.domain:9998
> 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG:
> [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Ephemeral node
> /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> created
> 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG:
> [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Setting watcher on
> /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG:
> [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Lock held by another process
> with ephemeral node: zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG:
> [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Establishing watch on prior
> node
> /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG:
> [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Failed to acquire lock in
> tryLock(), deleting all at path:
> /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> 2024-08-24T15:28:59,697 [gc.SimpleGarbageCollector] DEBUG: Failed to get GC
> ZooKeeper lock, will retry
> 
> 2024-08-25T21:48:31,418 [zookeeper.ClientCnxn] ERROR: Error while calling
> watcher
> java.lang.IllegalStateException: Called determineLockOwnership() when
> ephemeralNodeName == null
>         at
> org.apache.accumulo.core.fate.zookeeper.ServiceLock.determineLockOwnership(ServiceLock.java:274)
> ~[accumulo-core-2.1.2.jar:2.1.2]
>         at
> org.apache.accumulo.core.fate.zookeeper.ServiceLock$1.process(ServiceLock.java:354)
> ~[accumulo-core-2.1.2.jar:2.1.2]
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:532)
> ~[zookeeper-3.5.10.jar:3.5.10--1]
>         at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> ~[zookeeper-3.5.10.jar:3.5.10--1]
> 
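For reference: the DEBUG sequence in the log above is the standard ZooKeeper
lock recipe. A rough sketch of that flow (the names are illustrative; this is
not Accumulo's actual implementation) shows why every failed attempt produces
a new, higher-numbered ephemeral node:

    import java.util.Comparator;
    import java.util.List;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Illustrative sketch of the lock recipe seen in the DEBUG log.
    public class LockRecipeSketch {

      static boolean tryLock(ZooKeeper zk, String lockDir, Watcher priorNodeWatcher)
          throws KeeperException, InterruptedException {
        // "Ephemeral node ... created": every attempt makes a new sequential
        // znode, which is why the suffix climbs (#0000000057, #0000000058, ...).
        String ourPath = zk.create(lockDir + "/zlock#", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String ourNode = ourPath.substring(lockDir.length() + 1);

        // Order candidates by their zero-padded sequence suffix.
        List<String> candidates = zk.getChildren(lockDir, false);
        candidates.sort(Comparator.comparing(n -> n.substring(n.lastIndexOf('#') + 1)));

        if (ourNode.equals(candidates.get(0))) {
          return true; // lowest sequence number owns the lock
        }

        // "Lock held by another process" / "Establishing watch on prior node":
        // watch the candidate immediately ahead of ours.
        int ourIdx = candidates.indexOf(ourNode);
        zk.exists(lockDir + "/" + candidates.get(ourIdx - 1), priorNodeWatcher);

        // "Failed to acquire lock in tryLock(), deleting all at path": delete
        // our own node so the next attempt starts clean.
        zk.delete(ourPath, -1);
        return false;
      }
    }

The IllegalStateException is consistent with that shape: a watcher left over
from an earlier attempt fires after that attempt's ephemeral node has already
been deleted, so there is no ephemeralNodeName left to check ownership
against.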
