Restarting the secondary GC processes is likely the easiest thing to do. If you can't identify which ones are the standbys, you should be able to restart all of the GC processes. Accumulo can operate without the GC process for some period of time, but it's advisable to keep it running.
On 2024/08/27 12:48:21 Craig Portoghese wrote:
> Thanks Dave! Are there any mitigations I can employ to work around this until 2.1.4 is released? I suppose on the standby servers I can schedule a cronjob to restart the GC process every few hours. I'm not familiar with how long Accumulo can operate without a GC in general, so maybe that's something I should test for my particular database size/use.
>
> On Mon, Aug 26, 2024 at 1:39 PM Dave Marion <dlmar...@apache.org> wrote:
>
> > Thanks for reporting this. Based on the information you provided I was able to create https://github.com/apache/accumulo/pull/4838. It appears that the Manager, Monitor, and SimpleGarbageCollector are creating multiple instances of ServiceLock when in a loop waiting to acquire the lock (when they are the standby node). The ServiceLock constructor creates a Watcher in the ZooKeeper client, which is likely causing the problem you are having. The Manager and Monitor operate a little differently and thus do not exhibit the same OOME problem.
> >
> > On 2024/08/26 12:13:50 Craig Portoghese wrote:
> > > Wasn't sure if this was bug territory or an issue with cluster configuration.
> > >
> > > In my dev environment, I have a 5-server AWS EMR cluster using Accumulo 2.1.2, Hadoop 3.3.6, and Zookeeper 3.5.10. The cluster is in high availability mode, so there are 3 primary nodes with Zookeeper running. On the primary nodes I run the manager, monitor, and gc processes. On the 2 core nodes (with DataNode on them) I run just tablet servers.
> > >
> > > The manager and monitor processes on the 2nd and 3rd servers are fine, no problems about not being the leader for their process. However, the 2nd and 3rd GC processes will repeatedly complain in a DEBUG "Failed to acquire lock". It will complain that there is already a gc lock, and then create an ephemeral node #0000000001, then #0000000002, etc. After about 8 hours of this complaint loop, it will turn into an error "Called determineLockOwnership() when ephemeralNodeName == null", which it spams forever, filling up the server and eventually killing the server.
> > >
> > > This has happened in multiple environments. Is it an issue with GC's ability to hold elections? Should I be putting the standby GC processes on a different node than the one running one of the zookeepers?
> > > Below are samples of the two log types:
> > >
> > > 2024-08-24T15:28:03,292 [gc.SimpleGarbageCollector] INFO : Trying to acquire ZooKeeper lock for garbage collector
> > > 2024-08-24T15:28:03,330 [metrics.MetricsUtil] INFO : Metric producer ThriftMetrics initialize
> > > 2024-08-24T15:28:03,335 [rpc.TServerUtils] DEBUG: Instantiating unsecure custom half-async Thrift server
> > > 2024-08-24T15:28:03,348 [gc.SimpleGarbageCollector] DEBUG: Starting garbage collector listening on coreNode1.example.domain:9998
> > > 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Ephemeral node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057 created
> > > 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Setting watcher on /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> > > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Lock held by another process with ephemeral node: zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> > > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Establishing watch on prior node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> > > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Failed to acquire lock in tryLock(), deleting all at path: /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> > > 2024-08-24T15:28:59,697 [gc.SimpleGarbageCollector] DEBUG: Failed to get GC ZooKeeper lock, will retry
> > >
> > > 2024-08-25T21:48:31,418 [zookeeper.ClientCnxn] ERROR: Error while calling watcher
> > > java.lang.IllegalStateException: Called determineLockOwnership() when ephemeralNodeName == null
> > >         at org.apache.accumulo.core.fate.zookeeper.ServiceLock.determineLockOwnership(ServiceLock.java:274) ~[accumulo-core-2.1.2.jar:2.1.2]
> > >         at org.apache.accumulo.core.fate.zookeeper.ServiceLock$1.process(ServiceLock.java:354) ~[accumulo-core-2.1.2.jar:2.1.2]
> > >         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:532) ~[zookeeper-3.5.10.jar:3.5.10--1]
> > >         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) ~[zookeeper-3.5.10.jar:3.5.10--1]
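
For anyone else following along, here is a rough, self-contained sketch of the pattern Dave describes above. It is not Accumulo's actual ServiceLock code, and the quorum address, lock path, retry interval, and class name are made up for illustration. The point is just that registering a brand-new ZooKeeper Watcher on every pass through a lock-retry loop leaves all of those Watcher objects registered in the client until the watched node changes, so a standby that never wins the lock keeps accumulating them:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustration only: a standby-style loop that polls a lock path and
// registers a brand-new Watcher object on every attempt. Each distinct
// Watcher stays registered in the ZooKeeper client until the watched node
// actually changes, so the set of watchers grows for as long as the
// primary holds the lock.
public class WatcherLeakSketch {

  public static void main(String[] args) throws Exception {
    // Hypothetical quorum and lock path, stand-ins for a real deployment.
    String connectString =
        "zk1.example.domain:2181,zk2.example.domain:2181,zk3.example.domain:2181";
    String lockPath = "/accumulo/example-instance-id/gc/lock";

    // Connect and wait for the session to be established.
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper(connectString, 30_000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    while (true) {
      // Problematic pattern: a fresh Watcher instance per retry. The client
      // keeps every one of them associated with lockPath until that node is
      // created, deleted, or modified, so they pile up on a standby that
      // never acquires the lock.
      Watcher freshWatcher = new Watcher() {
        @Override
        public void process(WatchedEvent event) {
          System.out.println("lock node event: " + event);
        }
      };
      zk.exists(lockPath, freshWatcher);

      // A standby never wins the lock, so the loop keeps going.
      Thread.sleep(1_000);
    }
  }
}

If that is indeed the mechanism, then the fix in the linked PR presumably boils down to constructing the ServiceLock (and therefore its Watcher) once and reusing it across retries rather than creating a new one each time through the loop, though I haven't read the patch.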