For the benefit of Google and/or future me, and with huge thanks to Ed Coleman, here's a quick summary of an issue we hit with Accumulo 1.7.0 and the fix. The details are in Slack, but mixed in with a few red herrings (thanks to me). Some of this is typed from memory, so apologies for any typos:
We recently needed to bounce our moderately sized (19-node) cluster (log4j work on other stuff), but Accumulo failed to restart. Four of the nodes had been down for some time (root cause unknown).

Symptoms

1) The Accumulo monitor showed the list of tables, but with "-" against every entry.
2) The Accumulo files looked OK in HDFS.
3) scan -t accumulo.root (with debug on) in the Accumulo shell gave "Failed to locate tablet for table : +r row :".
4) There were some ZooKeeper warnings in some logs (I forget precisely which), but they weren't hugely informative: ConnectionLoss for /accumulo/{uuid}/root_tablet/walogs. This turns out to be critical, but I didn't realise it at the time.
5) ZooKeeper showed that a tserver should be hosting the root tablet (/accumulo/{id}/root_tablet/location), but that tserver did not hold the expected lock (/accumulo/{id}/tservers/mytservername.domain:9997/zlock-00000000).
6) Using the ZooKeeper CLI, ls /accumulo/{id}/root_tablet/walogs bombed out with the same familiar-looking ConnectionLoss, although with some more helpful info: "Packet len is out of range". (There's a sketch of these diagnostic steps at the end of this post.)

Cause

ZooKeeper clients (the CLI or an Accumulo tserver) fail when listing a znode with a very large number of children, because the response exceeds the client's buffer size. See the docs on jute.maxbuffer here: https://zookeeper.apache.org/doc/r3.7.0/zookeeperAdmin.html#Unsafe+Options. Quite why there were so many children under the walogs node is unknown, but it may have been due to the four inactive tservers.

Fix

Setting "-Djute.maxbuffer=big_value" for all Accumulo processes seemed to fix things. For me, big_value was around 8000000 (i.e. roughly 8MB); a sketch of the config change is at the end of this post. Accumulo came back slowly, found all its data files, and then the number of children of the ZooKeeper walogs node dropped substantially.
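For reference, here's roughly what the ZooKeeper side of the diagnosis looks like from the ZooKeeper CLI. This is a sketch from memory rather than a transcript: the zkCli.sh path, ZooKeeper server name, instance {id} and tserver name are placeholders for whatever your cluster uses. CLIENT_JVMFLAGS is the standard way of passing JVM options to the stock ZooKeeper CLI.

    # connect the ZooKeeper CLI to one of the quorum members
    /opt/zookeeper/bin/zkCli.sh -server zk1:2181

    # find the Accumulo instance id (a UUID) if you don't already know it
    ls /accumulo

    # where does ZooKeeper think the root tablet lives?
    get /accumulo/{id}/root_tablet/location

    # does that tserver actually hold a lock? (in our case it didn't)
    ls /accumulo/{id}/tservers/mytservername.domain:9997

    # this is the listing that bombed out with ConnectionLoss /
    # "Packet len is out of range"
    ls /accumulo/{id}/root_tablet/walogs

    # re-run the CLI with a bigger client buffer to see how many
    # children the walogs node actually has
    CLIENT_JVMFLAGS="-Djute.maxbuffer=8000000" /opt/zookeeper/bin/zkCli.sh -server zk1:2181
    ls /accumulo/{id}/root_tablet/walogs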
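And the fix itself, as a sketch. This assumes a stock Accumulo 1.7 install where the server processes pick up shared JVM options from conf/accumulo-env.sh via ACCUMULO_GENERAL_OPTS; if your env script sets JVM options differently, append the property to whichever variable your processes actually use, then bounce Accumulo on every node.

    # conf/accumulo-env.sh on every node: append jute.maxbuffer to the
    # JVM options shared by all Accumulo processes
    export ACCUMULO_GENERAL_OPTS="${ACCUMULO_GENERAL_OPTS} -Djute.maxbuffer=8000000"

    # then restart the cluster
    $ACCUMULO_HOME/bin/stop-all.sh
    $ACCUMULO_HOME/bin/start-all.sh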