I would suspect that the metadata table became corrupted when the system went 
unstable and two tablet servers somehow ended up both thinking that they were 
responsible for the same extent(s). This should not be because of the balancer 
running.

If you scan the accumulo.metadata table using the shell (scan -t 
accumulo.metadata -c loc, or scan -t accumulo.metadata -c loc -b 
[TABLE_ID#]:[EXTENT] to start at a specific tablet), there will be duplicated 
loc entries.
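
As a rough illustration of what a duplicate looks like (the table id, end row, 
session ids and host names below are made up – the metadata row is roughly 
tableId;endRow, the loc qualifier is the tserver session, and the value is the 
tserver host:port), the same tablet row would show two loc entries:

  root@instance> scan -t accumulo.metadata -c loc
  2;row_xyz loc:15a6b2f4c0b0005 []    tserver-host-1:9997
  2;row_xyz loc:17c3d9e8a1f0002 []    tserver-host-2:9997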

I am uncertain of the best way to fix this and do not have a place to try 
things out, but here are some possible actions.

Shut down / bounce the tservers that have the duplicated assignments – you 
could start with just one and see what happens (a command sketch is below). 
When the tservers go offline, the tablets should be reassigned and maybe only 
one (re)assignment will occur.
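
A minimal sketch of stopping a single tserver from the command line, assuming 
you are not going through your normal service scripts (the host:port is 
whatever the monitor / loc entry shows for that server; the hostname here is 
made up):

  bin/accumulo admin stop tserver-host-1:9997

You would then restart the tserver on that host however your deployment 
normally starts its daemons.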

Try bouncing the manager (master).
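
If you are not using service-manager scripts, a rough sketch of bouncing just 
the master by hand, assuming the daemons are run directly from the Accumulo 
install (the grep pattern and backgrounding here are illustrative, not a known 
recipe):

  # on the active master node, find and kill the master process
  jps -m | grep -i master
  kill <master_pid>

  # then start it again (or use whatever init/systemd or start scripts your
  # deployment normally uses)
  nohup bin/accumulo master &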

If those don’t work, then a very aggressive / dangerous / only-as-a-last-resort 
option:

Delete the specific loc rows from the metadata table (delete [row_id] loc 
[value] -t accumulo.metadata). This will cause a future entry in ZooKeeper – to 
get that to reassign, it might be enough to bounce the master, or you may need 
to shut down / restart the cluster.
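
As a sketch of what that delete might look like in the shell (the row and 
session id are made up – my understanding is that the third argument is the 
column qualifier, i.e. the tserver session shown in the loc entry, so copy 
those values exactly from your own metadata scan before running anything):

  root@instance> delete 2;row_xyz loc 15a6b2f4c0b0005 -t accumulo.metadata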

Ed Coleman

From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Tuesday, April 12, 2022 8:36 AM
To: user@accumulo.apache.org
Subject: Accumulo 1.10.0

Hello. Last weekend we ran out of HDFS space 🙁 – all volumes were 100%; yeah, 
it was crazy. This Accumulo has many tables with good data.

Although Accumulo was up, it had 3 unassigned tablets.

So I added a few nodes to HDFS/Accumulo; now HDFS capacity is 33% empty. I 
issued the HDFS rebalance command (just in case), so all good. The Accumulo 
unassigned tablets went away, but the tables show no assigned tablets on the 
Accumulo monitor.

On the active master I am seeing the error:

ERROR: Error processing table state for store Normal Tablets
java.lang.RuntimeException: 
org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
 found two locations for the same extent xxxxxxxx

Question is, am I getting this because the balancer is running, and once it 
finishes will it recover? What can be done to save this cluster?

Thanks

-S
