I'll tar them up and see what I can find! Thanks.
On 03/17/2016 08:18 PM, Michael Wall wrote:
Andrew,
Sounds a lot like https://issues.apache.org/jira/browse/ACCUMULO-4157.
I'll look to see if what you describe could also happen with this
bug. If you still have the gc logs, can you look for a message like
"Removing WAL for offline server" with the uuid?
Mike
On Tue, Mar 8, 2016 at 11:28 AM, Andrew Hulbert <[email protected]> wrote:
Hi folks,
We experienced a problem this morning with a recovery on 1.6.1
that went something like this:
FileNotFoundException: File does not exist:
hdfs:///accumulo/recovery/<uuid>/failed/data
at Tablet.java:1410
at Tablet.java:1233
etc.
at TabletServer:2923
Interestingly enough, at hdfs:///accumulo/recovery/<uuid>/failed
was a 0 byte file, not a directory...and it was preventing tablets
from getting assigned. (I am not sure what caused the original
failure, but I believe what happened is that a tserver node was going
down...the master indicated it was trying to shut down a
tserver that was so badly off that someone just rekicked the node.)
I looked through the fixes for 1.6.2 through 1.6.5 but didn't see
anything related on the release notes pages, though I haven't gone
through all the tickets yet. I haven't been able to get anyone to
upgrade to 1.6.5 yet, and perhaps it's already fixed.
Just wondering if that's something that has been seen before?
In order to fix it, I just deleted the failed file and it proceeded.
Thanks!
Andrew
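
For anyone who wants to check for the same condition, here is a minimal sketch that scans a recursive HDFS listing for the situation described above: a 0 byte regular file named "failed" under the recovery directory. It only parses text in the standard `hdfs dfs -ls -R` column layout and does not touch HDFS itself. The function name and the sample paths are hypothetical, and whether such a file is safe to delete on a given cluster is not something this sketch can decide.

```python
def find_zero_byte_failed_files(ls_lines):
    """Return paths named 'failed' that are 0 byte regular files.

    ls_lines: lines in `hdfs dfs -ls -R` layout, i.e.
    permissions, replication, owner, group, size, date, time, path.
    """
    suspects = []
    for line in ls_lines:
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # skip headers such as "Found N items" and blank lines
        perms, _repl, _owner, _group, size, _date, _time, path = fields
        is_regular_file = perms.startswith("-")  # directories start with 'd'
        if is_regular_file and size == "0" and path.rstrip("/").endswith("/failed"):
            suspects.append(path)
    return suspects


if __name__ == "__main__":
    # Hypothetical sample listing; real input would come from
    # `hdfs dfs -ls -R /accumulo/recovery`.
    sample = [
        "drwxr-xr-x   - accumulo supergroup          0 2016-03-08 09:15 /accumulo/recovery/example-uuid",
        "-rw-r--r--   3 accumulo supergroup          0 2016-03-08 09:15 /accumulo/recovery/example-uuid/failed",
    ]
    for path in find_zero_byte_failed_files(sample):
        print(path)
```

To use it against a real cluster, save the output of `hdfs dfs -ls -R /accumulo/recovery` to a file and pass its lines to the function.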