Andrew,

Sounds a lot like https://issues.apache.org/jira/browse/ACCUMULO-4157. I'll look to see if what you describe could also happen with this bug. If you still have the gc logs, can you look for a message like "Removing WAL for offline server" with the uuid?
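
A rough sketch of that log search, in case it helps. The helper name `find_wal_removals`, the `gc*.log*` file naming, and the log directory are assumptions, not something confirmed in this thread; adjust them to match your install. It also assumes the uuid appears on the same line as the message.

```shell
# Hypothetical helper: search GC logs for WAL-removal messages that
# mention a given recovery UUID.
# Assumptions: logs live in one directory and match gc*.log*.
find_wal_removals() {
  log_dir="$1"   # directory holding the gc logs (assumed layout)
  uuid="$2"      # the <uuid> from the recovery path

  # First grep keeps only the WAL-removal messages; second grep keeps
  # those lines that also contain the UUID as a fixed string.
  grep -h "Removing WAL for offline server" "$log_dir"/gc*.log* 2>/dev/null \
    | grep -F -- "$uuid"
}

# Usage (placeholders, not real values):
#   find_wal_removals "$ACCUMULO_HOME/logs" "<uuid>"
```

No output means no matching line was found in the files searched, which could also just mean the logs have rolled or live somewhere else.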
Mike

On Tue, Mar 8, 2016 at 11:28 AM, Andrew Hulbert <[email protected]> wrote:

> Hi folks,
>
> We experienced a problem this morning with a recovery on 1.6.1 that went
> something like this:
>
> FileNotFoundException: File does not exist:
> hdfs:///accumulo/recovery/<uuid>/failed/data
>
> at Tablet.java:1410
> at Tablet.java:1233
> etc.
> at TabletServer:2923
>
> Interestingly enough, hdfs:///accumulo/recovery/<uuid>/failed was a 0
> byte file, not a directory...and it was preventing tablets from getting
> assigned (I am not sure what caused the original failure, but I believe
> what happened is a tserver node was going down...the master indicated it
> was trying to shut down a tserver which was so bad off someone just
> rekicked the node).
>
> I looked through the fixes for 1.6.2,3,4,5 and didn't see anything related
> on the release notes pages, but I haven't gone through all the tickets yet.
> I haven't been able to get anyone to upgrade to 1.6.5 yet and perhaps it's
> already fixed.
>
> Just wondering if that's something that has been seen before?
>
> In order to fix it I just deleted the failed file and it proceeded.
>
> Thanks!
>
> Andrew
>
