I'll tar them up and see what I can find! Thanks.

On 03/17/2016 08:18 PM, Michael Wall wrote:
Andrew,

Sounds a lot like https://issues.apache.org/jira/browse/ACCUMULO-4157. I'll look to see if what you describe could also happen with this bug. If you still have the gc logs, can you look for a message like "Removing WAL for offline server" with the uuid?
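One way to run that search is sketched below, in case it saves some time. It only does a plain substring match on the message text and the uuid, so it makes no assumptions about the gc log line format. The function name and the command-line usage are hypothetical, not part of Accumulo.

```python
import sys

# Message text quoted above; matched as a plain substring.
MARKER = "Removing WAL for offline server"


def find_wal_removals(lines, uuid):
    """Return the log lines that contain both the marker message and the uuid."""
    return [line.rstrip("\n") for line in lines if MARKER in line and uuid in line]


if __name__ == "__main__":
    # Hypothetical usage: python find_wal_removals.py <uuid> gc.log [more logs...]
    uuid = sys.argv[1]
    for path in sys.argv[2:]:
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with open(path, errors="replace") as fh:
            for hit in find_wal_removals(fh, uuid):
                print(f"{path}: {hit}")
```

If nothing is printed for the uuid from the recovery path, the gc probably did not log that removal in the files scanned.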

Mike

On Tue, Mar 8, 2016 at 11:28 AM, Andrew Hulbert <[email protected]> wrote:

    Hi folks,

    We experienced a problem this morning with a recovery on 1.6.1
    that went something like this:

    FileNotFoundException: File does not exist:
    hdfs:///accumulo/recovery/<uuid>/failed/data

    at Tablet.java:1410
    at Tablet.java:1233
    etc.
    at TabletServer:2923

    Interestingly enough, hdfs:///accumulo/recovery/<uuid>/failed was a
    0-byte file, not a directory, and it was preventing tablets from
    getting assigned. (I am not sure what caused the original failure,
    but I believe a tserver node was going down: the master indicated
    it was trying to shut down a tserver, which was so badly off that
    someone just rekicked the node.)

    I looked through the fixes for 1.6.2, 1.6.3, 1.6.4, and 1.6.5 but
    didn't see anything related on the release notes pages, though I
    haven't gone through all the tickets yet. I haven't been able to get
    anyone to upgrade to 1.6.5 yet, and perhaps it's already fixed.

    Just wondering if that's something that has been seen before?

    In order to fix it I just deleted the failed file, and it proceeded.

    Thanks!

    Andrew
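
The condition described in the message above (a 0-byte regular file named "failed" under the recovery directory) can be looked for with a small script. This is only a sketch: it assumes the text output of a recursive listing such as `hdfs dfs -ls -R /accumulo/recovery` has been captured, and the function name and the default root path are illustrative. It only reports candidates and deletes nothing.

```python
def find_zero_byte_failed(ls_output, recovery_root="/accumulo/recovery"):
    """Return paths of 0-byte regular files named 'failed' under recovery_root.

    ls_output is the captured text of an `hdfs dfs -ls -R` listing, whose
    entry lines have eight columns:
    permissions, replication, owner, group, size, date, time, path.
    """
    suspects = []
    for line in ls_output.splitlines():
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # skips "Found N items" headers and blank lines
        perms, size, path = fields[0], fields[4], fields[7]
        is_dir = perms.startswith("d")
        name = path.rstrip("/").split("/")[-1]
        if (not is_dir and size == "0" and name == "failed"
                and recovery_root + "/" in path):
            suspects.append(path)
    return suspects
```

Any path it returns is worth inspecting by hand before removing anything, since the listing alone does not say which recovery is still in progress.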


