Re: Recovery file versus directory

Andrew Hulbert Sat, 19 Mar 2016 00:15:06 -0700

Looks like the only thing we have in the gc logs are:


DEBUG: deleted [hdfs://../accumulo/wal/<uuid> ...]
DEBUG: Removing sorted WAL hdfs://...<uuid>

I can't tell whether they were logged before or after I deleted the file:


hdfs://accumulo/wal/<uuid>/failed

Here's the other issue we were looking at:

https://issues.apache.org/jira/browse/ACCUMULO-3727

FYI, I originally increased the num WALs up to 8 to help batch write ingest... Now I've modified it to apply only to the tables that needed ingest instead of the entire cluster, and reset the num WALs for the cluster back to 3, and I haven't had any errors since (3 days). Not sure why that would be a problem except for the few times that the metadata table was involved.


Andrew

On 03/18/2016 09:43 AM, Andrew Hulbert wrote:

I'll tar them up and see what I can find! Thanks.

On 03/17/2016 08:18 PM, Michael Wall wrote:

Andrew,

Sounds a lot like https://issues.apache.org/jira/browse/ACCUMULO-4157. I'll look to see if what you describe could also happen with this bug. If you still have the gc logs, can you look for a message like "Removing WAL for offline server" with the uuid?
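
The log search suggested above could be sketched roughly as follows. This is only an illustration: the function name `find_wal_events` is made up here, the phrases are the ones quoted in this thread, and the exact gc log line format is an assumption.

```python
from typing import Iterable, List

# Phrases quoted in this thread; the real gc log wording may differ.
PHRASES = (
    "Removing WAL for offline server",
    "Removing sorted WAL",
    "deleted",
)


def find_wal_events(lines: Iterable[str], uuid: str) -> List[str]:
    """Return the log lines that mention `uuid` together with one of PHRASES.

    Hypothetical helper: it does plain substring matching and makes no
    assumption about timestamps or log levels.
    """
    hits = []
    for line in lines:
        if uuid in line and any(phrase in line for phrase in PHRASES):
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, open the saved gc log, pass the file object as `lines`, and pass the WAL uuid in question as `uuid`. Matches come back in file order, which helps when comparing them against the time of a manual delete.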


Mike

On Tue, Mar 8, 2016 at 11:28 AM, Andrew Hulbert wrote:


    Hi folks,

    We experienced a problem this morning with a recovery on 1.6.1
    that went something like this:

    FileNotFoundException: File does not exist:
    hdfs:///accumulo/recovery/<uuid>/failed/data

    at Tablet.java:1410
    at Tablet.java:1233
    etc.
    at TabletServer:2923

    Interestingly enough, at hdfs:///accumulo/recovery/<uuid>/failed
    was a 0 byte file, not a directory...and it was preventing
    tablets from getting assigned (I am not sure what caused the
    original failure, but I believe what happened is a tserver node
    was going down...the master indicated it was trying to shut down
    a tserver that was so badly off someone just rekicked the node).
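
The file-versus-directory condition described above could be checked with something like the sketch below. It assumes the `hdfs` command is on the PATH and relies on the `-e`, `-d` and `-z` flags of `hdfs dfs -test`. The helper names `hdfs_test` and `classify` are invented for illustration.

```python
import subprocess
from typing import Callable


def hdfs_test(flag: str, path: str) -> bool:
    """Run `hdfs dfs -test <flag> <path>`; exit status 0 means the test passed."""
    return subprocess.call(["hdfs", "dfs", "-test", flag, path]) == 0


def classify(path: str, test: Callable[[str, str], bool] = hdfs_test) -> str:
    """Describe what sits at `path`: missing, directory, zero-length file, or file.

    `test` is injectable so the logic can be exercised without a cluster.
    """
    if not test("-e", path):
        return "missing"
    if test("-d", path):
        return "directory"
    if test("-z", path):
        return "zero-length file"
    return "file"
```

Calling `classify("hdfs:///accumulo/recovery/<uuid>/failed")` with the real uuid filled in would report "zero-length file" in the situation described here, where a directory might be expected instead.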

    I looked through the fixes for 1.6.2,3,4,5 but didn't see
    anything related on the release notes pages but I haven't gone
    through all the tickets yet. I haven't been able to get anyone to
    upgrade to 1.6.5 yet and perhaps it's already fixed.

    Just wondering if that's something that has been seen before?

    In order to fix it I just deleted the failed file and it proceeded

    Thanks!

    Andrew
