It looks like the only relevant entries we have in the gc logs are:

DEBUG: deleted [hdfs://../accumulo/wal/<uuid> ...]
DEBUG: Removing sorted WAL hdfs://...<uuid>

I can't tell whether they were logged before or after I deleted the file

hdfs://accumulo/wal/<uuid>/failed

Here's the other issue we were looking at:

https://issues.apache.org/jira/browse/ACCUMULO-3727

FYI, I originally increased the number of WALs to 8 to help batch-write ingest. I've since applied that setting only to the tables that needed it for ingest rather than to the entire cluster, and reset the number of WALs for the cluster back to 3. I haven't had any errors since (3 days). I'm not sure why that would be a problem, except for the few times that the metadata table was involved.

Andrew

On 03/18/2016 09:43 AM, Andrew Hulbert wrote:
I'll tar them up and see what I can find! Thanks.

On 03/17/2016 08:18 PM, Michael Wall wrote:
Andrew,

Sounds a lot like https://issues.apache.org/jira/browse/ACCUMULO-4157. I'll look to see if what you describe could also happen with this bug. If you still have the gc logs, can you look for a message like "Removing WAL for offline server" with the uuid?

Mike

On Tue, Mar 8, 2016 at 11:28 AM, Andrew Hulbert wrote:

    Hi folks,

    We experienced a problem this morning with a recovery on 1.6.1
    that went something like this:

    FileNotFoundException: File does not exist:
    hdfs:///accumulo/recovery/<uuid>/failed/data

    at Tablet.java:1410
    at Tablet.java:1233
    etc.
    at TabletServer:2923

    Interestingly enough, at hdfs:///accumulo/recovery/<uuid>/failed
    was a 0 byte file, not a directory...and it was preventing
    tablets from getting assigned (I am not sure what caused the
    original failure, but I believe what happened is a tserver node
    was going down...the master indicated it was trying to shut down
    a tserver that was so badly off someone just rekicked the node).

    I looked through the fixes for 1.6.2 through 1.6.5 but didn't see
    anything related on the release notes pages, though I haven't gone
    through all the tickets yet. I haven't been able to get anyone to
    upgrade to 1.6.5 yet, and perhaps it's already fixed.

    Just wondering if that's something that has been seen before?

    To fix it, I just deleted the failed file and it proceeded.

    Thanks!

    Andrew



