The Accumulo file garbage collection mechanism is designed to fail safe: it
only deletes files it knows are no longer in use. It also tries to do this
with minimal interaction with the HDFS NameNode (so, no scanning the
entire file system to find files). It's possible that in some
circumstances, servers can crash in a way that leaves a file on the file
system that Accumulo is no longer using, but for which Accumulo has no
evidence of its existence, so it doesn't know to clean it up. That is a
preferable failure mode to aggressively deleting files that could still be
in use.

My recommendation is to periodically check your file system for such
orphaned files and decide whether to delete them based on their age or
content. They should only appear after a server failure, so you could
perform that check during triage/investigation of whatever failure occurred
in your system. You could also write a small monitoring service to identify
old unreferenced files and report them to you by whatever means you prefer;
a rough sketch of such a check is below. Since these files should only
appear after an unexpected failure, it's hard to provide a general solution
within Accumulo itself.
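
To give a concrete idea of what that check might look like, here is a
rough sketch in Java using the public Accumulo 2.x client API and the
Hadoop FileSystem API. The client properties path, the /accumulo/tables
location, and the age threshold are placeholder assumptions you'd adjust
for your environment. A real version would also scan accumulo.root (the
metadata table's own file references live there) and you'd want to verify
any candidates by hand before deleting anything.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    import org.apache.accumulo.core.client.Accumulo;
    import org.apache.accumulo.core.client.AccumuloClient;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;
    import org.apache.hadoop.io.Text;

    public class OrphanFileReport {
      public static void main(String[] args) throws Exception {
        // Only report files older than a week; recent ones may still be settling.
        long maxAgeMs = 7L * 24 * 60 * 60 * 1000;

        // Collect every rfile name currently referenced in the metadata table.
        Set<String> referenced = new HashSet<>();
        try (AccumuloClient client =
              Accumulo.newClient().from("/path/to/accumulo-client.properties").build();
            Scanner scanner =
              client.createScanner("accumulo.metadata", Authorizations.EMPTY)) {
          scanner.fetchColumnFamily(new Text("file"));
          for (Map.Entry<Key,Value> entry : scanner) {
            // The qualifier holds the file path; compare just the file name so
            // relative vs. absolute references don't matter.
            String path = entry.getKey().getColumnQualifier().toString();
            referenced.add(path.substring(path.lastIndexOf('/') + 1));
          }
        }

        // Walk the table directories in HDFS and report old rfiles nothing refers to.
        FileSystem fs = FileSystem.get(new Configuration());
        RemoteIterator<LocatedFileStatus> files =
            fs.listFiles(new Path("/accumulo/tables"), true);
        long now = System.currentTimeMillis();
        while (files.hasNext()) {
          LocatedFileStatus status = files.next();
          String name = status.getPath().getName();
          boolean oldEnough = now - status.getModificationTime() > maxAgeMs;
          if (name.endsWith(".rf") && oldEnough && !referenced.contains(name)) {
            System.out.println("possible orphan: " + status.getPath());
          }
        }
      }
    }

Run on a schedule (cron or similar), that gives you a report of old,
unreferenced rfiles to review rather than anything that deletes
automatically.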

On Wed, Jun 29, 2022, 07:54 Hart, Andrew via user <user@accumulo.apache.org>
wrote:

> Hi,
>
>
>
> I have some rfiles in hdfs that aren’t referenced in the accumulo.metadata.
>
> So there will be a file like   8500000000 2022-02-02 11:59
> /accumulo/tables/3/t-1234567/Cabcdef.rf
>
> but grep -t accumulo.metadata Cabcdef.rf doesn’t find anything.
>
>
>
> Is there any way run the gc process so that it cleans up the orphan rfiles?
>
>
>
> And.
>
>
