The Accumulo file garbage collection mechanism is designed to fail safe: it only deletes files it knows are no longer in use. It also tries to do this with minimal interaction with the HDFS NameNode (so, no scanning the entire file system to find files). In some circumstances a server can crash in a way that leaves a file on the file system that Accumulo is no longer using, but for which Accumulo has no evidence of its existence, so it does not know to clean it up. That failure scenario is preferable to aggressively deleting files that could still be in use.
My recommendation is to periodically check your file system for such orphaned files and decide whether to delete them based on their age or content. These should only appear after a server failure, so you could do this as part of triage/investigation of whatever failure occurred in your system. You could also write a small monitoring service that identifies old unreferenced files and reports them to you by whatever means you prefer (a rough sketch of that idea is at the bottom of this mail, below your question). Since these files should only appear after an unexpected failure, it's hard to provide a general solution within Accumulo itself.

On Wed, Jun 29, 2022, 07:54 Hart, Andrew via user <user@accumulo.apache.org> wrote:

> Hi,
>
> I have some rfiles in hdfs that aren’t referenced in the accumulo.metadata.
>
> So there will be a file like
>
>   8500000000 2022-02-02 11:59 /accumulo/tables/3/t-1234567/Cabcdef.rf
>
> but grep -t accumulo.metadata Cabcdef.rf doesn’t find anything.
>
> Is there any way to run the gc process so that it cleans up the orphan rfiles?
>
> And.
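
For the monitoring idea mentioned above, something along these lines could work. This is only a rough sketch, not anything that ships with Accumulo: it assumes the Accumulo 2.x client API, a client properties file at /path/to/accumulo-client.properties, that file references live in the metadata table's "file" column family, and that a week-old unreferenced .rf file is worth reporting. The path-matching logic in particular may need adjusting for how your version stores file paths (relative vs. fully qualified).

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.Text;

public class OrphanRfileReport {
  public static void main(String[] args) throws Exception {
    // Only report files older than a week; anything newer may belong to an in-flight operation.
    long maxAgeMs = 7L * 24 * 60 * 60 * 1000;

    // 1. Collect every file path referenced in the metadata table ("file" column family).
    Set<String> referenced = new HashSet<>();
    try (AccumuloClient client = Accumulo.newClient()
            .from("/path/to/accumulo-client.properties").build();  // assumption: adjust for your setup
         Scanner scanner = client.createScanner("accumulo.metadata", Authorizations.EMPTY)) {
      scanner.fetchColumnFamily(new Text("file"));
      for (Map.Entry<Key,Value> entry : scanner) {
        referenced.add(entry.getKey().getColumnQualifier().toString());
      }
    }

    // 2. Walk /accumulo/tables in HDFS and report old rfiles that nothing references.
    FileSystem fs = FileSystem.get(new Configuration());
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(new Path("/accumulo/tables"), true);
    long now = System.currentTimeMillis();
    while (files.hasNext()) {
      LocatedFileStatus status = files.next();
      String path = status.getPath().toUri().getPath();
      if (!path.endsWith(".rf") || now - status.getModificationTime() < maxAgeMs) {
        continue;
      }
      // Metadata entries may be relative or fully qualified depending on version,
      // so match in both directions; tighten this check for your deployment.
      boolean isReferenced = referenced.stream()
          .anyMatch(r -> r.endsWith(path) || path.endsWith(r));
      if (!isReferenced) {
        System.out.println("possible orphan: " + path + " (" + status.getLen() + " bytes)");
      }
    }
  }
}

Note that this only reports, it never deletes, and before acting on the output you would also want to check accumulo.root, since that is where the file references for the metadata table itself live.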