On Sun, Apr 25, 2010 at 01:18:25PM +0200, Angelo Arrifano wrote:
> Hello developers developers and developers,
> 
> Ever wondered how much crap is left in your X-years old Gentoo box?
> 
> I just developed a python utility to efficiently find orphaned files in
> the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
> 
> The package builds a virtual filesystem (cache) on the RAM using python
> hash tables. Then it uses the cache to find the ownership of files
> inside user-specified dirs.
> 
> Building the cache takes less than 10 seconds here in a system with 1366
> installed packages.
> 
> This is not intended to be a finished program yet; I'm looking forward
> to your constructive comments.

You're going to want to resolve paths with realpath here. You'll also 
need to handle symlinks, and keep in mind that spaces are allowed in 
paths.  I'd personally suggest using one of the package manager (PM) 
APIs for this.
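
To make the pitfalls concrete, here is a rough sketch of what a hand-rolled parser has to deal with. It assumes the entry layout commonly seen in vdb CONTENTS files ("dir PATH", "obj PATH MD5 MTIME", "sym PATH -> TARGET MTIME"); the function names are made up for illustration and are not part of any PM API.

```python
import os


def parse_contents_line(line):
    """Return (kind, path) for one CONTENTS-style entry.

    Assumed layout (not taken from any spec):
      dir PATH
      obj PATH MD5 MTIME
      sym PATH -> TARGET MTIME
    """
    kind, rest = line.rstrip("\n").split(" ", 1)
    if kind == "obj":
        # Split from the right so spaces inside PATH survive.
        path, _md5, _mtime = rest.rsplit(" ", 2)
    elif kind == "sym":
        # Everything before the first " -> " is the link path.
        # A path that itself contains " -> " stays ambiguous here.
        path, _target_and_mtime = rest.split(" -> ", 1)
    else:
        path = rest
    return kind, path


def normalize(path):
    """Resolve symlinked parent directories, keep the last component.

    The final component is left alone because a symlink can itself be
    the thing a package owns.
    """
    path = path.rstrip("/") or "/"
    parent, name = os.path.split(path)
    return os.path.join(os.path.realpath(parent), name)
```

Both the recorded paths and the scanned paths would need to go through the same normalization before comparing them, otherwise a file reached through a symlinked directory shows up as orphaned.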

Part of the reason I advise poking at the PM APIs is that they cover up 
some of the nastier details, some to do with contents and others with 
parsing. A simple example:

python -c "
import sys
from pkgcore.config import load_config
from pkgcore.fs import contents, livefs
# Collect every entry owned by an installed package (the vdb repo).
owned = contents.contentsSet()
for pkg in load_config().get_default('domain').named_repos['vdb']:
    owned.update(pkg.contents)
# Scan the live filesystem and keep the entries no package owns.
stream = (x for x in livefs.iter_scan(sys.argv[1]) if x not in owned)
print('\n'.join(map(str, sorted(stream))))
" desired-path

Note also that this was written *very* quickly.  I'd personally look at 
serializing sorted lists to disk for both streams (what contents says 
is on disk vs. what is actually on disk), and then walking the two 
lists in lockstep; that way you can keep the memory usage down.
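
A minimal sketch of that lockstep walk, in plain Python with no pkgcore involved. It assumes both inputs are already sorted with the same ordering and hold one path per line; `read_paths` and `orphans` are hypothetical names, and paths containing newlines would need a different on-disk format.

```python
def read_paths(filename):
    """Yield one path per line from a pre-sorted file on disk."""
    with open(filename) as handle:
        for line in handle:
            yield line.rstrip("\n")


def orphans(owned_sorted, live_sorted):
    """Yield paths from live_sorted that are absent from owned_sorted.

    Both inputs must be sorted the same way. Only one item from each
    stream is held in memory at a time.
    """
    owned = iter(owned_sorted)
    current = next(owned, None)
    for path in live_sorted:
        # Advance the owned stream until it catches up with path.
        while current is not None and current < path:
            current = next(owned, None)
        if current != path:
            yield path
```

Usage would look something like `for p in orphans(read_paths('owned.txt'), read_paths('live.txt')): print(p)`, where the two file names are placeholders for wherever the sorted lists were written.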

~harring
