On Sun, Apr 25, 2010 at 01:18:25PM +0200, Angelo Arrifano wrote:
> Hello developers developers and developers,
>
> Ever wondered how much crap is left in your X-years old Gentoo box?
>
> I just developed a python utility to efficiently find orphaned files in
> the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
>
> The package builds a virtual filesystem (cache) on the RAM using python
> hash tables. Then it uses the cache to find the ownership of files
> inside user-specified dirs.
>
> Building the cache takes less than 10 seconds here in a system with 1366
> installed packages.
>
> This is not intended to be a finished program yet, I'm looking forward
> for your constructive commentaries.

You're going to want to do realpathing here... also you'll need to
handle symlinks, and spaces are allowed in paths.

I'd personally suggest using one of the PM APIs for this. Part of the
reason I advise poking at the PM APIs is that they cover up some of the
nastier details w/ contents and others w/ parsing; simple example,

python -c "
import sys
from pkgcore.config import load_config
from pkgcore.fs import contents, livefs
contents = contents.contentsSet()
for pkg in load_config().get_default('domain').named_repos['vdb']:
    contents.update(pkg.contents)
stream = (x for x in livefs.iter_scan(sys.argv[1]) if x not in contents)
print '\n'.join(map(str, sorted(stream)))
" desired-path

Note also that's a *very* quick write-up. I'd personally look at
serializing the sorted lists to disk for both streams (what contents
says is on disk vs what is actually on disk), and then lockstep walking
the lists; via that you can keep the memory usage down.

~harring
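
For illustration, the lockstep walk over two sorted lists could look roughly like the sketch below. This is plain Python, not pkgcore API; the function name `orphans` and its arguments are made up for the example. It assumes both streams are already sorted in the same order that Python's string comparison uses, with one path per entry.

```python
def orphans(owned_sorted, on_disk_sorted):
    """Yield entries of on_disk_sorted that do not appear in owned_sorted.

    Both arguments are iterables of strings in ascending order. Only one
    entry from each stream is held at a time, so memory use stays flat
    no matter how long the lists are.
    """
    owned = iter(owned_sorted)
    current = next(owned, None)
    for path in on_disk_sorted:
        # Advance the owned stream until it catches up with this path.
        while current is not None and current < path:
            current = next(owned, None)
        # No match at this position: nothing claims the path.
        if current != path:
            yield path
```

Since open files iterate line by line, the two serialized lists can be passed in directly, for example `sys.stdout.writelines(orphans(open(owned_file), open(disk_file)))` with both file names being placeholders. One path per line copes with spaces in paths, though not with newlines in them.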