August 17, 2018 1:09 AM, "Andrew Udvare" <audv...@gmail.com> wrote:

> The whitelist is the biggest work in progress right now. Most of what it
> lists from /etc for me is /etc/config-archive which AFAIK is not managed by
> Portage at all although Portage will place old files there? I don't use the
> feature because my /etc is controlled by Git. The stuff listed in /var/ is
> pretty accurate as there's a lot of old website cruft and this computer does
> not serve anything like that anymore.

Well, for example, I use eselect-repository, which puts repos in
/var/dbr/repos. I put the Gentoo tree in there as well, and the whole tree is
suggested for deletion.
A solution would be to read the /etc/portage/repos.conf file(s) for repo
locations during the runtime detection, or to use the portageq interface.
Or tell people to manually whitelist their repo locations once the config
file is available ;)
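
To make the repos.conf idea concrete, here is a rough sketch in Python (just
for brevity, it is not meant as gcrud code). It assumes repos.conf is either a
single INI file or a directory of files ending in .conf, and that each repo
section carries a `location` key. The function name `repo_locations` is made
up for this example.

```python
import configparser
from pathlib import Path


def repo_locations(conf_path="/etc/portage/repos.conf"):
    """Return {repo name: location} for every repo section in repos.conf."""
    conf = Path(conf_path)
    # Assumption: a directory holds *.conf files; otherwise it is one file.
    files = sorted(conf.glob("*.conf")) if conf.is_dir() else [conf]
    # Disable interpolation so a stray '%' in a value cannot raise an error.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(files)  # missing or unreadable files are silently skipped
    locations = {}
    for section in parser.sections():  # [DEFAULT] is not listed here
        location = parser.get(section, "location", fallback=None)
        if location:
            locations[section] = location
    return locations
```

Every returned location could then be added to the runtime whitelist as a
whole directory.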

You could add directories containing a .keep file to the whitelist, although
I'm not sure how to specify that.
The same goes for git repositories: I’d rather delete a whole git repo or
nothing at all inside it, so it would help to have a rule that can be read as
"suggest the parent dir of a .git dir for deletion, ignore all children of
said parent".
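
A minimal sketch of such a rule, again in Python and only as an illustration.
It works on the list of candidate paths alone and assumes the .git entries
themselves show up in that list (they should, being unowned). The name
`collapse_marked_dirs` is hypothetical, and the inner loop is quadratic in the
worst case, so a real implementation would want something smarter.

```python
def collapse_marked_dirs(paths, marker=".git"):
    """Fold every path inside a directory holding `marker` into that directory."""
    paths = list(paths)
    # ".../repo/.git" or ".../repo/.git/HEAD" reveals ".../repo" as a root.
    roots = set()
    for p in paths:
        parts = p.split("/")
        if marker in parts:
            root = "/".join(parts[:parts.index(marker)])
            if root:  # never collapse the filesystem root itself
                roots.add(root)
    result = []
    seen = set()
    for p in paths:
        # Replace the path with its outermost enclosing root, if there is one.
        enclosing = [r for r in roots if p == r or p.startswith(r + "/")]
        out = min(enclosing, key=len) if enclosing else p
        if out not in seen:
            seen.add(out)
            result.append(out)
    return result
```

With marker=".keep" the same folding would yield the directories to
whitelist instead of the ones to suggest.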

> The idea is to move to everything in the whitelist.c file to a declarative
> (no code unless you count RE) configuration file. I have not decided on a
> format but I am leaning towards INI-style because GLib2 has a parser for
> that built-in. The config file will specify exact paths, RE, and globs.
> There will be a default dynamic list generated at runtime based on what
> packages you have installed (as gcruft had this feature).

That will be nice, waiting for it ;) Something basic might be enough for
running batches of tests before settling on a definitive format.
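
For test batches, something as basic as the following might do. This is purely
a guess at a layout: the [whitelist] section and the paths/globs/regex keys
are invented for the example, not a format anyone has agreed on, and it is
written in Python only to keep it short.

```python
import configparser
import fnmatch
import re

# Hypothetical layout: one section, three multi-line keys.
EXAMPLE = """
[whitelist]
paths =
    /etc/config-archive
globs =
    /lib*/modules/*
regex =
    ^/lib(64)?/dhcpcd/dhcpcd-hooks(/.*)?$
"""


def load_whitelist(text):
    """Parse the INI text into exact paths, glob patterns and compiled regexes."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["whitelist"]

    def lines(key):
        return [l.strip() for l in section.get(key, "").splitlines() if l.strip()]

    return {
        "paths": set(lines("paths")),
        "globs": lines("globs"),
        "regex": [re.compile(r) for r in lines("regex")],
    }


def is_whitelisted(path, wl):
    """True if the path matches an exact entry, a glob or a regex."""
    return (
        path in wl["paths"]
        or any(fnmatch.fnmatchcase(path, g) for g in wl["globs"])
        or any(r.search(path) for r in wl["regex"])
    )
```

Note that fnmatch lets `*` match across `/`, so the glob above also covers
everything below each modules directory.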

>> I also caught some wrongly listed files because of the multilib system
>> with /lib symlink.
>> For example, dhcpcd declared /lib/dhcpcd/dhcpcd-hooks, thus the realpath
>> /lib64/dhcpcd/dhcpcd-hooks was listed in the removal suggestion. This
>> should be fixed with profile 17.1
>
> The /lib vs /lib64 issue will be resolved in a later version. I think I
> need to use lstat() everywhere instead of stat(), or I can call realpath()
> prior to storing values in the set. This file should be whitelisted, but
> only if you have dhcpcd installed (I've long since moved to dhcpd).

I’m in favor of the realpath suggestion; it will be useful for any path
accessed through a symlink.
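
Roughly what I have in mind, sketched in Python rather than C. Both the owned
paths and the paths found on disk go through the same canonicalisation before
they are compared. One choice here is mine, not something you proposed: only
the directory part is resolved, so a package-owned symlink is not confused
with its target. The names `canonical` and `unowned` are made up.

```python
import os


def canonical(path):
    """Resolve symlinks in the directory part, keep the last component as is."""
    head, tail = os.path.split(path)
    return os.path.join(os.path.realpath(head), tail)


def unowned(found_paths, owned_paths):
    """Return the found paths whose canonical form is not owned by a package."""
    owned = {canonical(p) for p in owned_paths}
    return [p for p in found_paths if canonical(p) not in owned]
```

With /lib pointing to /lib64, an owned /lib/dhcpcd/dhcpcd-hooks and a found
/lib64/dhcpcd/dhcpcd-hooks end up as the same entry, so the latter is no
longer suggested.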

>> The log is so huge at the moment it is useless for me :/
>> 
>> % wc -l out.log
>> 461575 out.log
> 
> Any thoughts on how to simplify analysis?

A few, but I’m not sure many of them are /universal/ across Gentoo systems.
Do you plan to integrate the sorting part into gcrud directly?
If so, I’d suggest showing the /usr/* entries first, because unowned files
there should be exceptions.
The same goes for /lib, but things like kernel modules should be treated
carefully: we can either whitelist the whole of /lib{,32,64}/modules, or try
to be smart and select old kernel modules only.
This might be tricky given the number of ways someone can manage them.
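
As an illustration of the ordering only (the kernel-module question is left
out), here is a small Python sketch. The PRIORITY list is my own guess at a
sensible order and `sort_for_review` is a made-up name.

```python
# Top-level directories to show first; everything else follows alphabetically.
PRIORITY = ["usr", "lib", "lib32", "lib64"]


def top_level(path):
    """Return the first component of an absolute path ('usr' for /usr/bin/x)."""
    return path.split("/", 2)[1] if path.startswith("/") else ""


def sort_for_review(paths):
    """Order paths so that prioritised top-level directories come first."""
    rank = {name: i for i, name in enumerate(PRIORITY)}
    return sorted(paths, key=lambda p: (rank.get(top_level(p), len(PRIORITY)), p))
```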

Also, here is a small analysis of the file locations reported by gcrud.

% cut -d/ -f2 out.log|uniq -c
295 etc
3309 lib64
1178 lib
13 opt
39586 usr
417194 var
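
Note that uniq -c only merges adjacent lines, so the one-liner above relies on
the log already being grouped by directory. If the same summary were built
into the tool, a plain counter would not depend on the order. A short Python
sketch, with `count_by_top_level` as a made-up name:

```python
from collections import Counter


def count_by_top_level(lines):
    """Count log lines per first path component, like `cut -d/ -f2`."""
    counts = Counter()
    for line in lines:
        parts = line.strip().split("/")
        # Skip blank lines and anything that is not an absolute path.
        if len(parts) > 1 and parts[1]:
            counts[parts[1]] += 1
    return counts
```

Calling count_by_top_level(open("out.log")).most_common() would list the
largest directories first.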

Since /var contains my various repos, it's logical that it has the most
occurrences.
Next comes /usr, which contains another lib{,32,64} scheme with /usr/lib
pointing to /usr/lib64, plus the installed Go packages (in /usr/lib64/go).
Given this, I suppose most of these entries will disappear when using
realpath or switching to the 17.1 profile.

Thanks for your work, this will probably be an excellent tool in a few commits ;)

Regards,
Corentin “Nado” Pazdera
