August 17, 2018 1:09 AM, "Andrew Udvare" <audv...@gmail.com> wrote:
> The whitelist is the biggest work in progress right now. Most of what it
> lists from /etc for me is /etc/config-archive which AFAIK is not managed by
> Portage at all although Portage will place old files there? I don't use the
> feature because my /etc is controlled by Git. The stuff listed in /var/ is
> pretty accurate as there's a lot of old website cruft and this computer
> does not serve anything like that anymore.

Well, for example, I use eselect-repository, which puts repos in
/var/dbr/repos. I put the gentoo tree in there as well, and the whole tree is
suggested for deletion.
A solution would be to read the /etc/portage/repos.conf file(s) for repo
locations during runtime detection, or to use the portageq interface.
Or tell people to whitelist their repo locations manually once the config
file is available ;)

You could add directories containing a .keep file to the whitelist, although
I'm not sure how to specify it.
The same goes for git repositories: I’d rather delete a whole git repo or
nothing at all inside it, so a rule could say "pick the parent dir of a .git
dir to suggest for deletion, and ignore all children of said parent".

> The idea is to move everything in the whitelist.c file to a declarative
> (no code unless you count RE) configuration file. I have not decided on a
> format but I am leaning towards INI-style because GLib2 has a parser for
> that built-in. The config file will specify exact paths, RE, and globs.
> There will be a default dynamic list generated at runtime based on what
> packages you have installed (as gcruft had this feature).

That will be nice, waiting for it ;)
Something basic might be enough for running batches of tests before choosing
a definitive format.

>> I also caught some wrongly listed files because of the multilib system with
>> /lib symlink.
>> For example, dhcpcd declared /lib/dhcpcd/dhcpcd-hooks, thus the realpath
>> /lib64/dhcpcd/dhcpcd-hooks was listed in the removal suggestion.
This should be fixed with profile 17.1.

> The /lib vs /lib64 issue will be resolved in a later version. I think I need
> to use lstat() everywhere instead of stat(), or I can call realpath() prior
> to storing values in the set. This file should be whitelisted, but only if
> you have dhcpcd installed (I've long since moved to dhcpd).

I’m in favor of the realpath suggestion; it will be useful for any path
reached through a symlink.

>> The log is so huge at the moment it is useless for me :/
>>
>> % wc -l out.log
>> 461575 out.log
>
> Any thoughts on how to simplify analysis?

A few, but I’m not sure I have many that are /universal/ across Gentoo
systems.
Do you plan to integrate the sorting part into gcrud directly?
If so, I’d suggest showing the /usr/* entries first, because un-owned files
there should be exceptions.
The same goes for /lib, but things like kernel modules should be treated
carefully: we can either whitelist the whole of /lib{,32,64}/modules, or try
to be smart and select old kernel modules only. This might be tricky given
the number of ways someone can manage them.

Also, here is a small analysis of the file locations reported by gcrud.

% cut -d/ -f2 out.log|uniq -c
    295 etc
   3309 lib64
   1178 lib
     13 opt
  39586 usr
 417194 var

/var contains my different repos, so it's logical that it has the most
occurrences.
Next comes usr, which contains another lib{,32,64} layout with /usr/lib
pointing to /usr/lib64, and has go packages installed (in /usr/lib64/go).
With this information, I suppose most entries will disappear when using
realpath or switching to the 17.1 profile.

Thanks for your work, this will probably be an excellent tool in a few
commits ;)

Regards,
Corentin “Nado” Pazdera
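
P.S. To make the .git rule a bit more concrete, here is a rough C sketch of
how it could be expressed. This is not gcrud code: git_repo_root() and
path_is_under() are names I made up for illustration, and the sketch assumes
absolute paths that have already been normalized (no trailing slashes, no
".." components).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of gcrud.
 * If `path` contains a ".git" component, copy the directory that holds it
 * (the repository root) into `out` and return true. Return false when there
 * is no ".git" component or when `out` is too small. */
bool git_repo_root(const char *path, char *out, size_t out_size)
{
    const char *p = path;

    while ((p = strstr(p, "/.git")) != NULL) {
        char next = p[5]; /* character right after "/.git" */

        /* Accept only a whole component, so ".gitignore" or ".github"
         * do not count as a repository marker. */
        if (next == '/' || next == '\0') {
            size_t len = (size_t)(p - path);

            if (len == 0)
                len = 1; /* "/.git": the root is "/" itself */
            if (len + 1 > out_size)
                return false;
            memcpy(out, path, len);
            out[len] = '\0';
            return true;
        }
        p += 5;
    }
    return false;
}

/* Hypothetical helper, not part of gcrud.
 * Return true when `path` is `root` itself or lives below it. */
bool path_is_under(const char *root, const char *path)
{
    size_t n = strlen(root);

    if (strncmp(root, path, n) != 0)
        return false;
    if (n > 0 && root[n - 1] == '/')
        return true; /* root is "/" (or ends with a slash) */
    /* Require a component boundary so "/a/repo" does not match "/a/repo2". */
    return path[n] == '\0' || path[n] == '/';
}
```

The idea would be to remember each root found by git_repo_root() and to skip
any later candidate for which path_is_under() is true, so the report lists
the repository once instead of every file inside it.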