Hi Chris,

On Wed, Jan 11, 2017 at 05:04:11PM +0000, Chris Lamb wrote:
> I've just removed some duplicates in a package [0] with symlinks, but
> I was wondering if I am missing a page or feature where I can see all
> "my" offenses against duplicated content, preferably ordered by (for
> example) the number of bytes duplicated?

Great. The service is meant to show the low-hanging fruit of archive
space waste. Maintainer information is not currently extracted, which
makes creating a per-maintainer page difficult. Actually presenting the
data is the hard part. Significant effort has been spent on making the
relevant computations "fast enough", and a port to PostgreSQL is
stalled because I was unable to obtain decent performance. If you have
concrete ideas and are interested in helping implement them, that'd be
great, but we should probably take this off d-devel then.

I haven't considered per-maintainer views important so far, because I
tend to focus on individual packages temporarily and avoid becoming a
long-term maintainer.

> Seeing the worst offenders in the Debian archive would also be
> fascinating.

This is partially possible already. The site exports a data file for
use with packages.qa.d.o (a port to tracker.d.o is still outstanding,
see #756765). It is available at
https://dedup.debian.net/static/ptslist.txt.

If you are interested and are a DD, you can simply run

    ssh delfin.debian.org sqlite3 /srv/dedup.debian.org/dedup.sqlite3

and start playing around. If you are not a DD and have a close mirror
around, creating that data file is a simple matter of downloading the
mirror and should finish within 12h on a fast machine. You can find
detailed instructions in the README. The README also has a few more
interesting queries.

I tried to do more interesting things with this data and to find more
interesting hash functions. ssdeep turned out not to work well. I have
mostly moved on since and keep maintaining it as a diagnostic tool.

In any case, hacking on the code should be relatively easy, as you can
simply import /var/cache/apt/archives as a sample population for
testing locally. All the code is tailored to being easily runnable
locally.

Happy hacking, and if you have questions, just ask (IRC or mail is
fine).

Helmut