I suspect that compliance with the GDPR would require the data to be
stored minimally. It seems reasonable to me that a 24-hour deduplication
window would filter out most repeat downloads.

If you stream the request log and reduce it to (ip, package, version)
tuples, the retained data is minimal. I think it would fit into memory,
e.g. 10 million unique IP addresses x 100 packages x 40 bytes = 40 GB.

The program code could ideally be generalized and used by other distros
as well.
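Below is a rough sketch of what such a streaming reducer could look
like. It assumes a simplified log line format of
"<timestamp> <client-ip> <request-path>" and the usual pool/ path
layout; the real mirror log format will differ, so treat the parsing as
illustrative only:

#!/usr/bin/env python3
"""Privacy-friendly download counter (sketch).

Reads a request log from stdin, one request per line in the assumed
format "<iso-timestamp> <client-ip> <request-path>", reduces each
request to a salted hash of (day, ip, package, version) and prints
unique download counts per (package, version). The random salt only
lives in process memory, so the output cannot be mapped back to IPs.
"""
import hashlib
import os
import re
import sys
from collections import defaultdict

# Matches pool paths such as .../pool/main/h/hello/hello_2.10-3_amd64.deb
DEB_RE = re.compile(r"/pool/\S+/(?P<pkg>[^_/]+)_(?P<ver>[^_]+)_[^_]+\.deb$")

salt = os.urandom(16)          # never written to disk
seen = set()                   # salted hashes of (day, ip, pkg, ver)
counts = defaultdict(int)      # (pkg, ver) -> unique downloads

for line in sys.stdin:
    try:
        timestamp, ip, path = line.split(maxsplit=2)
    except ValueError:
        continue               # skip malformed lines
    m = DEB_RE.search(path.strip())
    if not m:
        continue               # not a .deb download
    day = timestamp[:10]       # YYYY-MM-DD: the 24-hour dedup window
    key = hashlib.sha256(
        salt + f"{day}|{ip}|{m['pkg']}|{m['ver']}".encode()
    ).digest()
    if key in seen:
        continue               # repeat download from the same IP that day
    seen.add(key)
    counts[(m["pkg"], m["ver"])] += 1

for (pkg, ver), n in sorted(counts.items()):
    print(f"{pkg}\t{ver}\t{n}")

At the scale estimated above, the in-memory set of hashes could also be
replaced with a probabilistic structure (e.g. a per-package HyperLogLog)
to reduce memory further, at the cost of exact counts.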
On Sat, May 3, 2025 at 10:43 AM Philipp Kern <p...@philkern.de> wrote:
>
> On 2025-05-03 03:35, Otto Kekäläinen wrote:
> >> I'm interested in package popularity. I'm aware of popcon
> >> (https://popcon.debian.org/), but I'm more interested in actual
> >> downloads.
> >
> > I am also interested in usage statistics. I feel it is much more
> > meaningful to work on packages that I know have a lot of users.
> >
> > While neither popcon nor download stats are accurate, they still show
> > trends and relative numbers which can be used to make useful
> > conclusions. I would be glad to see if people could share ideas on
> > what stats we could collect and publish instead of just pointing out
> > flaws in various stats.
>
> The problem is that we currently do not want to retain this data. It'd
> require a clear measure of usefulness, not just a "it would be nice if
> we had it". And there would need to be actual criteria of what we would
> be interested in. Raw download count? Some measure of bucketing by
> source IP or not? What about container/hermetic builders fetching the
> same ancient package over and over again from snapshot? Does the
> version matter?
>
> In the end there would probably need to be a proof of concept of a log
> processor that's privacy-friendly and gives us the metrics that we
> actually want. Hence my question what these metrics are for, except
> for a fuzzy feeling of "working on the right priorities". There will
> be lots of packages that are rarely downloaded and still important.
>
> Everyone can ask "please just retain all logs and we will do analysis
> on them later". Right now it'd be infeasible to get the statistics
> from the mirrors, and we could at most get statistics for deb.d.o. To
> give a sense of scale: We are sampling 1% of cache hits and all errors
> right now. That's 6.7 GB/d uncompressed (500 M/d compressed). Back of
> the envelope math says that'd be 600 GB/d of raw syslog log traffic.
> We should have a very good reason for collecting this much data.
>
> Kind regards
> Philipp Kern
>