I suspect that GDPR compliance would require us to store only a
minimal amount of data.
It seems reasonable to me that a 24-hour deduplication window would
filter out most repeat downloads.
If you stream the request log and reduce it to (ip, package, version)
tuples, the retained data stays minimal.
I think it would fit into memory, e.g. 10 million unique IP addresses x
100 packages x 40 bytes = 40 GB.
The program code could ideally be generalized and used by other distros as well.
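
Something like this might be enough as a starting point. It is only a
rough sketch in Python: the Apache-style log format, the column
positions and the pool path layout below are guesses on my part, not
the actual deb.debian.org setup.

#!/usr/bin/env python3
# Rough sketch of a privacy-friendly log reducer. Assumes an Apache-style
# access log on stdin and pool paths like
# /debian/pool/main/h/hello/hello_2.10-3_amd64.deb -- the field positions
# and path layout are assumptions, not the real deb.debian.org format.
import hashlib
import re
import sys
from collections import Counter

# .../pool/.../<name>_<version>_<arch>.deb
DEB_RE = re.compile(r"/pool/\S+/([^/_\s]+)_([^_\s]+)_[^_\s]+\.deb")

seen = set()        # dedup set for the current 24-hour window
counts = Counter()  # (package, version) -> unique downloaders in the window

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 7:
        continue
    ip, path = fields[0], fields[6]   # assumed column layout
    m = DEB_RE.search(path)
    if not m:
        continue
    package, version = m.group(1), m.group(2)
    # Keep only a truncated hash of the IP, never the address itself;
    # rotating a salt daily would also make hashes unlinkable across days.
    ip_hash = hashlib.sha256(ip.encode()).digest()[:8]
    key = (ip_hash, package, version)
    if key in seen:
        continue                      # repeat download inside the window
    seen.add(key)
    counts[(package, version)] += 1

for (package, version), n in counts.most_common():
    print(f"{package}\t{version}\t{n}")

The seen set keeps only a truncated hash of the IP plus package and
version, so raw addresses never need to be stored, and throwing the set
away every 24 hours gives the deduplication window. That is roughly one
small record per unique (ip, package, version) per day, in line with
the estimate above, though Python's per-object overhead would push the
real memory use somewhat higher.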

On Sat, May 3, 2025 at 10:43 AM Philipp Kern <p...@philkern.de> wrote:
>
> On 2025-05-03 03:35, Otto Kekäläinen wrote:
> >> I'm interested in package popularity. I'm aware of popcon
> >> (https://popcon.debian.org/), but I'm more interested in actual
> >> downloads.
> >
> > I am also interested in usage statistics. I feel it is much more
> > meaningful to work on packages that I know have a lot of users.
> >
> > While neither popcon nor download stats are accurate, they still show
> > trends and relative numbers which can be used to draw useful
> > conclusions. I would be glad if people could share ideas on what
> > stats we could collect and publish, instead of just pointing out
> > flaws in various stats.
>
> The problem is that we currently do not want to retain this data. It'd
> require a clear measure of usefulness, not just a "it would be nice if
> we had it". And there would need to be actual criteria of what we would
> be interested in. Raw download count? Some measure of bucketing by
> source IP or not? What about container/hermetic builders fetching the
> same ancient package over and over again from snapshot? Does the version
> matter?
>
> In the end there would probably need to be a proof of concept of a log
> processor that's privacy-friendly and gives us the metrics that we
> actually want. Hence my question about what these metrics are for, except for
> a fuzzy feeling of "working on the right priorities". There will be lots
> of packages that are rarely downloaded and still important.
>
> Everyone can ask "please just retain all logs and we will do analysis on
> them later". Right now it'd be infeasible to get the statistics from the
> mirrors, and we could at most get statistics for deb.d.o. To give a
> sense of scale: We are sampling 1% of cache hits and all errors right
> now. That's 6.7 GB/d uncompressed (500 MB/d compressed). Back of the
> envelope math says that'd be 600 GB/d of raw syslog log traffic. We
> should have a very good reason for collecting this much data.
>
> Kind regards
> Philipp Kern
>
