On 2025-05-03 03:35, Otto Kekäläinen wrote:
I'm interested in package popularity. I'm aware of popcon
(https://popcon.debian.org/), but I'm more interested in actual
downloads.

I am also interested in usage statistics. I feel it is much more
meaningful to work on packages that I know have a lot of users.

While neither popcon nor download stats are accurate, they still show
trends and relative numbers which can be used to make useful
conclusions. I would be glad to see if people could share ideas on
what stats we could collect and publish instead of just pointing out
flaws in various stats.

The problem is that we currently do not want to retain this data. Retaining it would require a clear measure of usefulness, not just an "it would be nice if we had it". And there would need to be actual criteria for what we are interested in. Raw download count? Some measure of bucketing by source IP, or not? What about container/hermetic builders fetching the same ancient package over and over again from snapshot? Does the version matter?

In the end there would probably need to be a proof of concept of a log processor that's privacy-friendly and gives us the metrics that we actually want. Hence my question: what are these metrics for, beyond a fuzzy feeling of "working on the right priorities"? There will be lots of packages that are rarely downloaded and still important.
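To make the open questions concrete, here is a minimal sketch of what such a proof of concept might look like. Everything in it is an assumption for illustration: the two-field "IP path" log line format is hypothetical (not what the CDN emits), the /24 and /48 truncation and the per-run salt are just one possible way to bucket by source IP without keeping addresses, and the function names are made up. It reports both a raw count and a bucketed count per package, and ignores the version.

```python
import hashlib
import ipaddress
import secrets
from collections import defaultdict

# Per-run random salt: buckets cannot be linked across runs or reversed later.
SALT = secrets.token_bytes(16)


def bucket(ip: str) -> str:
    """Map an address to an opaque bucket: truncate to /24 (IPv4) or /48 (IPv6), then hash."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return hashlib.sha256(SALT + str(net).encode()).hexdigest()[:16]


def package_of(path: str):
    """Return (name, version) for a pool path ending in name_version_arch.deb, else None."""
    filename = path.rsplit("/", 1)[-1]
    if not filename.endswith(".deb"):
        return None
    parts = filename[:-4].split("_")
    if len(parts) != 3:
        return None
    return parts[0], parts[1]


def summarize(lines):
    """Return {package: (raw_downloads, distinct_buckets)}; no address is retained."""
    raw = defaultdict(int)
    buckets = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:  # hypothetical format: "<ip> <path>"
            continue
        ip, path = fields
        pkg = package_of(path)
        if pkg is None:
            continue
        try:
            b = bucket(ip)
        except ValueError:  # not a parseable address
            continue
        name, _version = pkg  # version deliberately ignored in this sketch
        raw[name] += 1
        buckets[name].add(b)
    return {name: (raw[name], len(buckets[name])) for name in raw}
```

A builder fetching the same package repeatedly from one network would raise the raw count but not the bucketed one, which is exactly the kind of difference the criteria would need to settle.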

Everyone can ask "please just retain all logs and we will do analysis on them later". Right now it'd be infeasible to get the statistics from the mirrors, and we could at most get statistics for deb.d.o. To give a sense of scale: We are sampling 1% of cache hits and all errors right now. That's 6.7 GB/d uncompressed (500 MB/d compressed). Back-of-the-envelope math says that'd be 600 GB/d of raw syslog log traffic. We should have a very good reason for collecting this much data.
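The extrapolation can be sketched as follows. The function name and the error_gb parameter are hypothetical; the split between sampled cache hits and fully logged errors is not stated above, so the error share is left as an input rather than a known figure. Treating the whole 6.7 GB/d as 1% sampled hits gives an upper bound in the same ballpark as the estimate.

```python
def full_volume_gb(sampled_gb: float, hit_sample_rate: float, error_gb: float = 0.0) -> float:
    """Estimate daily log volume if every cache hit were logged.

    sampled_gb      -- volume currently logged per day (sampled hits plus all errors)
    hit_sample_rate -- fraction of cache hits that are logged today (0.01 for 1%)
    error_gb        -- assumed part of sampled_gb that is errors, already logged in full
    """
    sampled_hits_gb = sampled_gb - error_gb
    return error_gb + sampled_hits_gb / hit_sample_rate


# Upper bound: assume all 6.7 GB/d are 1%-sampled cache hits.
upper_bound = full_volume_gb(6.7, 0.01)
print(f"upper bound: about {upper_bound:.0f} GB/d")
```

Any nonzero error share lowers the result, since errors are already logged at 100% and do not scale up.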

Kind regards
Philipp Kern
