Hi Michał, and the rest of the Gentoo devs,

I've been patiently sitting and watching this discussion.
I raised some ideas with another developer (not Michał) just days before he raised this thread to the ML. I believe all the points raised so far are valid; I'll try to summarise:

1. This must be completely *opt in*.
2. Anonymity was discussed by various parties (privacy).
3. "Spam" protection (i.e. preventing bogus data from entering).
4. Trustworthiness of data.
5. Acceptance of some form of privacy policy.

In my opinion, points 2 and 3 work against each other: if registration is compulsory for submitting stats, we can control spam more easily (not foolproof), but requiring registration also raises the entry barrier. I'd be completely willing to provide at least an email address as part of a submission.

All of the replies seem to have focused purely on yes/no, do it or don't. Not many have addressed the benefits to end users/system administrators; the focus seems to be on what we as developers can get out of this.

Regarding the above points:

1. I fully agree. This should not be forced on anyone.
2. Happy to concede that some people may wish to submit anonymously. Let them.
3. I'll address this below.
4. A lot of the discussion has been around the usefulness of the data, and I concede to Thomas that this may (or may not) generate "decision blind spots" or, as he put it, "artificially increase decision certainty". I don't see how this is worse than what we've got now.
5. We have the infrastructure for this already by way of licenses. So we ship with "GPLv2/3/whatever + GentooPrivacy", and users have to first take explicit action to accept GentooPrivacy.

I have some other ideas around this, which will tread even further on privacy, but again, all of this should be a kind of opt-in. Building on the ideas from Kent, who suggested a form of submission proxy (STATS_SERVER), we could potentially give the full benefit of the code to such entities, but then still allow them to submit "upstream" in a more filtered manner.

Bottom line, in my opinion: any data is better than no data! Whilst we can't say "no one is using xyz", we will at least be able to say "hey, some people are using xyz", and whilst this may generate some blind spots, it at least enables us to test known use cases during test builds. E.g. if we know for a fact that a thousand users are using package X with USE flags "-* a b c", we should definitely run that as a compile test.

Your build breaks frequently? Would you mind submitting stats? Great, thank you. If you're not willing to do that, then my stance becomes one of "OK, I'll help where I can, but really, please consider letting us help you: if you submit stats we can pre-emptively at least include build tests for your specific USE flags." And again, this means we can actually have our tooling use these stats to generate build tests for the "known popular" configs (I sketch this idea a little further down).

I point you to RHEL: why are people willing to pay for RHEL? What do they get for that buck? Because I promise you, the support I get from fellow Gentoo'ers FAR outweighs the support I have ever gotten from (paid-for) RHEL. Most of the time.

I myself used to run 500+ Gentoo hosts more than 15 years back. It was fun. I was also a student back then, so had much more time on my hands than I do now. It was challenging, and fun, to try to get things to work exactly the way we envisioned they should. I promise you, if what Michał proposes had been available to me back then, firstly to keep track of my own internal assets and secondly to submit stats upstream to help improve Gentoo, I would not have hesitated for 10 seconds.
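As an aside, here is a very rough sketch of what "have our tooling use these stats to generate build tests" could look like. Everything in it is an assumption for illustration only: the input file name, its columns (package, USE flag combo, number of systems) and the idea of emitting package.use-style lines are all made up.

#!/usr/bin/env python3
# Hypothetical sketch only: take popularity stats (package, USE flag combo,
# number of systems) and emit package.use-style lines for the most popular
# configs so a tinderbox can compile-test exactly those. The input file
# "popular_configs.csv" and its columns are made up for illustration.
import csv
from collections import defaultdict

TOP_N = 3  # how many of the most popular configs to test per package

def load_stats(path):
    by_pkg = defaultdict(list)
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):  # columns: package,use_flags,count
            by_pkg[row["package"]].append((int(row["count"]), row["use_flags"]))
    return by_pkg

def emit_test_configs(by_pkg):
    for pkg, configs in sorted(by_pkg.items()):
        for count, flags in sorted(configs, reverse=True)[:TOP_N]:
            # e.g.  net-misc/asterisk -* a b c   # seen on 1000 systems
            print(f"{pkg} {flags}   # seen on {count} systems")

if __name__ == "__main__":
    emit_test_configs(load_stats("popular_configs.csv"))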
And that brings me to the point I'm trying to make: this should be something that not only helps devs, but also brings benefit to users. I'll say more on this at the end of the email (it may mean users have to run some of their own infra for at least part of this, but these stats could potentially form the framework for a multi-system management system too). First I'd like to pay more attention to the individual points raised by Michał.

On 2020/04/26 10:08, Michał Górny wrote:
> Hi,
>
> The topic of rebooting gentoostats comes here from time to time. Unless
> I'm mistaken, all the efforts so far were superficial, lacking a clear
> plan and unwilling to research the problems. I'd like to start
> a serious discussion focused on the issues we need to solve, and propose
> some ideas how we could solve them.
>
> I can't promise I'll find time to implement it. However, I'd like to
> get a clear plan on how it should be done if someone actually does it.

My time is also limited, but I would love to be involved in some way or another.

> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on popularity of packages, in order to help us prioritize our
> attention and make decisions on what to keep and what to remove. Unlike
> Debian's popcon, I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.
>
> There are a few important questions that need to be answered first:
>
> 1. Which data do we need to collect?
>
> a. list of installed packages?
> b. versions (or slots?) of installed packages?
> c. USE flags on installed packages?
> d. world and world_sets files
> e. system profile?
> f. enabled repositories? (possibly filtered to official list)

All of the above, including exact versions and USE flags for each package (I sketch a possible way of gathering this a little further down). Also, I'm sure there are others, but I sometimes have systems that fall behind on certain packages, either by no longer being included from world or for other reasons (e.g. a specific SLOT that no longer updates for some reason, although this situation has improved).

> g. distribution? /etc/gentoo-release?

Yes, I think so; that partially deals with your "derivative distributions" point. I'd add a few more:

h. date+time of last successful emerge --sync (probably individually for each repository).
i. /var/log/emerge.log
j. hardware data, e.g. amount of RAM, CPU clock speed/cores, disks.
k. hostname + other network info (IP address).

i - build failures might be helpful. It might also be useful to get exact merge times, assuming users want some extra features for their own benefit, not Gentoo dev benefit.

j, k - definitely not of use to devs, but possibly to users as a form of "hardware inventory".

Much of this is definitely not data that we want/need, but if the data gets proxied, then we and our users can use this as a form of inventory management system too.

> I think d. is most important as it gives us information on what users
> really want. a. alone is kinda redundant is we have d. c. might have
> some value when deciding whether to mask a particular flag (and implies
> a.).
>
> e. would be valuable if we wanted to determine the future of particular
> profiles, as well as e.g. estimate the transition to new versions.
>
> f. would be valuable to determine which repositories are used but we
> need to filter private repos from the output for privacy reasons.

I agree with all of this.
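For what it's worth, here is a rough client-side sketch of gathering a.-d., under my assumption that we read the VDB (/var/db/pkg) and the world file directly rather than going through the portage API; the JSON output format is illustrative only.

#!/usr/bin/env python3
# Rough client-side sketch, purely my assumption of one way to do it: gather
# installed packages, their versions and enabled USE flags from the VDB, plus
# the world file. The output format is illustrative only.
import json
import os

VDB = "/var/db/pkg"
WORLD = "/var/lib/portage/world"

def read_words(path):
    try:
        with open(path) as fh:
            return fh.read().split()
    except OSError:
        return []

def installed_packages():
    pkgs = []
    for cat in sorted(os.listdir(VDB)):
        catdir = os.path.join(VDB, cat)
        if not os.path.isdir(catdir):
            continue
        for pf in sorted(os.listdir(catdir)):          # e.g. asterisk-16.9.0
            pkgdir = os.path.join(catdir, pf)
            iuse = {f.lstrip("+-") for f in read_words(os.path.join(pkgdir, "IUSE"))}
            use = set(read_words(os.path.join(pkgdir, "USE"))) & iuse
            pkgs.append({
                "cpv": f"{cat}/{pf}",
                "use": sorted(use),
                "repo": "".join(read_words(os.path.join(pkgdir, "repository"))),
            })
    return pkgs

if __name__ == "__main__":
    print(json.dumps({"packages": installed_packages(),
                      "world": read_words(WORLD)}, indent=2))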
> g. could be valuable in correlation with other data but not sure if
> there's much direct value alone.

Don't think so, but see your own point 2.

> 2. How to handle Gentoo derivatives? Some of them could provide
> meaningful data but some could provide false data (e.g. when derivatives
> override Gentoo packages). One possible option would be to filter a.-e.
> to stuff coming from ::gentoo.

It may be of benefit to know which ::gentoo packages they are using, and if we make the code available to those distributions as a form of proxy/peer, then any hosts that submit directly to Gentoo could be dispatched to that distribution's infra, or, if we're really nice, we could just keep the data and strip out the packages we don't maintain (i.e. not from ::gentoo or other official repositories).

> 3. How to keep the data up-to-date? After all, if we just stack a lot
> of old data, we will soon stop getting meaningful results. I suppose
> we'll need to timestamp all data and remove old entries.

My opinion on this: an automated cron job that dispatches daily, or at least weekly. Daily provides better granularity for some other ideas aimed at system administrators, e.g. "when did what change?". I shove /etc into git for this reason alone, with a nightly cron job to commit everything and push it to a remote server; it also serves as a form of configuration backup.

> 4. How to avoid duplication? If some users submit their results more
> often than others, they would bias the results. 3. might be related.

I think this relates directly to spam, so I fully agree with the UUID-per-installation concept. But then systems get cloned (our labs used to be updated on a single machine, then we used udpcast to image the rest of the systems, so they would all end up with the same UUID). The primary purpose of the UUID is to identify the origin of the installation, but it can be trivially bypassed, either by force-generating a new UUID or by copying one from another machine, so it can be trivially manipulated. I think we need to add a secondary, hardware-based identifier. Digium (now Sangoma) checks the MAC addresses of all ethX interfaces, starting from 0 until the ioctl fails; if eth0 fails, it basically does "ip ad sh" and ends up including the same MAC multiple times, in arbitrary order, since the NICs aren't guaranteed to be detected in the same order on every boot. This (or a related) method could work: generate some unique hardware-based identifier, then hash it using, say, SHA-256 or BLAKE2 to produce something which can't be trivially reversed back to the original identifier (a rough sketch follows below). Why? Well, anonymity :). We could even include the configured or DHCP-obtained hostname in this.

> 5. How to handle clusters? Things are simple if we can assume that
> people will submit data for a few distinct systems. But what about
> companies that run 50 Gentoo machines with the same or similar setup?
> What about clusters of 1000 almost identical containers? Big entities
> could easily bias the results but we should also make it possible for
> them to participate somehow.

Assuming they do what we did, they'd probably (hopefully) all end up with the same (installation-time?) UUID but different hardware identifiers, so we'd be able to identify them. An enterprise idea: report back to those admins (assuming they registered these systems to their profile) that their clusters have discrepancies.
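To make the hardware identifier concrete, a minimal sketch, under the assumption that we enumerate MAC addresses from /sys/class/net (skipping loopback), sort them so detection order doesn't matter, mix in the hostname, and hash the result with BLAKE2:

#!/usr/bin/env python3
# Minimal sketch of the secondary, hardware-based identifier idea above.
# Assumptions: MACs are read from /sys/class/net (loopback skipped), sorted
# so NIC detection order doesn't matter, the hostname is mixed in, and the
# result is hashed with BLAKE2 so the raw MACs can't be trivially recovered.
import hashlib
import os
import socket

SYS_NET = "/sys/class/net"

def hardware_ident():
    macs = set()
    for iface in os.listdir(SYS_NET):
        if iface == "lo":
            continue
        try:
            with open(os.path.join(SYS_NET, iface, "address")) as fh:
                mac = fh.read().strip()
        except OSError:
            continue
        if mac and mac != "00:00:00:00:00:00":
            macs.add(mac)
    material = ",".join(sorted(macs)) + "|" + socket.gethostname()
    return hashlib.blake2b(material.encode(), digest_size=16).hexdigest()

if __name__ == "__main__":
    print(hardware_ident())

Two submissions from the same box collapse to the same hash regardless of NIC detection order, while cloned systems sharing an installation UUID still show up as distinct hardware.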
> 6. Security. We don't want to expose information that could be
> correlated to specific systems, as it could disclose their
> vulnerabilities.

Agreed. But some of this may have particular benefit for system administrators, so perhaps a secondary level of opt-in for providing "potentially sensitive data", i.e. data that becomes a risk if the Gentoo infra gets compromised. We could perhaps store a raw blob for these users that only gets decrypted by a key that only they possess. Or we could proxy the data: let the sensitive stuff travel only to the proxy/aggregator and strip it from what goes higher up, so that those reports are simply generated locally on the proxy/aggregator.

> 7. Privacy. Besides the above, our sysadmins would appreciate if
> the data they submitted couldn't be easily correlated to them. If we
> don't respect privacy of our users, we won't get them to submit data.

I'm happy with a blind UUID + HW-related-hash submission, without any further data, but would really appreciate it if users are willing to register. This would have the following benefits IMHO: they could subscribe to news items that affect them; they could subscribe to receive GLSAs for packages that affect their systems; and they could get a view of all their systems from a central "management" interface.

I have a need to be able to ask the asterisk users on Gentoo what they need/want. As it stands, I'm suffering from "user blindness". Again, I have my own needs, and I scratch those, but helping others to get their needs scratched is a good thing. If you don't want to participate, that's fine, but if you do, you get to reap the benefits.

Towards this end, a further future step may be to enable users to anonymously submit requests via the system, so that we could get feedback from users from whom we'd normally not get any. If the core infra has email addresses for all registered users, it could send out email on behalf of a package maintainer, and feedback could then be submitted via some anonymous mechanism (e.g. a link in the email that takes the user to a submissions page, where we explicitly don't encode per-recipient cookie-style data into the link). An idea.

> 8. Spam protection. Finally, the service needs to be resilient to being
> spammed with fake data. Both to users who want to make their packages
> look more important, and to script kiddies that want to prove a point.

Data only gets included after being kept up to date for a period of at least X days, based on the generated UUID + HW hash. The UUID is (optionally, but ideally) linked to a user profile; the HW hash is just to identify unique systems. Data that doesn't get kept up to date could be filtered out after Y days, where Y <= X (a rough sketch of such a filter follows below). That way a spammer would at least need to keep the spamming effort going for X days for every unique (trivially spoofable) identifier he wants counted. So we don't deny that it can be done; I'm just not sure we care? Other than me, who would benefit from spoofing stats for asterisk, for example? Perhaps someone with a grudge? But they have my email address anyway, so they can do far worse than generate a few spoofed submissions.
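A rough sketch of that X/Y rule, assuming (purely for illustration) that submissions are tracked in an SQLite table keyed on (uuid, hw_hash) with first_seen/last_seen timestamps; the schema and table name are made up:

#!/usr/bin/env python3
# Rough sketch of the X/Y rule above. Assumed schema, for illustration only:
#   systems(uuid TEXT, hw_hash TEXT, first_seen TEXT, last_seen TEXT)
# where first_seen/last_seen are ISO timestamps updated on every submission.
import sqlite3

MIN_AGE_DAYS = 14    # X: identifier must have been submitting for this long
MAX_STALE_DAYS = 14  # Y: dropped if not refreshed within this long (Y <= X)

def counted_systems(db_path):
    # Only identifiers that are both old enough and still fresh get counted.
    con = sqlite3.connect(db_path)
    rows = con.execute(
        """
        SELECT uuid, hw_hash FROM systems
        WHERE julianday('now') - julianday(first_seen) >= :min_age
          AND julianday('now') - julianday(last_seen) <= :max_stale
        """,
        {"min_age": MIN_AGE_DAYS, "max_stale": MAX_STALE_DAYS},
    ).fetchall()
    con.close()
    return rows

def prune_stale(db_path):
    # Identifiers that stop submitting fall out again after Y days.
    con = sqlite3.connect(db_path)
    con.execute(
        "DELETE FROM systems "
        "WHERE julianday('now') - julianday(last_seen) > :max_stale",
        {"max_stale": MAX_STALE_DAYS},
    )
    con.commit()
    con.close()

With Y <= X a spammer has to keep every fake identifier alive for at least X days before it ever counts, which is exactly the cost described above.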
> My (partial) implementation idea
> ================================
> I think our approach should be oriented on privacy/security first,
> and attempt to make the best of the data we can get while respecting
> this principle. This means no correlation and no tracking.

I both agree and disagree. The most basic premise should be no tracking/correlation, unless the user specifically requests it for specific functionality (e.g. emailing of GLSAs/news items that affect them, or a single platform for viewing my hosts and their status).

> Once the tool is installed, the user needs to opt-in to using it. This
> involves accepting a privacy policy and setting up a cronjob. The tool
> would suggest a (random?) time for submission to take place periodically
> (say, every week).

As above, I'd do this as part of accepting a license that states: by accepting this license you accept the most basic submission of stats in an anonymous manner, including only the most basic identifier information needed to identify unique systems.

> The submission would contain only raw data, without any identification
> information. It would be encrypted using our public key. Once
> uploaded, it would be put into our input queue as-is.

Correct. Explicit action should be required to register a UUID to a user profile, if that is even an option. E.g.:

gentoo-stat --link-to j...@iewc.co.za

It would then prompt for my password, which I need to enter in order to link the UUID of the current system to my registered profile. So: completely anonymous, with minimum data, unless specifically configured otherwise.

> Periodically the input queue would be processed in bulk. The individual
> statistics would be updated and the input would be discarded. This
> should prevent people trying to correlate changes in statistics with
> individual uploads.

OK, this makes sense. As a sysadmin, though, I'd like that data to be available for say 30 to 60 or even 90 days, or at least "what changed from submission X to X+1 over that period", because then if something breaks I can ask "when did it break?" and then ask the stats system "what changed on the related systems around that time?". At Gentoo core-infra level we can potentially discard as soon as processed, but depending on the algorithm we may need to keep at least the latest submitted copy for Y days (as defined above). Yes, I can get this by working through /var/log/emerge.log as well, or genlop -l, but I need to do that system by system; if I have an environment of 500 hosts this gets tedious. Or what if I'd like to find what differs between a set of hosts where feature X works and others where it doesn't?

> What do you think? Do you foresee other problems? Do you have other
> needs? Can you think of better solutions?

I think we should build a hierarchy, with Gentoo infra at the top. End users may submit only certain types of data there; all other data, which we as devs don't care about, gets discarded. If we allow users to register there directly, we limit that functionality in order to serve the requirements of the developers here first and foremost.

As such, the submitted package should be based on "data sets" in my opinion, where the most basic sets could be:

core:
a) package list including versions and USE flags
b) world and world_sets
c) uuid
d) hash(hardware ident)

hardware:
a) RAM
b) ...

network:
a) ...

At the Gentoo infra layer we can then have a policy that we ONLY accept "core" sets. Ideally it would be easy, at the proxy/aggregator level, to define your own sets and provide mechanisms to obtain the data (or as plugins on the hosts themselves, e.g. USE="hardware network" gentoo-stats-plugins style), with the main package only containing what the devs need. A rough sketch of how such set declarations could look follows below. Just ideas.
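This is just my assumption of one possible structure; the collectors and the gentoo-stats-plugins split are hypothetical, and the set names match the list above:

#!/usr/bin/env python3
# Rough sketch of the "data sets" idea: each set names a collector; the
# submitting side only includes the sets its upstream accepts, and a
# proxy/aggregator can accept more sets locally than it forwards to Gentoo
# infra. Collectors and the plugin split are hypothetical.
from typing import Callable, Dict

Collector = Callable[[], dict]

SETS: Dict[str, Collector] = {
    "core": lambda: {
        "packages": [],   # cpv + USE flags, as in the earlier sketch
        "world": [],
        "uuid": None,
        "hw_hash": None,
    },
    # These would come from optional plugins, e.g. USE="hardware network".
    "hardware": lambda: {"ram_mb": 0, "disks": []},
    "network": lambda: {"hostname": "", "addresses": []},
}

# Policy of whatever we submit to: Gentoo infra would accept only "core",
# while a local proxy/aggregator could accept "hardware" and "network" too.
ACCEPTED_BY_UPSTREAM = {"core"}

def build_submission(accepted):
    return {name: collect() for name, collect in SETS.items() if name in accepted}

if __name__ == "__main__":
    print(build_submission(ACCEPTED_BY_UPSTREAM))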
Further down the hierarchy, additional sets could be defined, and proxy/aggregator hosts could define what information they allow to travel higher up the hierarchy. If we receive information for a Gentoo derivative, we redirect it to that distribution, although for such a case we really should provide a way for derivatives to specify their own "default" infra.

Other projects can then build on top of, or as plug-ins to, the core stats project to provide the more enterprise-like features. One could potentially even go as far as automated updating driven from a central control server in a networked environment, where the proxy/aggregator is able to connect back to the individual hosts to execute commands on them.

I sincerely hope my ramblings haven't been completely off point. I believe the above shows that this can be of benefit to users and developers alike, and hopefully in a way that does not infringe on users' rights or privacy.

One other option could be for aggregators to submit aggregated stats instead of individual systems; the same X and Y rules would apply, however, I think for aggregated submissions the data-skew risk becomes even larger. So perhaps we should provide two sets of stats, one excluding aggregated stats and one including them, or possibly we could mark some aggregators as trusted. I dunno.

Kind Regards,
Jaco