Hi Michał, and the rest of the Gentoo devs,

I've been patiently sitting and watching this discussion.
I raised some ideas with another developer (not Michał) just days before he raised this thread to the ML. I believe all the points raised so far are valid; I'll try to summarise:

1. This must be completely *opt in*.
2. Anonymity was discussed by various parties (privacy).
3. "Spam" protection (i.e. preventing bogus data from entering).
4. Trustworthiness of data.
5. Acceptance of some form of privacy policy.

In my opinion, points 2 and 3 work against each other: if registration is compulsory for submitting stats, we can control spam more easily (not foolproof), but requiring registration also raises the entry barrier. I'd be completely willing to provide at least an email address as part of a submission.

All of the replies seem to have focused purely on yes/no, do it or don't. Not many have addressed the benefits to end users/system administrators; the focus seems to be on what we as developers can get out of this.

Regarding the above points:

1. I fully agree. This should not be forced on anyone.
2. Happy to concede that some people may wish to submit anonymously. Let them.
3. I'll address this below.
4. A lot of the discussion has been around the usefulness of the data, and I concede to Thomas that this may (or may not) generate "decision blind spots" or, as he put it, "artificially increase decision certainty". I don't see how this is worse than what we've got now.
5. We have the infrastructure for this already by way of licenses. So we ship with "GPLv2/3/whatever + GentooPrivacy", and users have to first take explicit action to accept GentooPrivacy.

I have some other ideas around this, which will tread even further on privacy, but again, all of this should be a kind of opt-in. Building on the ideas from Kent, who suggested a form of submission proxy (STATS_SERVER), we could potentially give the full benefit of the code to such entities, but then still allow them to submit "upstream" in a more filtered manner.

Bottom line, in my opinion: any data is better than no data! Whilst we can't say "no one is using xyz", we will at least be able to say "hey, some people are using xyz", and whilst this may generate some blind spots, it at least enables us to test known use cases during test builds. E.g. if we know for a fact that a thousand users are using package X with USE flags "-* a b c", we should definitely run that as a compile test.

Your build breaks frequently? Would you mind submitting stats? Great, thank you. If you're not willing to do that, then my stance becomes one of "OK, I'll help where I can, but really, please consider letting us help you: if you submit stats we can pre-emptively at least include build tests for your specific USE flags." And again, this means we can actually have our tooling use these stats to generate build tests for the "known popular" configs (I sketch this idea a little further down).

I point you to RHEL: why are people willing to pay for RHEL? What do they get for that buck? Because I promise you, the support I get from fellow Gentoo'ers FAR outweighs the support I have ever gotten from (paid-for) RHEL. Most of the time.

I myself used to run 500+ Gentoo hosts more than 15 years back. It was fun. I was also a student back then, so had much more time on my hands than I do now. It was challenging, and fun, to try to get things to work exactly the way we envisioned they should. I promise you, if what Michał proposes had been available to me back then, firstly to keep track of my own internal assets and secondly to submit stats upstream to help improve Gentoo, I would not have hesitated for 10 seconds.
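As an aside, here is a very rough sketch of what "have our tooling use these stats to generate build tests" could look like. Everything in it is an assumption for illustration only: the input file name, its columns (package, USE flag combo, number of systems) and the idea of emitting package.use-style lines are all made up.

#!/usr/bin/env python3
# Hypothetical sketch only: take popularity stats (package, USE flag combo,
# number of systems) and emit package.use-style lines for the most popular
# configs so a tinderbox can compile-test exactly those. The input file
# "popular_configs.csv" and its columns are made up for illustration.
import csv
from collections import defaultdict

TOP_N = 3  # how many of the most popular configs to test per package

def load_stats(path):
    by_pkg = defaultdict(list)
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):  # columns: package,use_flags,count
            by_pkg[row["package"]].append((int(row["count"]), row["use_flags"]))
    return by_pkg

def emit_test_configs(by_pkg):
    for pkg, configs in sorted(by_pkg.items()):
        for count, flags in sorted(configs, reverse=True)[:TOP_N]:
            # e.g.  net-misc/asterisk -* a b c   # seen on 1000 systems
            print(f"{pkg} {flags}   # seen on {count} systems")

if __name__ == "__main__":
    emit_test_configs(load_stats("popular_configs.csv"))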
And that brings me to the point I'm trying to make: this should be something that not only helps devs, but also brings benefit to users. I'll say more on this at the end of the email (it may mean users have to run some of their own infra for at least part of this, but these stats could potentially form the framework for a multi-system management system too). First I'd like to pay more attention to the individual points raised by Michał.

On 2020/04/26 10:08, Michał Górny wrote:
> Hi,
>
> The topic of rebooting gentoostats comes here from time to time. Unless
> I'm mistaken, all the efforts so far were superficial, lacking a clear
> plan and unwilling to research the problems. I'd like to start
> a serious discussion focused on the issues we need to solve, and propose
> some ideas how we could solve them.
>
> I can't promise I'll find time to implement it. However, I'd like to
> get a clear plan on how it should be done if someone actually does it.

My time is also limited, but I would love to be involved in some way or another.

> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on popularity of packages, in order to help us prioritize our
> attention and make decisions on what to keep and what to remove. Unlike
> Debian's popcon, I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.
>
> There are a few important questions that need to be answered first:
>
> 1. Which data do we need to collect?
>
> a. list of installed packages?
> b. versions (or slots?) of installed packages?
> c. USE flags on installed packages?
> d. world and world_sets files
> e. system profile?
> f. enabled repositories? (possibly filtered to official list)

All of the above, including exact versions and USE flags for each package (I sketch a possible way of gathering this a little further down). Also, I'm sure there are others, but I sometimes have systems that fall behind on certain packages, either by no longer being included from world or for other reasons (e.g. a specific SLOT that no longer updates for some reason, although this situation has improved).

> g. distribution? /etc/gentoo-release?

Yes, I think so; that partially deals with your "derivative distributions" point. I'd add a few more:

h. date+time of last successful emerge --sync (probably individually for each repository).
i. /var/log/emerge.log
j. hardware data, e.g. amount of RAM, CPU clock speed/cores, disks.
k. hostname + other network info (IP address).

i - build failures might be helpful. It might also be useful to get exact merge times, assuming users want some extra features for their own benefit, not Gentoo dev benefit.

j, k - definitely not of use to devs, but possibly to users as a form of "hardware inventory".

Much of this is definitely not data that we want/need, but if the data gets proxied, then we and our users can use this as a form of inventory management system too.

> I think d. is most important as it gives us information on what users
> really want. a. alone is kinda redundant is we have d. c. might have
> some value when deciding whether to mask a particular flag (and implies
> a.).
>
> e. would be valuable if we wanted to determine the future of particular
> profiles, as well as e.g. estimate the transition to new versions.
>
> f. would be valuable to determine which repositories are used but we
> need to filter private repos from the output for privacy reasons.

I agree with all of this.
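For what it's worth, here is a rough client-side sketch of gathering a.-d., under my assumption that we read the VDB (/var/db/pkg) and the world file directly rather than going through the portage API; the JSON output format is illustrative only.

#!/usr/bin/env python3
# Rough client-side sketch, purely my assumption of one way to do it: gather
# installed packages, their versions and enabled USE flags from the VDB, plus
# the world file. The output format is illustrative only.
import json
import os

VDB = "/var/db/pkg"
WORLD = "/var/lib/portage/world"

def read_words(path):
    try:
        with open(path) as fh:
            return fh.read().split()
    except OSError:
        return []

def installed_packages():
    pkgs = []
    for cat in sorted(os.listdir(VDB)):
        catdir = os.path.join(VDB, cat)
        if not os.path.isdir(catdir):
            continue
        for pf in sorted(os.listdir(catdir)):          # e.g. asterisk-16.9.0
            pkgdir = os.path.join(catdir, pf)
            iuse = {f.lstrip("+-") for f in read_words(os.path.join(pkgdir, "IUSE"))}
            use = set(read_words(os.path.join(pkgdir, "USE"))) & iuse
            pkgs.append({
                "cpv": f"{cat}/{pf}",
                "use": sorted(use),
                "repo": "".join(read_words(os.path.join(pkgdir, "repository"))),
            })
    return pkgs

if __name__ == "__main__":
    print(json.dumps({"packages": installed_packages(),
                      "world": read_words(WORLD)}, indent=2))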
> g. could be valuable in correlation with other data but not sure if
> there's much direct value alone.

Don't think so, but see your own point 2.

> 2. How to handle Gentoo derivatives? Some of them could provide
> meaningful data but some could provide false data (e.g. when derivatives
> override Gentoo packages). One possible option would be to filter a.-e.
> to stuff coming from ::gentoo.

It may be of benefit to know which ::gentoo packages they are using, and if we make the code available to those distributions as a form of proxy/peer, then any hosts that submit directly to Gentoo could be dispatched to that distribution's infra, or, if we're really nice, we could just keep the data and strip out the packages we don't maintain (i.e. not from ::gentoo or other official repositories).

> 3. How to keep the data up-to-date? After all, if we just stack a lot
> of old data, we will soon stop getting meaningful results. I suppose
> we'll need to timestamp all data and remove old entries.

My opinion on this: an automated cron job that dispatches daily, or at least weekly. Daily provides better granularity for some other ideas aimed at system administrators, e.g. "when did what change?". I shove /etc into git for this reason alone, with a nightly cron job to commit everything and push it to a remote server; it also serves as a form of configuration backup.

> 4. How to avoid duplication? If some users submit their results more
> often than others, they would bias the results. 3. might be related.

I think this relates directly to spam, so I fully agree with the UUID-per-installation concept. But then systems get cloned (our labs used to be updated on a single machine, then we used udpcast to image the rest of the systems, so they would all end up with the same UUID). The primary purpose of the UUID is to identify the origin of the installation, but it can be trivially bypassed, either by force-generating a new UUID or by copying one from another machine, so it can be trivially manipulated. I think we need to add a secondary, hardware-based identifier. Digium (now Sangoma) checks the MAC addresses of all ethX interfaces, starting from 0 until the ioctl fails; if eth0 fails, it basically does "ip ad sh" and ends up including the same MAC multiple times, in arbitrary order, since the NICs aren't guaranteed to be detected in the same order on every boot. This (or a related) method could work: generate some unique hardware-based identifier, then hash it using, say, SHA-256 or BLAKE2 to produce something which can't be trivially reversed back to the original identifier (a rough sketch follows below). Why? Well, anonymity :). We could even include the configured or DHCP-obtained hostname in this.

> 5. How to handle clusters? Things are simple if we can assume that
> people will submit data for a few distinct systems. But what about
> companies that run 50 Gentoo machines with the same or similar setup?
> What about clusters of 1000 almost identical containers? Big entities
> could easily bias the results but we should also make it possible for
> them to participate somehow.

Assuming they do what we did, they'd probably (hopefully) all end up with the same (installation-time?) UUID but different hardware identifiers, so we'd be able to identify them. An enterprise idea: report back to those admins (assuming they registered these systems to their profile) that their clusters have discrepancies.
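To make the hardware identifier concrete, a minimal sketch, under the assumption that we enumerate MAC addresses from /sys/class/net (skipping loopback), sort them so detection order doesn't matter, mix in the hostname, and hash the result with BLAKE2:

#!/usr/bin/env python3
# Minimal sketch of the secondary, hardware-based identifier idea above.
# Assumptions: MACs are read from /sys/class/net (loopback skipped), sorted
# so NIC detection order doesn't matter, the hostname is mixed in, and the
# result is hashed with BLAKE2 so the raw MACs can't be trivially recovered.
import hashlib
import os
import socket

SYS_NET = "/sys/class/net"

def hardware_ident():
    macs = set()
    for iface in os.listdir(SYS_NET):
        if iface == "lo":
            continue
        try:
            with open(os.path.join(SYS_NET, iface, "address")) as fh:
                mac = fh.read().strip()
        except OSError:
            continue
        if mac and mac != "00:00:00:00:00:00":
            macs.add(mac)
    material = ",".join(sorted(macs)) + "|" + socket.gethostname()
    return hashlib.blake2b(material.encode(), digest_size=16).hexdigest()

if __name__ == "__main__":
    print(hardware_ident())

Two submissions from the same box collapse to the same hash regardless of NIC detection order, while cloned systems sharing an installation UUID still show up as distinct hardware.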
> 6. Security. We don't want to expose information that could be
> correlated to specific systems, as it could disclose their
> vulnerabilities.

Agreed. But some of this may have particular benefit for system administrators, so perhaps a secondary level of opt-in for providing "potentially sensitive data", i.e. data that becomes a risk if the Gentoo infra gets compromised. We could perhaps store a raw blob for these users that only gets decrypted by a key that only they possess. Or we could proxy the data: let the sensitive stuff travel only to the proxy/aggregator and strip it from what goes higher up, so that those reports are simply generated locally on the proxy/aggregator.

> 7. Privacy. Besides the above, our sysadmins would appreciate if
> the data they submitted couldn't be easily correlated to them. If we
> don't respect privacy of our users, we won't get them to submit data.

I'm happy with a blind UUID + HW-related-hash submission, without any further data, but would really appreciate it if users are willing to register. This would have the following benefits IMHO: they could subscribe to news items that affect them; they could subscribe to receive GLSAs for packages that affect their systems; and they could get a view of all their systems from a central "management" interface.

I have a need to be able to ask the asterisk users on Gentoo what they need/want. As it stands, I'm suffering from "user blindness". Again, I have my own needs, and I scratch those, but helping others to get their needs scratched is a good thing. If you don't want to participate, that's fine, but if you do, you get to reap the benefits.

Towards this end, a further future step may be to enable users to anonymously submit requests via the system, so that we could get feedback from users from whom we'd normally not get any. If the core infra has email addresses for all registered users, it could send out email on behalf of a package maintainer, and feedback could then be submitted via some anonymous mechanism (e.g. a link in the email that takes the user to a submissions page, where we explicitly don't encode per-recipient cookie-style data into the link). An idea.

> 8. Spam protection. Finally, the service needs to be resilient to being
> spammed with fake data. Both to users who want to make their packages
> look more important, and to script kiddies that want to prove a point.

Data only gets included after being kept up to date for a period of at least X days, based on the generated UUID + HW hash. The UUID is (optionally, but ideally) linked to a user profile; the HW hash is just to identify unique systems. Data that doesn't get kept up to date could be filtered out after Y days, where Y <= X (a rough sketch of such a filter follows below). That way a spammer would at least need to keep the spamming effort going for X days for every unique (trivially spoofable) identifier he wants counted. So we don't deny that it can be done; I'm just not sure we care? Other than me, who would benefit from spoofing stats for asterisk, for example? Perhaps someone with a grudge? But they have my email address anyway, so they can do far worse than generate a few spoofed submissions.
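A rough sketch of that X/Y rule, assuming (purely for illustration) that submissions are tracked in an SQLite table keyed on (uuid, hw_hash) with first_seen/last_seen timestamps; the schema and table name are made up:

#!/usr/bin/env python3
# Rough sketch of the X/Y rule above. Assumed schema, for illustration only:
#   systems(uuid TEXT, hw_hash TEXT, first_seen TEXT, last_seen TEXT)
# where first_seen/last_seen are ISO timestamps updated on every submission.
import sqlite3

MIN_AGE_DAYS = 14    # X: identifier must have been submitting for this long
MAX_STALE_DAYS = 14  # Y: dropped if not refreshed within this long (Y <= X)

def counted_systems(db_path):
    # Only identifiers that are both old enough and still fresh get counted.
    con = sqlite3.connect(db_path)
    rows = con.execute(
        """
        SELECT uuid, hw_hash FROM systems
        WHERE julianday('now') - julianday(first_seen) >= :min_age
          AND julianday('now') - julianday(last_seen) <= :max_stale
        """,
        {"min_age": MIN_AGE_DAYS, "max_stale": MAX_STALE_DAYS},
    ).fetchall()
    con.close()
    return rows

def prune_stale(db_path):
    # Identifiers that stop submitting fall out again after Y days.
    con = sqlite3.connect(db_path)
    con.execute(
        "DELETE FROM systems "
        "WHERE julianday('now') - julianday(last_seen) > :max_stale",
        {"max_stale": MAX_STALE_DAYS},
    )
    con.commit()
    con.close()

With Y <= X a spammer has to keep every fake identifier alive for at least X days before it ever counts, which is exactly the cost described above.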
> My (partial) implementation idea
> ================================
> I think our approach should be oriented on privacy/security first,
> and attempt to make the best of the data we can get while respecting
> this principle. This means no correlation and no tracking.

I both agree and disagree. The most basic premise should be no tracking/correlation, unless the user specifically requests it for specific functionality (e.g. emailing of GLSAs/news items that affect them, or a single platform for viewing my hosts and their status).

> Once the tool is installed, the user needs to opt-in to using it. This
> involves accepting a privacy policy and setting up a cronjob. The tool
> would suggest a (random?) time for submission to take place periodically
> (say, every week).

As above, I'd do this as part of accepting a license that states: by accepting this license you accept the most basic submission of stats in an anonymous manner, including only the most basic identifier information needed to identify unique systems.

> The submission would contain only raw data, without any identification
> information. It would be encrypted using our public key. Once
> uploaded, it would be put into our input queue as-is.

Correct. Explicit action should be required to register a UUID to a user profile, if that is even an option. E.g.:

gentoo-stat --link-to j...@iewc.co.za

It would then prompt for my password, which I need to enter in order to link the UUID of the current system to my registered profile. So: completely anonymous, with minimum data, unless specifically configured otherwise.

> Periodically the input queue would be processed in bulk. The individual
> statistics would be updated and the input would be discarded. This
> should prevent people trying to correlate changes in statistics with
> individual uploads.

OK, this makes sense. As a sysadmin, though, I'd like that data to be available for say 30 to 60 or even 90 days, or at least "what changed from submission X to X+1 over that period", because then if something breaks I can ask "when did it break?" and then ask the stats system "what changed on the related systems around that time?". At Gentoo core-infra level we can potentially discard as soon as processed, but depending on the algorithm we may need to keep at least the latest submitted copy for Y days (as defined above). Yes, I can get this by working through /var/log/emerge.log as well, or genlop -l, but I need to do that system by system; if I have an environment of 500 hosts this gets tedious. Or what if I'd like to find what differs between a set of hosts where feature X works and others where it doesn't?

> What do you think? Do you foresee other problems? Do you have other
> needs? Can you think of better solutions?

I think we should build a hierarchy, with Gentoo infra at the top. End users may submit only certain types of data there; all other data, which we as devs don't care about, gets discarded. If we allow users to register there directly, we limit that functionality in order to serve the requirements of the developers here first and foremost.

As such, the submitted package should be based on "data sets" in my opinion, where the most basic sets could be:

core:
a) package list including versions and USE flags
b) world and world_sets
c) uuid
d) hash(hardware ident)

hardware:
a) RAM
b) ...

network:
a) ...

At the Gentoo infra layer we can then have a policy that we ONLY accept "core" sets. Ideally it would be easy, at the proxy/aggregator level, to define your own sets and provide mechanisms to obtain the data (or as plugins on the hosts themselves, e.g. USE="hardware network" gentoo-stats-plugins style), with the main package only containing what the devs need. A rough sketch of how such set declarations could look follows below. Just ideas.
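This is just my assumption of one possible structure; the collectors and the gentoo-stats-plugins split are hypothetical, and the set names match the list above:

#!/usr/bin/env python3
# Rough sketch of the "data sets" idea: each set names a collector; the
# submitting side only includes the sets its upstream accepts, and a
# proxy/aggregator can accept more sets locally than it forwards to Gentoo
# infra. Collectors and the plugin split are hypothetical.
from typing import Callable, Dict

Collector = Callable[[], dict]

SETS: Dict[str, Collector] = {
    "core": lambda: {
        "packages": [],   # cpv + USE flags, as in the earlier sketch
        "world": [],
        "uuid": None,
        "hw_hash": None,
    },
    # These would come from optional plugins, e.g. USE="hardware network".
    "hardware": lambda: {"ram_mb": 0, "disks": []},
    "network": lambda: {"hostname": "", "addresses": []},
}

# Policy of whatever we submit to: Gentoo infra would accept only "core",
# while a local proxy/aggregator could accept "hardware" and "network" too.
ACCEPTED_BY_UPSTREAM = {"core"}

def build_submission(accepted):
    return {name: collect() for name, collect in SETS.items() if name in accepted}

if __name__ == "__main__":
    print(build_submission(ACCEPTED_BY_UPSTREAM))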
Further down the hierarchy, additional sets could be defined, and proxy/aggregator hosts could define what information they allow to travel higher up the hierarchy. If we receive information for a Gentoo derivative, we redirect it to that distribution, although for such a case we really should provide a way for derivatives to specify their own "default" infra.

Other projects can then build on top of, or as plug-ins to, the core stats project to provide the more enterprise-like features. One could potentially even go as far as automated updating driven from a central control server in a networked environment, where the proxy/aggregator is able to connect back to the individual hosts to execute commands on them.

I sincerely hope my ramblings haven't been completely off point. I believe the above shows that this can be of benefit to users and developers alike, and hopefully in a way that does not infringe on users' rights or privacy.

One other option could be for aggregators to submit aggregated stats instead of individual systems; the same X and Y rules would apply, however, I think for aggregated submissions the data-skew risk becomes even larger. So perhaps we should provide two sets of stats, one excluding aggregated stats and one including them, or possibly we could mark some aggregators as trusted. I dunno.

Kind Regards,
Jaco