On Tue, May 12, 2020 at 10:00:20AM +0300, Andreas Gustafsson wrote:
> Adding more sources could mean reintroducing some timing based
> sources after careful analysis, but also things like having the
> installer install an initial random seed on the target machine (and
> if the installer itself lacks entropy, asking the poor user to pound
> on the keyboard until it does).
As Peter Gutmann has noted several times in the past, for most use cases you don't have to have input you know is tied to physical random processes; you just have to have input that you know is uneconomical for your adversary to predict or recover. This is why I fed the first N packets off the network into the RNG; why I added sampling of kernel printf output (and more importantly, its *timing*); etc. But the problem with this kind of stuff is that there really are use cases where an adversary _can_ know these things, so it is very hard to support an argument that _in the general case_ they should be used to satisfy some criterion that N unpredictable, unrecoverable (note I do *not* say "random" here!) bits have been fed into the machinery. The data I fed in from the VM system are not quite the same, but are in a somewhat similar situation.

That said, I also added a number of sources which we *do* know are tied to real random physical processes: the environmental sensors such as temperature, fan speed, and voltage, where beyond the sampling noise you've got thermal processes on both micro and macro scales, turbulence, etc.; and the "skew" source type which, in theory, represents skew between multiple oscillators in the system, one of the hybrid analog-digital RNG designs with a long pedigree (though as implemented in the example "callout" source, less so). Finally, there's a source type I *didn't* take advantage of because I was advised doing so would cause substantial power consumption: amplifier noise available by setting muted audio inputs to max gain (we can also use the sample arrival rate here as a skew source).

I believe we can and should legitimately record entropy when we add input of these kinds. But there are three problems with all this. *Problems are marked out with numbers, thoughts towards solutions or mitigations with letters.*

1) It's hard to understand how many bits of entropy to assign to a sample from one of these sources. How much of the change in fan speed is caused by system load (and thus highly correlated with CPU temperature), and how much by turbulence, which we believe is random? How much of the signal measured from amplifier noise on a muted input is caused by the bus clock (and clocks derived from it, etc.) and how much is genuine thermal noise from the amplifier? And so forth. The delta estimator _was_ good for these things, particularly for things like fans or thermistors (where the macroscopic, non-random physical processes _are_ expected to have continuous behavior), because it could tell you when to very conservatively add 1 bit -- if you believe that at least 1 bit of each 32-bit value from the input really is attributable to entropy. I also prototyped an lzf-based entropy estimator, but it never really seemed worth the trouble -- it is, though, consistent with how the published analysis of physical sources often estimates minimum entropy. (Rough sketches of both are in the P.S. at the end of this message.)

A) This is a long-winded way of saying I firmly believe we should count input from these kinds of sources towards our "full entropy" threshold, but we need to think harder about how.

2) Sources of the kind I'm talking about here seldom contribute _much_ entropy -- with the old estimator, perhaps 1 bit per change -- so if you need to get 256 bits from them you may be waiting quite some time (the audio-amp sources might be different, which is another reason why, despite their issues, they are appealing).
3) Older or smaller systems don't have any of this stuff onboard, so it does them no good: no fan speed sensors (or no drivers for them), no temp sensors, no access to power rail voltages, certainly no audio, etc.

B) One thing we *could* do to help out such systems would be to actually run a service to bootstrap them with entropy ourselves, from the installer, across the network. Should a user trust such a service? I will argue "yes". Why?

B1) Because they already got the binaries or the sources from us; we could simply tamper with those to do the wrong thing instead. Counterargument: it's impossible to distinguish the output of a cryptographically strong stream cipher keyed with something known to us from real random data, so it's harder to _tell_ if we subverted you. Counter-counter-argument: when's the last time you looked? Users who _do_ really inspect the sources and binaries they get from us can always not use our entropy server, or run their own.

B2) Because we have already arranged to mix in a whole pile of stuff whose entropy is hard to estimate but which would be awfully hard for us, the OS developers, to predict or recover with respect to an arbitrary system being installed (all the sources we used to count but now don't, plus the ones we never counted). If you trust the core kernel RNG mixing machinery, you should have some level of confidence this protects you against subversion by an entropy server that initially gets the ball rolling.

B3) Because we can easily arrange for you to mix the input we give you with an additional secret we don't and can't know, which you may make as strong as you like: we can prompt you at install time to enter a passphrase, and use that to encrypt the entropy we serve you, using a strong cipher, before using it to initially seed the RNG (again, see the P.S. below for a sketch).

So, those are the problems I see and some potential solutions: figure out how to better estimate the entropy of the environmental sources we have available, and count such estimates by default; consider using audio sources by default; and run an entropy server to seed systems from the installer.

What do others think?

Thor
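
P.S. Since I referred to them above, here is roughly the kind of delta check I mean. This is a simplified sketch, not the actual rnd(9) code: credit at most 1 bit per sample, and only when the value is neither constant nor changing at a constant rate.

#include <stdint.h>

struct delta_est {
    uint32_t last;          /* previous sample */
    uint32_t last_delta;    /* previous first difference */
};

/* Return the number of bits of entropy to credit for this sample. */
static unsigned
delta_estimate(struct delta_est *e, uint32_t sample)
{
    uint32_t delta = sample - e->last;          /* first difference */
    uint32_t delta2 = delta - e->last_delta;    /* second difference */

    e->last = sample;
    e->last_delta = delta;

    /*
     * A fan that is perfectly steady, or ramping linearly with load,
     * earns nothing; only "surprising" motion earns the single, very
     * conservative bit.
     */
    return (delta != 0 && delta2 != 0) ? 1 : 0;
}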
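
The compression-based idea is in the same spirit (again, just a sketch -- the prototype was not exactly this, and I'm going from memory on the liblzf interface): buffer up raw samples, try to compress them, and treat the compressed size, scaled down by a large safety factor, as an upper bound on what we might credit.

#include <stdint.h>
#include <lzf.h>    /* liblzf */

#define EST_BUFSZ   512

/*
 * Very conservatively estimate the entropy, in bits, of a buffer of
 * raw samples.  Data that compresses well is highly redundant and
 * earns correspondingly little.
 */
static unsigned
compression_estimate(const uint8_t *samples, unsigned len)
{
    uint8_t out[EST_BUFSZ];
    unsigned clen;

    if (len > EST_BUFSZ)
        len = EST_BUFSZ;
    clen = lzf_compress(samples, len, out, sizeof(out));
    if (clen == 0 || clen > len)
        clen = len;     /* did not compress (or did not fit) */

    /* Compressed size in bits, divided by an arbitrary safety factor. */
    return (clen * 8) / 32;
}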
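
Finally, to make B3 a little more concrete, here is the shape of the client-side mixing I have in mind. Purely illustrative: the file name is a stand-in for whatever the entropy server handed us, the real thing would go through the installer and the usual seed-file machinery, and rather than literally encrypting the served seed under the passphrase it simply stirs both into the pool -- which, if you trust the pool's mixing (see B2), gets you the same property, since writes to /dev/urandom are mixed in without crediting any entropy.

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char seed[512], phrase[256];
    ssize_t n;
    int fd;

    /* "seed.bin" stands in for whatever the entropy server sent us. */
    if ((fd = open("seed.bin", O_RDONLY)) == -1)
        err(1, "seed.bin");
    if ((n = read(fd, seed, sizeof(seed))) <= 0)
        err(1, "read");
    (void)close(fd);

    /* The additional secret we, the OS developers, don't and can't know. */
    printf("Enter a passphrase: ");
    (void)fflush(stdout);
    if (fgets(phrase, sizeof(phrase), stdin) == NULL)
        errx(1, "no passphrase");

    /* Stir both into the pool; a plain write(2) credits no entropy. */
    if ((fd = open("/dev/urandom", O_WRONLY)) == -1)
        err(1, "/dev/urandom");
    if (write(fd, seed, (size_t)n) == -1 ||
        write(fd, phrase, strlen(phrase)) == -1)
        err(1, "write");
    (void)close(fd);

    return 0;
}

A real version would obviously also zero the passphrase, talk to the server directly, and so on; the point is only that the mixing is cheap and entirely under the user's control.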