On Thu, May 10, 2018 at 09:27:50PM -0500, Samuel Holland wrote:
> The Allwinner A64 SoC is known [1] to have an unstable architectural
> timer, which manifests itself most obviously in the time jumping forward
> a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a
> timer frequency of 24 MHz, implying that the time went slightly backward
> (and this was interpreted by the kernel as it jumping forward and
> wrapping around past the epoch).
>
> Further investigation revealed instability in the low bits of CNTVCT at
> the point a high bit rolls over. This leads to power-of-two cycle
> forward and backward jumps. (Testing shows that forward jumps are about
> twice as likely as backward jumps.)
>
> Without trapping reads to CNTVCT, a userspace program is able to read it
> in a loop faster than it changes. A test program running on all 4 CPU
> cores that reported jumps larger than 100 ms was run for 13.6 hours and
> reported the following:
>
>  Count | Event
> -------+---------------------------
>   9940 | jumped backward      699ms
>    268 | jumped backward     1398ms
>      1 | jumped backward     2097ms
>  16020 | jumped forward       175ms
>   6443 | jumped forward       699ms
>   2976 | jumped forward      1398ms
>      9 | jumped forward    356516ms
>      9 | jumped forward    357215ms
>      4 | jumped forward    714430ms
>      1 | jumped forward   3578440ms
>
> This works out to a jump larger than 100 ms about every 5.5 seconds on
> each CPU core.
>
> The largest jump (almost an hour!) was the following sequence of reads:
>       0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000
>
> Note that the middle bits don't necessarily all read as all zeroes or
> all ones during the anomalous behavior; however the low 11 bits checked
> by the function in this patch have never been observed with any other
> value.
>
> Also note that smaller jumps are much more common, with the smallest
> backward jumps of 2048 cycles observed over 400 times per second on each
> core. (Of course, this is partially due to lower bits rolling over more
> frequently.) Any one of these could have caused the 95 year time skip.
>
> Similar anomalies were observed while reading CNTPCT (after patching the
> kernel to allow reads from userspace). However, the jumps are much less
> frequent, and only small jumps were observed. The same program as before
> (except now reading CNTPCT) observed after 72 hours:
>
>  Count | Event
> -------+---------------------------
>     17 | jumped backward      699ms
>     52 | jumped forward       175ms
>   2831 | jumped forward       699ms
>      5 | jumped forward      1398ms
>
> ========================================================================
>
> Because the CPU can read the CNTPCT/CNTVCT registers faster than they
> change, performing two reads of the register and comparing the high bits
> (like other workarounds) is not a workable solution. And because the
> timer can jump both forward and backward, no pair of reads can
> distinguish a good value from a bad one. The only way to guarantee a
> good value from consecutive reads would be to read _three_ times, and
> take the middle value iff the three values are 1) individually unique
> and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if
> an anomaly is detected.
>
> However, since there is a distinct pattern to the bad values, we can
> optimize the common case (2046/2048 of the time) to a single read by
> simply ignoring values that match the pattern. This still takes no more
> than 3 cycles in the worst case, and requires much less code.
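
For readers following along, here is a rough sketch of the "ignore values
that match the pattern" idea described above. It is not the code from the
patch. It assumes the pattern is "low 11 bits all zeroes or all ones", as
the commit log suggests. The names looks_unstable(), read_stable(),
struct fake_counter and fake_read(), and the retry bound, are hypothetical
and exist only so the example can run in userspace against a replayed
sequence of values instead of the real CNTVCT register.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Low 11 bits, per the commit log (assumption: this is the whole pattern). */
#define LOW_BITS_MASK 0x7ffULL

/* True when the low 11 bits of val are all zeroes or all ones. */
static bool looks_unstable(uint64_t val)
{
	/* Adding 1 turns all-ones into 0 and all-zeroes into 1 under the mask. */
	return ((val + 1) & LOW_BITS_MASK) <= 1;
}

/* Hypothetical stand-in for the counter register: replays fixed samples. */
struct fake_counter {
	const uint64_t *samples;
	size_t len;
	size_t pos;
};

static uint64_t fake_read(struct fake_counter *c)
{
	uint64_t v = c->samples[c->pos];

	/* Stay on the last sample once the sequence is exhausted. */
	if (c->pos + 1 < c->len)
		c->pos++;
	return v;
}

/*
 * Re-read until the value does not match the suspicious pattern.
 * max_retries is an assumption added here so the loop always terminates
 * on a fake source; it is not taken from the commit log.
 */
static uint64_t read_stable(struct fake_counter *c, int max_retries)
{
	uint64_t val;

	do {
		val = fake_read(c);
	} while (looks_unstable(val) && max_retries-- > 0);

	return val;
}
```

Feeding this the sequence from the commit log (0x0000007fffffffff,
0x00000093feffffff, 0x0000008000000000) followed by 0x0000008000000001
skips the first three values, since each has its low 11 bits all ones or
all zeroes, and returns the fourth. Only 2 of the 2048 possible low-bit
patterns are rejected, which matches the 2046/2048 single-read figure
quoted above.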

That's an awesome commit log, thanks!

For both patches:
Acked-by: Maxime Ripard <maxime.rip...@bootlin.com>

> [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
> [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/

Sigh. So armbian knew about this for more than a year and had a fix,
and didn't judge it necessary to report it anywhere.

That's some solid, responsible development right there...

Maxime

-- 
Maxime Ripard, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com