Added some CCs.
On Sunday, 9 of December 2007, Michael Tokarev wrote: > This isn't a new issue, but so far no solution(s) > has been found, it seems. And with kernel development > going on, the issue becomes worse. > > Up to 2.6.20 or so (I don't remember exactly, but if > I recall correctly the issue first appeared when new > timer code has been merged), there was no issues at > all. > > Starting with 2.6.21 (at least), I see frequent messages > from chronyd (http://chrony.sunsite.dk/), which was running > on this machine since the day one, messages telling me > that system clock has been reset - up to +/- several > seconds. > > On 2.6.22 there were only a few such messages per day, > and the time delta was less than a second - in between > 0.2-0.8 second usually. On 2.6.23, it become much, > much worse. Here's a typical log from today: > > ... > Dec 9 08:10:50 paltus chronyd[1761]: System clock wrong by 0.664195 seconds, > adjustment started > Dec 9 08:31:06 paltus chronyd[1761]: System clock wrong by 0.666263 seconds, > adjustment started > Dec 9 09:10:35 paltus chronyd[1761]: System clock wrong by 0.830534 seconds, > adjustment started > Dec 9 09:47:56 paltus chronyd[1761]: System clock wrong by -0.632430 > seconds, adjustment started > Dec 9 10:11:24 paltus chronyd[1761]: System clock wrong by 0.557093 seconds, > adjustment started > Dec 9 10:30:36 paltus chronyd[1761]: System clock wrong by 0.931442 seconds, > adjustment started > Dec 9 10:31:40 paltus chronyd[1761]: System clock wrong by 0.918518 seconds, > adjustment started > Dec 9 10:32:44 paltus chronyd[1761]: System clock wrong by 0.919524 seconds, > adjustment started > Dec 9 10:33:48 paltus chronyd[1761]: System clock wrong by -1.783922 > seconds, adjustment started > Dec 9 10:50:53 paltus chronyd[1761]: System clock wrong by 0.815781 seconds, > adjustment started > Dec 9 11:18:37 paltus chronyd[1761]: System clock wrong by -1.120279 > seconds, adjustment started > Dec 9 11:48:30 paltus chronyd[1761]: System clock wrong by -0.576473 > seconds, adjustment started > Dec 9 11:50:38 paltus chronyd[1761]: System clock wrong by 0.728826 seconds, > adjustment started > Dec 9 11:52:46 paltus chronyd[1761]: System clock wrong by 0.561940 seconds, > adjustment started > Dec 9 11:57:02 paltus chronyd[1761]: System clock wrong by -0.786041 > seconds, adjustment started > Dec 9 12:18:22 paltus chronyd[1761]: System clock wrong by -0.559688 > seconds, adjustment started > Dec 9 12:20:30 paltus chronyd[1761]: System clock wrong by 0.712789 seconds, > adjustment started > Dec 9 12:26:55 paltus chronyd[1761]: System clock wrong by -0.511155 > seconds, adjustment started > Dec 9 12:40:47 paltus chronyd[1761]: System clock wrong by 0.721966 seconds, > adjustment started > Dec 9 12:47:11 paltus chronyd[1761]: System clock wrong by -1.053725 > seconds, adjustment started > Dec 9 13:01:03 paltus chronyd[1761]: System clock wrong by 0.921670 seconds, > adjustment started > Dec 9 13:21:20 paltus chronyd[1761]: System clock wrong by 0.948468 seconds, > adjustment started > ... > > As far as I can guess, the problem is that the clock is > different on different CPUs (or, rather, cores), so when > the process (chronyd) gets moved from one core to another, > it loses its minds and behaves like the above, unable to > "understand" what's going on, and the result is that it > speeds up the clock on a core where it should slow the > clock down and vise versa, resulting it a mess. > > The machine is an IBM xSeries 3200, single CPU dual-core > PentiumD processor: > > vendor_id : GenuineIntel > cpu family : 15 > model : 6 > model name : Intel(R) Pentium(R) D CPU 2.80GHz > stepping : 4 > cpu MHz : 2800.000 > cache size : 2048 KB > > The system says that TSC clocksource is unstable and switches > to jiffies: > > Clocksource tsc unstable (delta = 70021228 ns) > Time: jiffies clocksource has been installed. > > There are two things I noticed so far. > > First, this problem is quite strongly related to cpufreq. > Namely, without cpufreq scaling (with the default "performance" > governor), everything works correctly. After switching to > "ondemand" governor, chronyd goes mad as shown above. And > after switching back to "performance", after some time (needed > for the madness to slowly stop), the problem vanishes again. > By the way, I'm not exactly sure but it seems that the above > message (about TSC being unstable) is triggered right when > switching to "ondemand" governor. > > And second, I've only seen the whole problem with 32bits > kernel. x86-64 kernel (with the exact same config) does > not expose it in any way, be it cpufreq or not. > > This last point is the very interesting one, because it > suggests that the bug is in kernel somewhere, more, that > it's in i386 part of it. > > Sadly I've only a limited testing abilities on this machine, > as it's our main production (file) server (so I can only > test something on it during weekends and late evenings). > > Similar (but at much smaller extent/scale) behavior has been > observed on several machines with AMD X2-64 processors as well, > also related to cpufreq, but this time both 64 and 32 bits > kernels behaves the same, and it seems this is a different > issue. > > Any recommendations for me to try to debug it further, please?.. > > Thanks. > > /mjt > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/