On 01.10.2017 15:06, Yanko Kaneti wrote:
> On Sun, 2017-10-01 at 14:46 +0200, Thorsten Leemhuis wrote:
>> Hi, the regression tracker here. What's the status of this issue? Was
>> the problem fixed? It seems nothing happened for more than 10 days -- or
>> did the discussion move somewhere else? Ciao, Thorsten
>
> The commit was reverted last week before rc2
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0551968add53777fddd18f4ffb4e3bbc1f646d79

I could have sworn I checked that :-/ Thx for the hint and sorry for the
noise!

Ciao, Thorsten

>> On 20.09.2017 02:30, Chuck Ebbert wrote:
>>> On Tue, 19 Sep 2017 16:51:06 +0100
>>> Marc Zyngier <marc.zyng...@arm.com> wrote:
>>>
>>>> On 19/09/17 16:40, Yanko Kaneti wrote:
>>>>> On Tue, 2017-09-19 at 16:33 +0100, Marc Zyngier wrote:
>>>>>> On 19/09/17 16:12, Yanko Kaneti wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> Fedora rawhide config here.
>>>>>>> AMD FX-8370E
>>>>>>>
>>>>>>> Bisected a problem to:
>>>>>>> 74def747bcd0 (genirq: Restrict effective affinity to interrupts
>>>>>>> actually using it)
>>>>>>>
>>>>>>> It seems to be causing stalls, short lived or long lived lockups
>>>>>>> very shortly after boot. Everything becomes jerky.
>>>>>>>
>>>>>>> The only visible in the log indication is something like :
>>>>>>> ....
>>>>>>> [ 59.802129] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
>>>>>>> [ 59.802134] clocksource: 'hpet' wd_now: 3326e7aa wd_last: 329956f8 mask: ffffffff
>>>>>>> [ 59.802137] clocksource: 'tsc' cs_now: 423662bc6f cs_last: 41dfc91650 mask: ffffffffffffffff
>>>>>>> [ 59.802140] tsc: Marking TSC unstable due to clocksource watchdog
>>>>>>> [ 59.802158] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>>>>>>> [ 59.802161] sched_clock: Marking unstable (59802142067, 15510)<-(59920871789, -118714277)
>>>>>>> [ 60.015604] clocksource: Switched to clocksource hpet
>>>>>>> [ 89.015994] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 209.660 msecs
>>>>>>> [ 89.016003] perf: interrupt took too long (1638003 > 2500), lowering kernel.perf_event_max_sample_rate to 1000
>>>>>>> ....
>>>>>>>
>>>>>>> Just reverting that commit on top of linus mainline cures all the
>>>>>>> symptoms
>>>>>>
>>>>>> Interesting. Do you still get HPET interrupts?
>>>>>
>>>>> Sorry, I might need some basic help here (i.e where do I count
>>>>> them...)
>>>>
>>>> /proc/interrupts should display them.
>>>>
>>>>> After the watchdog switches the clocksource to hpet the system is
>>>>> still somewhat alive, so I'll guess some clock is still
>>>>> ticking....
>>>>
>>>> Probably, but I suspect they're not hitting the right CPU, hence the
>>>> lockups.
>>>>
>>>> Unfortunately, my x86-foo is pretty minimal, and I'm about to drop off
>>>> the net for a few days.
>>>>
>>>> Thomas, any insight?
>>>
>>> Looking at flat_cpu_mask_to_apicid(), I don't see how 74def747bcd0
>>> can be correct:
>>>
>>>     struct cpumask *effmsk = irq_data_get_effective_affinity_mask(irqdata);
>>>     unsigned long cpu_mask = cpumask_bits(mask)[0] & APIC_ALL_CPUS;
>>>
>>>     if (!cpu_mask)
>>>             return -EINVAL;
>>>     *apicid = (unsigned int)cpu_mask;
>>>     cpumask_bits(effmsk)[0] = cpu_mask;
>>>
>>> Before that patch, this function wrote to the effective mask
>>> unconditionally. After, it only writes to effective_mask if it is
>>> already non-zero.
>>>
>>>
>>> http://news.gmane.org/find-root.php?message_id=20170919203044.560cb9f1%40gmail.com
>>>
>>> http://mid.gmane.org/20170919203044.560cb9f1%40gmail.com
>>>
>