Don Zickus <dzic...@redhat.com> writes:
> On Wed, Dec 19, 2012 at 08:51:31PM +0100, Bjørn Mork wrote:
>> commit 8d451690 ("watchdog: Fix CPU hotplug regression") cause
>> an oops or hard lockup when doing
>> 
>>  echo 0 > /proc/sys/kernel/nmi_watchdog
>>  echo 1 > /proc/sys/kernel/nmi_watchdog
>> 
>> and the kernel is booted with nmi_watchdog=1 (default)
>> 
>> Running laptop-mode-tools and disconnecting/connecting AC power
>> will cause this to trigger, making it a common failure scenario
>> on laptops.
>> 
>> Instead of bailing out of watchdog_disable() when !watchdog_enabled
>> we can initialize the hrtimer regardless of watchdog_enabled status.
>> This makes it safe to call watchdog_disable() in the nmi_watchdog=0
>> case, without the negative effect on the enabled => disabled =>
>> enabled case.
>> 
>> All these tests pass with this patch:
>> - nmi_watchdog=1
>>   echo 0 > /proc/sys/kernel/nmi_watchdog
>>   echo 1 > /proc/sys/kernel/nmi_watchdog
>> 
>> - nmi_watchdog=0
>>   echo 0 > /sys/devices/system/cpu/cpu1/online
>> 
>> - nmi_watchdog=0
>>   echo mem > /sys/power/state
>
> What about the opposite cases?  
> nmi_watchdog=1
> echo 1 > /sys/devices/system/cpu/cpu1/online

I don't see why not. But verifying it would be nice.  I thought that it
would be a simple thing to test using qemu-kvm, but it seems that the
CPU hotplugging support there isn't quite ready. The guest just dies
with "Assertion `bus->allow_hotplug' failed."

I'll go digging for alternatives, but if anyone else could verify this
then I'd appreciate it.

> Are we going to leak memory by re-initing hrtimer?  Or is that caught
> somewhere?

There are no allocations in hrtimer_init() AFAICS? It just initializes
the preallocated hrtimer struct.

> Other than that, the patch seems reasonable to me.  Basically just forcing
> the init of hrtimer so there is something to cancel on the disable side.
> Then again, I don't fully understand the original problem.

I believe the original problem was calling hrtimer_cancel on an
uninitialized hrtimer.  That is likely to blow up here:

static inline
void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
{
        raw_spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
}


Note that the lock_hrtimer_base() function protects access to
timer->base with a NULL test.  That should probably be done in unlock as
well, for symmetry?  But this is another issue and not at all urgent.




Bjørn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to