Andrew,

> ... what inspired this patchset?
> Are you experiencing kthread_park() failures in practice?

I did not experience kthread_park() failures in practice. Looking at
watchdog_park_threads() from 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28
I realized that there is a theoretical corner case which would not be
handled well. Let's assume that kthread_park() would return an error
in the following flow of execution (the user changes watchdog_thresh).

  proc_watchdog_thresh
    set_sample_period()
    //
    // The watchdog_thresh and sample_period variables are now set to
    // the new value.
    //
    proc_watchdog_update
      watchdog_enable_all_cpus
        update_watchdog_all_cpus
          watchdog_park_threads

Let's say the system has eight CPUs and that kthread_park() failed to
park watchdog/4. In this example watchdog/0 .. watchdog/3 are already
parked and watchdog/5 .. watchdog/7 are not parked yet (we don't know
exactly what happened to watchdog/4). watchdog_park_threads() unparks
the threads if kthread_park() of one thread fails.

    for_each_watchdog_cpu(cpu) {
            ret = kthread_park(per_cpu(softlockup_watchdog, cpu));
            if (ret)
                    break;
    }
    if (ret) {
            for_each_watchdog_cpu(cpu)
                    kthread_unpark(per_cpu(softlockup_watchdog, cpu));
    }

watchdog/0 .. watchdog/3 will pick up the new watchdog_thresh value
when they are unparked (please see the watchdog_enable() function),
whereas watchdog/5 .. watchdog/7 will continue to use the old value
for the hard lockup detector and begin using the new value for the
soft lockup detector (kthread_unpark() sees watchdog/5 .. watchdog/7
in the unparked state, so it skips these threads). The inconsistency
which results from using different watchdog_thresh values can cause
unexpected behaviour of the lockup detectors (e.g. false positives).

The new error handling that is introduced by this patch set aims to
handle the above corner case in a better way (this was my original
motivation to come up with a patch set).
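
To make the corner case above easier to follow, here is a small
user-space sketch of it. This is not kernel code: all fake_* names,
struct fake_watchdog and simulate_thresh_change() are hypothetical
stand-ins invented for illustration. The model only tracks the
threshold that the hard lockup detector was last armed with, it assumes
that the thread whose park fails simply keeps running, and it assumes
that a successful park of all threads is followed by an unpark of all
threads.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical stand-in for one per-CPU watchdog thread. */
struct fake_watchdog {
	bool parked;
	int thresh_in_use;	/* value the hard lockup detector was armed with */
};

static struct fake_watchdog wd[NR_CPUS];
static int watchdog_thresh;	/* models the global sysctl value */
static int fail_cpu = -1;	/* CPU whose park fails, -1 for none */

/* Models kthread_park() with an injected failure on one CPU. */
static int fake_kthread_park(int cpu)
{
	if (cpu == fail_cpu)
		return -1;	/* assumption: thread stays unparked */
	wd[cpu].parked = true;
	return 0;
}

/* Models kthread_unpark(): threads that are not parked are skipped. */
static void fake_kthread_unpark(int cpu)
{
	if (!wd[cpu].parked)
		return;
	wd[cpu].parked = false;
	/* models watchdog_enable() picking up the current threshold */
	wd[cpu].thresh_in_use = watchdog_thresh;
}

/* Same shape as the loop quoted above: park all, roll back on error. */
static int fake_watchdog_park_threads(void)
{
	int cpu, ret = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		ret = fake_kthread_park(cpu);
		if (ret)
			break;
	}
	if (ret) {
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			fake_kthread_unpark(cpu);
	}
	return ret;
}

/*
 * Change the threshold and return how many CPUs are still armed with
 * the old value afterwards. failing_cpu == -1 means no injected error.
 */
static int simulate_thresh_change(int old_thresh, int new_thresh,
				  int failing_cpu)
{
	int cpu, stale = 0;

	assert(failing_cpu < NR_CPUS);

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		wd[cpu].parked = false;
		wd[cpu].thresh_in_use = old_thresh;
	}
	fail_cpu = failing_cpu;

	/* The global is updated before the threads are parked. */
	watchdog_thresh = new_thresh;

	if (!fake_watchdog_park_threads()) {
		/* assumption: success path unparks every thread again */
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			fake_kthread_unpark(cpu);
	}

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (wd[cpu].thresh_in_use != watchdog_thresh)
			stale++;
	return stale;
}
```

In this model, simulate_thresh_change(10, 20, 4) leaves CPUs 0 .. 3
armed with the new value and CPUs 4 .. 7 with the old one, so it
reports four stale CPUs, while a run without an injected failure
reports none.
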

However, I also think that _if_ kthread_park() would ever be changed
in the future so that it could return errors under various (other)
conditions, the patch set should prepare the watchdog code for this
possibility.

Since I did not experience kthread_park() failures in practice, I used
some instrumentation to fake error returns from kthread_park() in order
to test the patches.


Regards,

Uli


----- Original Message -----
From: "Andrew Morton" <a...@linux-foundation.org>
To: "Ulrich Obergfell" <uober...@redhat.com>
Cc: linux-kernel@vger.kernel.org, dzic...@redhat.com, atom...@redhat.com
Sent: Wednesday, September 30, 2015 1:30:36 AM
Subject: Re: [PATCH 0/5] improve handling of errors returned by kthread_park()

On Mon, 28 Sep 2015 22:44:07 +0200 Ulrich Obergfell <uober...@redhat.com> wrote:

> The original watchdog_park_threads() function that was introduced by
> commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 takes a very simple
> approach to handle errors returned by kthread_park(): It attempts to
> roll back all watchdog threads to the unparked state. However, this
> may be undesired behaviour from the perspective of the caller which
> may want to handle errors as appropriate in its specific context.
> Currently, there are two possible call chains:
>
> - watchdog suspend/resume interface
>
>     lockup_detector_suspend
>       watchdog_park_threads
>
> - write to parameters in /proc/sys/kernel
>
>     proc_watchdog_update
>       watchdog_enable_all_cpus
>         update_watchdog_all_cpus
>           watchdog_park_threads
>
> Instead of 'blindly' attempting to unpark the watchdog threads if a
> kthread_park() call fails, the new approach is to disable the lockup
> detectors in the above call chains.
> Failure becomes visible to the user as follows:
>
> - error messages from lockup_detector_suspend()
>   or watchdog_enable_all_cpus()
>
> - the state that can be read from /proc/sys/kernel/watchdog_enabled
>
> - the 'write' system call in the latter call chain returns an error
>

hm, you made me look at kthread parking.  Why does it exist?  What is a
"parked" thread anyway, and how does it differ from, say, a sleeping
one?  The 2a1d446019f9a5983ec5a335b changelog is pretty useless and the
patch added no useful documentation, sigh.

Anyway... what inspired this patchset?  Are you experiencing
kthread_park() failures in practice?  If so, what is causing them?  And
what is the user-visible effect of these failures?  This is all pretty
important context for such a patchset.