> What about this:
> - Add one more list_head to struct dev_pm_info.
> - Make dpm_prepare() create a new list for the next steps instead of moving
>   devices out of dpm_list.
> - Start an async work to carry out dpm_suspend() and make the main thread
>   do wait_for_completion_timeout() for every device in dpm_list (in the
>   reverse order).
> - If it times out, mark the device in question as unusable, possibly resume
>   the already suspended devices (except for descendants of the failed one)
>   and abort the suspend. Return a specific error code to user space so that
>   it knows what happened. [You can make this step configurable to BUG()
>   instead of doing all those things if you think that will be more useful for
>   platforms you care about.]
> - Disable future suspends.
> And analogously for resume.
>
> That should allow people to investigate what happened on a system that
> (hopefully) is not completely dead and you still can have your "reboot if
> suspend hangs" feature if you like.
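The detection half of the proposal can be sketched in user space, assuming POSIX threads. This is a hypothetical model, not kernel code: a mutex/condition pair stands in for struct completion and wait_for_completion_timeout(), a worker thread stands in for the async dpm_suspend(), and all sim_* names, suspend_worker(), suspend_with_watchdog() and the hangs flag are made up for illustration. The recovery steps (mark unusable, resume the rest, abort) are only indicated by a comment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical user-space stand-in for the kernel's struct completion. */
struct sim_completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void sim_init_completion(struct sim_completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

static void sim_complete(struct sim_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Blocks until completed, like wait_for_completion(). */
static void sim_wait(struct sim_completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Returns true if completed within timeout_ms, false on timeout
 * (loosely modelled on wait_for_completion_timeout()). */
static bool sim_wait_timeout(struct sim_completion *c, long timeout_ms)
{
	struct timespec deadline;
	int rc = 0;
	bool done;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}
	pthread_mutex_lock(&c->lock);
	while (!c->done && rc != ETIMEDOUT)
		rc = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
	done = c->done;
	pthread_mutex_unlock(&c->lock);
	return done;
}

#define NDEV 4

/* Hypothetical device record; dpm_list order is index 0..NDEV-1. */
struct sim_device {
	bool hangs;                 /* demo-only: simulate a stuck ->suspend() */
	struct sim_completion done; /* signalled when this device is suspended */
};

static struct sim_device devs[NDEV];
static struct sim_completion unstick; /* demo-only: lets a hung callback return */

/* The async "dpm_suspend()": suspends devices in reverse dpm_list order. */
static void *suspend_worker(void *unused)
{
	(void)unused;
	for (int i = NDEV - 1; i >= 0; i--) {
		if (devs[i].hangs)
			sim_wait(&unstick);
		sim_complete(&devs[i].done);
	}
	return NULL;
}

/* Main thread: waits on each device in the same (reverse) order.
 * Returns the index of the first device that timed out, or -1. */
static int suspend_with_watchdog(pthread_t *worker, long timeout_ms)
{
	sim_init_completion(&unstick);
	for (int i = 0; i < NDEV; i++)
		sim_init_completion(&devs[i].done);
	pthread_create(worker, NULL, suspend_worker, NULL);
	for (int i = NDEV - 1; i >= 0; i--)
		if (!sim_wait_timeout(&devs[i].done, timeout_ms))
			return i; /* here: mark unusable, resume others, abort */
	return -1;
}
```

Note that when suspend_with_watchdog() reports a timeout, suspend_worker() is still blocked inside the stuck callback; the sketch can only release and join it because the "hang" is artificial.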
I looked into implementing this. The problem I ran into is that there is no reliable way to cancel an async task, so the asynchronous __device_suspend() would be left racing with the recovery from a suspend timeout. We could call cancel_work_sync() as part of the recovery, but that call blocks until the running work item has finished executing, which might never happen. So doing a panic() is pretty much the only option for recovering.

- Zoran
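The cancellation problem can be illustrated with a hypothetical user-space model of cancel_work_sync()'s behaviour (this is not the workqueue implementation): a work item that has not started yet can simply be dropped, but one that is already running can only be waited for. sim_work, sim_worker(), sim_cancel_sync() and the unstick flag are illustrative names; the give_up_ms bound exists only so the demo terminates, whereas the real call waits indefinitely.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical model of one work item's life cycle. */
enum { WORK_PENDING, WORK_RUNNING, WORK_DONE, WORK_CANCELLED };

struct sim_work {
	atomic_int state;
	atomic_bool unstick; /* demo-only: lets the stuck callback return */
};

static void sim_work_init(struct sim_work *w)
{
	atomic_init(&w->state, WORK_PENDING);
	atomic_init(&w->unstick, false);
}

static void sleep_ms(long ms)
{
	struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
	nanosleep(&ts, NULL);
}

/* Worker thread: claims the item, then runs a "suspend" that is stuck
 * (e.g. a driver callback that never returns) until unstick is set. */
static void *sim_worker(void *arg)
{
	struct sim_work *w = arg;
	int expected = WORK_PENDING;

	if (!atomic_compare_exchange_strong(&w->state, &expected, WORK_RUNNING))
		return NULL; /* cancelled before it started */
	while (!atomic_load(&w->unstick))
		sleep_ms(1);
	atomic_store(&w->state, WORK_DONE);
	return NULL;
}

/* Loosely models cancel_work_sync(): a pending item is dropped, but a
 * running one can only be waited for. Returns true once the item is
 * guaranteed not to be running, false if we gave up waiting. */
static bool sim_cancel_sync(struct sim_work *w, long give_up_ms)
{
	int expected = WORK_PENDING;

	if (atomic_compare_exchange_strong(&w->state, &expected, WORK_CANCELLED))
		return true;
	for (long t = 0; t < give_up_ms; t++) {
		if (atomic_load(&w->state) != WORK_RUNNING)
			return true;
		sleep_ms(1);
	}
	return false; /* still running; nothing here can stop it */
}
```

The only bounded operation available is a timed wait, and a timed wait merely reports that the worker is still running; it cannot take the worker out of the stuck callback, so any recovery that proceeds afterwards runs concurrently with it.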