On Thu, Apr 19, 2012 at 5:55 PM, Jesse Barnes <jbarnes at virtuousgeek.org> wrote:
> On Thu, 19 Apr 2012 17:52:39 +0100
> Dave Airlie <airlied at gmail.com> wrote:
>
>> On Thu, Apr 19, 2012 at 5:47 PM, Dave Airlie <airlied at gmail.com> wrote:
>> > On Thu, Apr 19, 2012 at 5:41 PM, Andy Whitcroft <apw at canonical.com>
>> > wrote:
>> >> On Thu, Apr 19, 2012 at 05:30:03PM +0100, Dave Airlie wrote:
>> >>> On Thu, Apr 19, 2012 at 5:22 PM, Andy Whitcroft <apw at canonical.com>
>> >>> wrote:
>> >>> > We have been carrying a (rather poor) patch for an issue we
>> >>> > identified in the DRM driver. This issue is triggered when a DRM
>> >>> > device is initialising and userspace attempts to open it, typically
>> >>> > in response to the sysfs device added event. Basically we allocate
>> >>> > the minor numbers making the device available, and then call the
>> >>> > drm load callback. Until this completes the device is really not
>> >>> > ready and these early opens typically lead to oopses.
>> >>> >
>> >>> > We have been using the following patch to avoid this by marking the
>> >>> > minors as in error until the load method has completed. This avoids
>> >>> > the early open by simply erroring out the opens with EAGAIN.
>> >>> > Obviously we should be delaying the open until the load method
>> >>> > completes.
>> >>> >
>> >>> > I include the existing patch for completeness (it is not really
>> >>> > ready for merging) to illustrate the issue. I think it is logical
>> >>> > that the wait should simply be delayed until the load has completed.
>> >>> > I am proposing to include a wait queue associated with the idr cache
>> >>> > for the drm minors which we can use to allow open callers to
>> >>> > wait_event_interruptible() on. I'll be putting together a prototype
>> >>> > shortly and will follow up with it.
>> >>> >
>> >>> > Thoughts?
>> >>>
>> >>> Couldn't we just delay registering things until the driver is ready to
>> >>> accept an open?
>> >>>
>> >>> Granted the midlayer of drm doesn't make that easy,
>> >>
>> >> It seems that we need the dri minor allocated before we hit the load
>> >> function as things are done right now.
>> >>
>> >>> thanks for sending this out, it keeps falling off my radar, I don't
>> >>> think I've ever seen this reported on RHEL/Fedora, which makes me
>> >>> wonder what we are doing that makes us lucky.
>> >>
>> >> We never hit it until we started doing things earlier and quicker. I
>> >> first found it in the prettification of boot so we were keen to get
>> >> plymouth running as soon as possible. That led to random panics and me
>> >> finding this bug. The window is tiny as far as I know and it tends to
>> >> be specific machines and specific package combinations which trigger
>> >> it reliably.
>> >>
>> >> I suspect that a proper fix would allow delaying the registration as
>> >> you suggest but in the interim a wait would at least avoid the issues
>> >> we are seeing. I will see how awful it looks.
>> >
>> > Just to confirm it's the drm_sysfs_device_add that causes the race we
>> > care about.
>> >
>> > it needs to happen after the driver is happy. Since it calls
>> > device_register and that is what triggers udev magic to load the
>> > userspace.
>> >
>> > If you have a userspace app banging on a static device node that might
>> > need another set of fun fixes.
>>
>> Okay the sysfs add and the idr_replace are the things we need to delay.
>
> Since you can still get at things with a static node, it seems like
> locking is the real issue here? Is there no mutex we can take across
> init to block any openers until we're done?

Well, the idr_replace should be the thing that matters: before it, openers
get -ENODEV; after it, they succeed. We may need a lock around that once we
fix the logic.

Dave.
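
To make the ordering above concrete, here is a rough userspace sketch. It is not DRM code and not a proposed patch: the toy_* names, the fixed-size table standing in for the minors idr, and the pthread mutex standing in for whatever lock we end up taking are all made up for illustration. The only point is that the load step runs before the pointer is published, so an opener sees either -ENODEV or a fully loaded device.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#define TOY_MAX_MINORS 16

/* Hypothetical stand-in for a drm minor; not the real struct. */
struct toy_minor {
	int loaded;	/* set once the (toy) load step has finished */
};

/* Stand-in for the minors idr: a NULL slot means "not published yet". */
static struct toy_minor *toy_minors[TOY_MAX_MINORS];
static pthread_mutex_t toy_minor_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the driver load callback. */
static void toy_load(struct toy_minor *minor)
{
	minor->loaded = 1;
}

/* Analogue of the idr_replace step: make the minor visible to openers. */
static void toy_publish(int id, struct toy_minor *minor)
{
	assert(id >= 0 && id < TOY_MAX_MINORS);
	pthread_mutex_lock(&toy_minor_lock);
	toy_minors[id] = minor;
	pthread_mutex_unlock(&toy_minor_lock);
}

/* Registration in the order discussed: load first, publish last. */
static void toy_register(int id, struct toy_minor *minor)
{
	toy_load(minor);
	toy_publish(id, minor);
}

/* Open path: -ENODEV until the minor has been published, 0 afterwards. */
static int toy_open(int id)
{
	struct toy_minor *minor;

	if (id < 0 || id >= TOY_MAX_MINORS)
		return -ENODEV;

	pthread_mutex_lock(&toy_minor_lock);
	minor = toy_minors[id];
	pthread_mutex_unlock(&toy_minor_lock);

	if (!minor)
		return -ENODEV;

	/* With load-before-publish, a visible minor is always loaded. */
	assert(minor->loaded);
	return 0;
}
```

Calling toy_open(0) before toy_register(0, &m) returns -ENODEV, and afterwards it returns 0. In this sketch the lock only covers the lookup and the publish; it does not make an early opener wait, which is what the wait queue idea would add.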