2009/7/16 Michael Ellerman <mich...@ellerman.id.au>: > On Thu, 2009-07-16 at 09:33 -0700, Mike Mason wrote: >> Michael Ellerman wrote: >> > On Wed, 2009-07-15 at 14:43 -0700, Mike Mason wrote: >> >> This patch increments the device_node reference counter when an EEH >> >> error occurs and decrements the counter when the event has been >> >> handled. This is to prevent the device_node from being released until >> >> eeh_event_handler() has had a chance to deal with the event. We've >> >> seen cases where the device_node is released too soon when an EEH >> >> event occurs during a dlpar remove, causing the event handler to >> >> attempt to access bad memory locations. >> >> >> >> Please review and let me know of any concerns. >> > >> > Taking a reference sounds sane, but ... >> > >> >> Signed-off-by: Mike Mason <mm...@us.ibm.com> >> >> >> >> --- a/arch/powerpc/platforms/pseries/eeh_event.c 2008-10-09 >> >> 15:13:53.000000000 -0700 >> >> +++ b/arch/powerpc/platforms/pseries/eeh_event.c 2009-07-14 >> >> 14:14:00.000000000 -0700 >> >> @@ -75,6 +75,14 @@ static int eeh_event_handler(void * dumm >> >> if (event == NULL) >> >> return 0; >> >> >> >> + /* EEH holds a reference to the device_node, so if it >> >> + * equals 1 it's no longer valid and the event should >> >> + * be ignored */ >> >> + if (atomic_read(&event->dn->kref.refcount) == 1) { >> >> + of_node_put(event->dn); >> >> + return 0; >> >> + } >> > >> > That's really gross :) >> >> Agreed. I'll look for another way to determine if device is gone and >> the event should be ignored. Suggestions are welcome :-) > > Benh and I had a quick chat about it, and were wondering whether what > you really should be doing is taking a reference to the pci device > (perhaps as well as the device node). > > @@ -140,7 +149,7 @@ int eeh_send_failure_event (struct devic > if (dev) > pci_dev_get(dev); > > - event->dn = dn; > + event->dn = of_node_get(dn); > event->dev = dev; > > pci devs are refcounted too, see pci_dev_get(), so taking a reference > there would be the "right" thing to do - otherwise there's no guarantee > it still exists later, unless there's some other trick in the EEH code.
I thought that the eeh code did pci gets and puts in the right locations, perhaps I (incorrectly) assumed that this meant that the of_dn use count never dropped to zero ... I think my logic was: -- pci device init does of_node_get -- pci device shutdown does of_node_put -- pci device shutdown can never run as long as pci use count is > 0 Thus, explicit of_node_get was usually not needed. So, for example, see above: I was figuring that the pci_dev_get(dev); was enough to protect the dn too .. although maybe if dev is null, then things go wrong ... --linas _______________________________________________ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev