On Tue, 1 Aug 2006, Auke Kok wrote:
Stephane Doyon wrote:
The e1000_probe() function passes references to the netdev structure
before it's actually registered. In the (admittedly obscure) case where
the netdev registration fails, we are left with a dangling reference.
Specifically, e1000_probe() calls
netif_carrier_off(netdev);
before register_netdev(netdev).
(It also calls pci_set_drvdata(pdev, netdev) rather early, not sure how
important that is.)
netif_carrier_off() does linkwatch_fire_event(dev);, which in turn does
dev_hold(dev); and queues up an event with a reference to the netdev.
But the net_device reference counting mechanism only works on registered
netdevs.
Should the register_netdev() call fail, the error path does
free_netdev(netdev);, and when the event goes off, it accesses random
memory through the dangling reference.
I would recommend moving the register_netdev() call earlier.
We agree that this may be an issue and we're looking at how this mis-ordering
entered the code in the first place. I'm probably going to send a patch later
today or include it in this week-worths upstream patches later this week.
Thank you for looking at this.
We were wondering however how you encountered this problem? Did you see a
case where this race actually happened? it might be an interesting case to
look at. Or did you do this by code review only?
Yes I did see a case where this happened, but the failure in
register_netdev() was due to some unrelated kernel code modifications I
was working with. The effect of the dangling reference was an unhandled
"kernel paging request somewhere in the USB EHCI driver where some pointer
got corrupted. The USB modules were being inserted soon after the e1000. I
moved things around and eventually I put a sleep after the modprobe e1000,
and then I got a "BUG at kernel/timer.c:279!" instead, and the backtrace
showed mod_timer() <= __netdev_watchdog_up() <= ... <= dev_activate)( <=
linkwatch_run_queue()... I figured out the problem from there. In
e1000_probe(), I moved netif_carrier_off() (and pci_set_drvdata( and
netif_stop_queue() too for good measure) to after the register_netdev()
call, and that made the weird effects go away. The error path in question
is pretty obscure, but it wasn't easy working backward from the memory
corruption effect, so that's my extra incentive for reporting the problem.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html