On Wed, Dec 27, 2006 at 08:16:10AM -0800, Ben Greear wrote:
> Jarek Poplawski wrote:
> >On Fri, Dec 22, 2006 at 06:05:18AM -0800, Ben Greear wrote:
> >>Jarek Poplawski wrote:
> >>>On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:
> >>>>On 20-12-2006 03:13, Ben Greear wrote:
> >>>>>This is from 2.6.18.2 kernel with my patch set. The MAC-VLANs are in
> >>>>>active use.
> >>>>>From the backtrace, I am thinking this might be a generic problem,
> >>>>>however.
> >>>>>
> >>>>>Any ideas about what this could be? It seems to be reproducible every
> >>>>>day or
> >>>...
> >>>>If it doesn't help, I hope lockdep will be more
> >>>>precise when you'll upgrade to 2.6.19 or higher.
> >>>... or when you enable lockdep in 2.6.18 (I've
> >>>forgotten it's there alredy!).
> >>I got lucky..the system was available by ssh still. I see this in the
> >>boot logs..I assume
> >>this means lockdep is enabled? Should I have expected to see a lockdep
> >>trace in the case of
> >>his soft-lockup then?
> >>
> >>.....
> >>Dec 19 04:33:48 localhost kernel: Lock dependency validator: Copyright
> >>(c) 2006 Red Hat, Inc., Ingo Molnar
> >>Dec 19 04:33:48 localhost kernel: ...
> >>MAX_LOCKDEP_SUBCLASSES: 8
> >
> >Yes, you got it enabled in the config.
> >
> >If there is no message later about validator
> >turning off and no warnings which could point
> >at lockdep then it is working.
> >
> >But then, IMHO, there is rather small probability
> >this bug is really from lockup. Another possibility
> >is hardware irqs (timer in particular) are turned
> >off by something (maybe those hacks?) for extremely
> >long time (~10 sec.).
>
> The system hangs and does not recover (well, a few processes
> continue on the other processor for a few minutes before they
> too deadlock...)
>
> I am guessing this problem has been around for a while, but it
> is only triggered when interfaces are created, and probably only
> when UDP traffic is already running heavily on the system. Most
> systems w/out virtual devices will not trigger this sort of
> race.

I had one more look at this, considering the info about creating
interfaces, and here are some of my doubts about possible races
(I hope you'll forgive me if I totally miss some point):

- During the register procedure the real device seems to be up
  and running; vlan_rx_register is used, but I see drivers differ
  here: some of them do netif_stop and disable irqs while others
  only lock. It seems they can start doing vlan_hwaccel_rx
  directly after this (sometimes even during registration, if
  an irq happens).

- vlan_hwaccel_rx checks skb_bond_should_drop, but I'm not sure
  it is really useful here, so probably at least broadcasts and
  multicasts can use netif_rx even before vlan_dev is up (and
  your log accidentally shows multicast receive).

- Preemption is blocked for quite a long time in vlan_skb_recv
  and during netif_receive; I guess this could also be a possible
  reason for triggering the softlockup bug. I wonder if lowering
  the value of netdev_max_backlog wouldn't improve scheduling
  times.

Happy New Year,

Jarek P.