I'm seeing an easily provoked livelock on quad-CPU sparc64 machines running RELENG_5. It's hard to get a good trace because the processes running on other CPUs cannot be traced from DDB, but I've been lucky a few times:
db> show alllocks
Process 15 (swi1: net) thread 0xfffff8001fb07480 (100008)
 exclusive sleep mutex so_snd r = 0 (0xfffff800178432a8) locked @ netinet/tcp_input.c:2189
 exclusive sleep mutex inp (tcpinp) r = 0 (0xfffff800155c3b08) locked @ netinet/tcp_input.c:744
 exclusive sleep mutex tcp r = 0 (0xc0bdf788) locked @ netinet/tcp_input.c:617
db> wh 15
Tracing pid 15 tid 100008 td 0xfffff8001fb07480
sab_intr() at sab_intr+0x40
psycho_intr_stub() at psycho_intr_stub+0x8
intr_fast() at intr_fast+0x88
-- interrupt level=0xd pil=0 %o7=0xc01a0040 --
mb_free_ext() at mb_free_ext+0x28
sbdrop_locked() at sbdrop_locked+0x19c
tcp_input() at tcp_input+0x2aa0
ip_input() at ip_input+0x964
netisr_processqueue() at netisr_processqueue+0x7c
swi_net() at swi_net+0x120
ithread_loop() at ithread_loop+0x24c
fork_exit() at fork_exit+0xd4
fork_trampoline() at fork_trampoline+0x8
db>

That code is here in mb_free_ext():

	/*
	 * This is tricky.  We need to make sure to decrement the
	 * refcount in a safe way but to also clean up if we're the
	 * last reference.  This method seems to do it without race.
	 */
	while (dofree == 0) {
		cnt = *(m->m_ext.ref_cnt);
		if (atomic_cmpset_int(m->m_ext.ref_cnt, cnt, cnt - 1)) {
			if (cnt == 1)
				dofree = 1;
			break;
		}
	}

The faulting PC is here:

	mb_free_ext+0x24:	casa	0x4 , %g2, %g1
	mb_free_ext+0x28:	subcc	%g1, %g2, %g0

which is inside the atomic_cmpset_int() (i.e. it's probably spinning in the loop). Can anyone see if there's a problem with this code, or perhaps with the sparc64 implementation of atomic_cmpset_int()?

Kris