I'm running a program which does lots of semop calls. After a while something strange starts happening. sleep hangs forever and if I run date I see the clock on the machine not advancing.
In fact if I run "while : ; do date +%s,%N" I see the clock advancing find for 4.684 seconds then warping back that far. Repeatedly. It always warps back to the same point too. I'm getting tons of "soft lockup detected" in random places. Often in update_process_times or in try_atomic_semop which is what the program spends all its time doing so I guess it's not too surprising. This is on 2.6.21-1.3194.fc7. I've been trying different kernels but this is the first point where I've narrowed down the symptoms to something reproducible. Likely related (though surely this isn't the expected symptom?) there are also memory errors in the logs. I'm not sure how to tell which DIMM this is on though. MC1 means the memory controller on the second processor? Or the second core on the first processor? And row 1 channel 1 means the second DIMM? EDAC k8 MC1: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic) EDAC MC1: CE page 0xaf42f, offset 0xfe0, grain 8, syndrome 0xf9b2, row 1, channel 1, label "": k8_edac EDAC MC1: CE - no information available: k8_edac Error Overflow set EDAC k8 MC1: extended error code: ECC chipkill x4 error The memory errors don't occur at the same time as the soft lockup and timer problems though. [EMAIL PROTECTED] log]# while read x ; do date ; done Tue Aug 14 17:30:46 EDT 2007 Tue Aug 14 17:30:46 EDT 2007 Tue Aug 14 17:30:47 EDT 2007 Tue Aug 14 17:30:48 EDT 2007 Tue Aug 14 17:30:48 EDT 2007 Tue Aug 14 17:30:45 EDT 2007 Tue Aug 14 17:30:45 EDT 2007 Tue Aug 14 17:30:46 EDT 2007 Tue Aug 14 17:30:47 EDT 2007 Tue Aug 14 17:30:47 EDT 2007 Tue Aug 14 17:30:48 EDT 2007 Tue Aug 14 17:30:48 EDT 2007 Tue Aug 14 17:30:44 EDT 2007 Tue Aug 14 17:30:45 EDT 2007 ... $ while : ; do date +%s,%N ; done ... 1187127049,398740671 1187127049,400855143 1187127049,402963469 1187127049,404834931 1187127044,720649164 1187127044,722776205 1187127044,724720011 1187127044,726728620 ... BUG: soft lockup detected on CPU#1! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c0403d60>] default_idle+0x0/0x3e [<c0403d8c>] default_idle+0x2c/0x3e [<c04033d6>] cpu_idle+0xa3/0xc4 ======================= BUG: soft lockup detected on CPU#3! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c04b4d42>] try_atomic_semop+0xce/0x10e [<c04b4dda>] update_queue+0x28/0x7d [<c04b544f>] sys_semtimedop+0x46f/0x635 [<c043ad9a>] clocksource_get_next+0x39/0x3f [<c042e38b>] do_timer+0x4f2/0x6f3 [<c043dc75>] timer_stats_update_stats+0x142/0x14d [<c043d245>] tick_sched_timer+0x0/0xbb [<c0420bea>] enqueue_task+0x29/0x39 [<c0420e47>] __activate_task+0x1c/0x29 [<c043ad9a>] clocksource_get_next+0x39/0x3f [<c043ad9a>] clocksource_get_next+0x39/0x3f [<c042e38b>] do_timer+0x4f2/0x6f3 [<c043ad9a>] clocksource_get_next+0x39/0x3f [<c042e38b>] do_timer+0x4f2/0x6f3 [<c043dc75>] timer_stats_update_stats+0x142/0x14d [<c043d245>] tick_sched_timer+0x0/0xbb [<c043d245>] tick_sched_timer+0x0/0xbb [<c042d6e7>] getnstimeofday+0x30/0xb6 [<c042d6e7>] getnstimeofday+0x30/0xb6 [<c043c220>] clockevents_program_event+0xb2/0xb9 [<c040592c>] reschedule_interrupt+0x28/0x30 [<c04e4f08>] rb_insert_color+0x8c/0xad [<c043917f>] __remove_hrtimer+0x64/0x6b [<c043986c>] hrtimer_try_to_cancel+0x65/0x6d [<c042a0ce>] do_setitimer+0xb7/0x2d3 [<c0451069>] audit_syscall_exit+0x294/0x2b0 [<c0409298>] sys_ipc+0x5e/0x1bb [<c0407d2e>] do_syscall_trace+0xbb/0xc2 [<c0404f70>] syscall_call+0x7/0xb ======================= BUG: soft lockup detected on CPU#3! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c05a8449>] neigh_timer_handler+0x0/0x26f [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c05a8449>] neigh_timer_handler+0x0/0x26f [<c042dcee>] run_timer_softirq+0x10a/0x17b [<c05a8449>] neigh_timer_handler+0x0/0x26f [<c042a588>] it_real_fn+0x12/0x16 [<c042b2e5>] __do_softirq+0x5d/0xba [<c04071b7>] do_softirq+0x59/0xb1 [<c042b1c7>] ksoftirqd+0x0/0xc1 [<c042b226>] ksoftirqd+0x5f/0xc1 [<c0436da8>] kthread+0xb0/0xd8 [<c0436cf8>] kthread+0x0/0xd8 [<c0405b3f>] kernel_thread_helper+0x7/0x10 ======================= BUG: soft lockup detected on CPU#3! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c0403d60>] default_idle+0x0/0x3e [<c0403d8c>] default_idle+0x2c/0x3e [<c04033d6>] cpu_idle+0xa3/0xc4 ======================= BUG: soft lockup detected on CPU#3! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c042a2e0>] do_setitimer+0x2c9/0x2d3 [<c042a35e>] sys_setitimer+0x74/0x7f [<c0404f70>] syscall_call+0x7/0xb ======================= BUG: soft lockup detected on CPU#0! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c045e707>] wb_timer_fn+0x0/0x27 [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c045e707>] wb_timer_fn+0x0/0x27 [<c042dd5a>] run_timer_softirq+0x176/0x17b [<c042b2e5>] __do_softirq+0x5d/0xba [<c04071b7>] do_softirq+0x59/0xb1 [<c042b1c7>] ksoftirqd+0x0/0xc1 [<c042b226>] ksoftirqd+0x5f/0xc1 [<c0436da8>] kthread+0xb0/0xd8 [<c0436cf8>] kthread+0x0/0xd8 [<c0405b3f>] kernel_thread_helper+0x7/0x10 ======================= BUG: soft lockup detected on CPU#0! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c04b4c94>] try_atomic_semop+0x20/0x10e [<c04b4dda>] update_queue+0x28/0x7d [<c04b544f>] sys_semtimedop+0x46f/0x635 [<c043ad9a>] clocksource_get_next+0x39/0x3f [<c042e38b>] do_timer+0x4f2/0x6f3 [<c043dc75>] timer_stats_update_stats+0x142/0x14d [<c043d245>] tick_sched_timer+0x0/0xbb [<c0420bea>] enqueue_task+0x29/0x39 [<c0420e47>] __activate_task+0x1c/0x29 [<c043ad9a>] clocksource_get_next+0x39/0x3f [<c043ad9a>] clocksource_get_next+0x39/0x3f [<c042e38b>] do_timer+0x4f2/0x6f3 [<c043dc75>] timer_stats_update_stats+0x142/0x14d [<c042e38b>] do_timer+0x4f2/0x6f3 [<c043dc75>] timer_stats_update_stats+0x142/0x14d [<c0420bea>] enqueue_task+0x29/0x39 [<c0420e47>] __activate_task+0x1c/0x29 [<c0422295>] try_to_wake_up+0x3aa/0x3b4 [<c042d6e7>] getnstimeofday+0x30/0xb6 [<c042d6e7>] getnstimeofday+0x30/0xb6 [<c043c220>] clockevents_program_event+0xb2/0xb9 [<c043cfe9>] tick_program_event+0x3a/0x59 [<c043dc75>] timer_stats_update_stats+0x142/0x14d [<c0439e57>] hrtimer_interrupt+0x18d/0x1b6 [<c042d6e7>] getnstimeofday+0x30/0xb6 [<c043c220>] clockevents_program_event+0xb2/0xb9 [<c04e4ec8>] rb_insert_color+0x4c/0xad [<c043917f>] __remove_hrtimer+0x64/0x6b [<c043986c>] hrtimer_try_to_cancel+0x65/0x6d [<c042a0ce>] do_setitimer+0xb7/0x2d3 [<c0451069>] audit_syscall_exit+0x294/0x2b0 [<c0409298>] sys_ipc+0x5e/0x1bb [<c0407d2e>] do_syscall_trace+0xbb/0xc2 [<c0404f70>] syscall_call+0x7/0xb ======================= BUG: soft lockup detected on CPU#2! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c042de8b>] process_timeout+0x0/0x5 [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c042de8b>] process_timeout+0x0/0x5 [<c042dcee>] run_timer_softirq+0x10a/0x17b [<c042de8b>] process_timeout+0x0/0x5 [<c042b2e5>] __do_softirq+0x5d/0xba [<c04071b7>] do_softirq+0x59/0xb1 [<c042b1c7>] ksoftirqd+0x0/0xc1 [<c042b226>] ksoftirqd+0x5f/0xc1 [<c0436da8>] kthread+0xb0/0xd8 [<c0436cf8>] kthread+0x0/0xd8 [<c0405b3f>] kernel_thread_helper+0x7/0x10 ======================= BUG: soft lockup detected on CPU#3! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c0403d60>] default_idle+0x0/0x3e [<c0403d8c>] default_idle+0x2c/0x3e [<c04033d6>] cpu_idle+0xa3/0xc4 ======================= BUG: soft lockup detected on CPU#0! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04394c4>] ktime_get_ts+0x16/0x44 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c0403d60>] default_idle+0x0/0x3e [<c0403d8c>] default_idle+0x2c/0x3e [<c04033d6>] cpu_idle+0xa3/0xc4 [<c071bb5f>] start_kernel+0x435/0x43d [<c071b25a>] unknown_bootoption+0x0/0x202 ======================= BUG: soft lockup detected on CPU#0! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c0403d60>] default_idle+0x0/0x3e [<c0403d8c>] default_idle+0x2c/0x3e [<c04033d6>] cpu_idle+0xa3/0xc4 [<c071bb5f>] start_kernel+0x435/0x43d [<c071b25a>] unknown_bootoption+0x0/0x202 ======================= BUG: soft lockup detected on CPU#2! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04394c4>] ktime_get_ts+0x16/0x44 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c0403d60>] default_idle+0x0/0x3e [<c0403d8c>] default_idle+0x2c/0x3e [<c04033d6>] cpu_idle+0xa3/0xc4 ======================= BUG: soft lockup detected on CPU#1! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c060171d>] _spin_unlock_irqrestore+0x8/0x9 [<f89de6fe>] mdio_ctrl+0xc4/0x12b [e100] [<c05bc5bd>] ip_rcv+0x40c/0x445 [<f89e0c59>] e100_watchdog+0x0/0x325 [e100] [<f89de801>] mdio_read+0x1d/0x23 [e100] [<f89b9221>] mii_link_ok+0x1d/0x25 [mii] [<f89e0ca1>] e100_watchdog+0x48/0x325 [e100] [<c042dbd4>] cascade+0x3f/0x4f [<c042dcfa>] run_timer_softirq+0x116/0x17b [<f89e0c59>] e100_watchdog+0x0/0x325 [e100] [<c05a39b8>] net_rx_action+0x94/0x17f [<c042b2e5>] __do_softirq+0x5d/0xba [<c04071b7>] do_softirq+0x59/0xb1 [<c042b1c7>] ksoftirqd+0x0/0xc1 [<c042b226>] ksoftirqd+0x5f/0xc1 [<c0436da8>] kthread+0xb0/0xd8 [<c0436cf8>] kthread+0x0/0xd8 [<c0405b3f>] kernel_thread_helper+0x7/0x10 ======================= BUG: soft lockup detected on CPU#1! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c0403d60>] default_idle+0x0/0x3e [<c0403d8c>] default_idle+0x2c/0x3e [<c04033d6>] cpu_idle+0xa3/0xc4 ======================= BUG: soft lockup detected on CPU#2! [<c0451f3e>] softlockup_tick+0xa5/0xb4 [<c042e930>] update_process_times+0x3b/0x5e [<c043d2bd>] tick_sched_timer+0x78/0xbb [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6 [<c043d245>] tick_sched_timer+0x0/0xbb [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80 [<c04059bc>] apic_timer_interrupt+0x28/0x30 [<c0403d60>] default_idle+0x0/0x3e [<c0403d8c>] default_idle+0x2c/0x3e [<c04033d6>] cpu_idle+0xa3/0xc4 ======================= -- Gregory Stark EnterpriseDB http://www.enterprisedb.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/