On 27.12.2021 12:31, Gleb Smirnoff wrote:
> On Fri, Dec 17, 2021 at 01:27:11PM -0600, Larry Rosenman wrote:
> L> Can someone look at the messages I posted to -CURRENT, most recent
> L> today, with random Callout(?) crashes after long (>6 hour) poudriere
> L> runs?
> L> 
> L> I have cores available.
> 
> I asked Larry to obtain a core with INVARIANTS and now we have one.
> 
> Sharing what I've found to brainstorm. The trap happens in LIST_REMOVE()
> at kern_timeout.c:488 because the entry doesn't have a prev pointer, i.e.
> it doesn't belong to any list.
> 
> #6  0xffffffff807be075 in trap_pfault (frame=0xfffffe02d3393d50, usermode=false, signo=<optimized out>, ucode=<optimized out>)
>     at /usr/src/sys/amd64/amd64/trap.c:765
> #7  <signal handler called>
> #8  0xffffffff804e5609 in callout_process (now=now@entry=100465191785818) at /usr/src/sys/kern/kern_timeout.c:488
> #9  0xffffffff80460fc5 in handleevents (now=now@entry=100465191785818, fake=fake@entry=0) at /usr/src/sys/kern/kern_clocksource.c:213
> #10 0xffffffff80461a66 in timercb (et=0xffffffff80d47980 <lapic_et>, arg=<optimized out>) at /usr/src/sys/kern/kern_clocksource.c:357
> #11 0xffffffff807e6beb in lapic_handle_timer (frame=0xfffffe02d3393f40) at /usr/src/sys/x86/x86/local_apic.c:1364
> 
> (kgdb) p *tmp
> $13 = {c_links = {le = {le_next = 0x0, le_prev = 0x0}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0,
>   c_precision = 0, c_arg = 0x0, c_func = 0x0, c_lock = 0xfffff8030521e670, c_flags = 0, c_iflags = 0, c_cpu = 0}
> 
> Useful here is the c_lock, which points to a "process lock" lock object.
> 
> This allows us to deduce that the callout belongs to the proc subsystem,
> and we can retrieve the proc it points to: c_lock - 0x128 = 0xfffff8030521e548
> It is a ccache process in the PRS_NORMAL state. And the "tmp" in our stack
> frame is its p_itcallout.
> 
> So there is something that would zero out most of the p_itcallout while
> it is scheduled?
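
For reference, the c_lock - 0x128 step quoted above is ordinary
container-of arithmetic: subtract the offset of the embedded mutex from
the mutex address to get the enclosing struct.  A minimal userland
sketch, assuming a made-up layout: mock_mtx, mock_proc, its padding and
both helper functions are hypothetical, and only the 0x128 offset and the
two addresses come from the kgdb session.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct mtx; only its address matters here. */
struct mock_mtx {
	uintptr_t mtx_lock;
};

/*
 * Hypothetical stand-in for struct proc.  The padding is chosen so that
 * p_mtx lands at offset 0x128, the value used in the kgdb session; the
 * real offset depends on the kernel version and configuration.
 */
struct mock_proc {
	char		p_pad[0x128];
	struct mock_mtx	p_mtx;
};

/* Fail the build if the mock layout does not match the assumed offset. */
static_assert(offsetof(struct mock_proc, p_mtx) == 0x128,
    "mock p_mtx offset must match the offset used in kgdb");

/* Recover the enclosing mock_proc from a pointer to its p_mtx member. */
static struct mock_proc *
proc_from_lock(struct mock_mtx *lock)
{
	return ((struct mock_proc *)((char *)lock -
	    offsetof(struct mock_proc, p_mtx)));
}

/* The same arithmetic on raw addresses, as done by hand in the debugger. */
static uint64_t
proc_addr_from_lock_addr(uint64_t lock_addr)
{
	return (lock_addr - offsetof(struct mock_proc, p_mtx));
}
```

Feeding the dumped c_lock value 0xfffff8030521e670 to
proc_addr_from_lock_addr() gives 0xfffff8030521e548, the proc address
quoted above.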

So something carefully zeroes it but keeps the lock pointer...  The only
way that comes to mind is callout_init_mtx() in do_fork(), if we assume
the process has exited and the struct proc was reused.  I guess that
could happen if we somehow leak a scheduled callout in exit1().  Maybe we
could add some more assertions there to try to catch a callout that is
still active.
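
To make the hypothesis concrete, here is a minimal userland sketch, not
the kernel code.  It assumes that initialization zeroes the whole
structure and then stores the lock pointer, which is what the dump
suggests.  All mock_* names and MOCK_CALLOUT_PENDING are hypothetical;
only the <sys/queue.h> list macros are real.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/queue.h>

#define MOCK_CALLOUT_PENDING	0x1	/* hypothetical "scheduled" flag */

/* Hypothetical, cut-down stand-in for struct callout. */
struct mock_callout {
	LIST_ENTRY(mock_callout) c_links;
	void	*c_lock;
	int	 c_iflags;
};
LIST_HEAD(mock_bucket, mock_callout);

/* Assumed init behaviour: zero everything, then store the lock pointer. */
static void
mock_callout_init_mtx(struct mock_callout *c, void *lock)
{
	memset(c, 0, sizeof(*c));
	c->c_lock = lock;
}

/* Link the callout into a bucket and mark it as scheduled. */
static void
mock_callout_schedule(struct mock_bucket *b, struct mock_callout *c)
{
	LIST_INSERT_HEAD(b, c, c_links);
	c->c_iflags |= MOCK_CALLOUT_PENDING;
}

static bool
mock_callout_pending(const struct mock_callout *c)
{
	return ((c->c_iflags & MOCK_CALLOUT_PENDING) != 0);
}

/*
 * True when the bucket still points at the callout but the callout has
 * lost its back pointer: the state in which LIST_REMOVE() would fault.
 */
static bool
mock_callout_is_stale(const struct mock_bucket *b,
    const struct mock_callout *c)
{
	return (LIST_FIRST(b) == c && c->c_links.le_prev == NULL);
}

/* The kind of exit-time assertion suggested above. */
static void
mock_exit_check(const struct mock_callout *c)
{
	assert(!mock_callout_pending(c));
}
```

Scheduling a mock callout and then calling mock_callout_init_mtx() on it
again leaves the bucket pointing at an entry with zeroed links and an
intact c_lock, much like the dumped "tmp".  Calling mock_exit_check()
before the structure is reused would trip on the still-pending callout
instead.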

-- 
Alexander Motin
