What's needed is some way to binary search where the culprit is.

If I understand correctly, it looks like the crash is happening in the
later stages of board bring-up? What is running before that? Can parts
be disabled or skipped to see if the problem goes away?

Another idea is to run a static analysis tool over the sources and see
whether it flags anything suspicious worth a closer look.
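For example (hypothetical invocations; how extra flags get passed through depends on the build setup), GCC's built-in analyzer or clang's scan-build can be run over a normal build:

```shell
# GCC >= 10 ships a static analyzer; diagnostics land on stderr
make EXTRAFLAGS="-fanalyzer" 2> analyzer.log

# With a clang toolchain, scan-build wraps the build and writes
# an HTML report of suspicious paths
scan-build -o report/ make
```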


On Mon, Mar 11, 2024 at 10:00 AM Gregory Nutt <spudan...@gmail.com> wrote:
>
> The reason that the error is confusing is because the error probably did
> not occur at the time of the assertion; it probably occurred much earlier.
>
> In most crashes due to heap corruption there are two players:  the
> culprit and the victim threads.  The culprit thread actually causes the
> corruption.  But at the time of the corruption, no error occurs.  The
> error will not occur until later.
>
> So sometime later, the victim thread runs, encounters the clobbered heap
> and crashes.  In this case, "AppBringup" and "rptun" are potential
> victim threads.  The fact that they crash tells you very little about the
> culprit.
>
> On 3/10/2024 6:51 PM, yfliu2008 wrote:
> > Gregory, thank you for the analysis.
> >
> >
> >
> >
> > The crashes happened during system boot, mostly in the "AppBringup" or
> > "rptun" threads, as per the assertion logs. The only other threads are
> > the "idle" and "lpwork" threads, per the sched logs; there should be no
> > others, as NSH creation is still ongoing. As for interrupts, the UART
> > and IPI are handled in kernel space and the MTIMER in NuttSBI space.
> > NSH is loaded from an RPMSGFS volume, so there is a lot of RPMSG
> > communication.
> >
> >
> >
> >
> > Is KASAN suitable for use in Kernel mode?
> >
> >
> > With MM_KASAN_ALL it reports a read access error:
> >
> >
> >
> > kasan_report: kasan detected a read access error, address at 0x708fe90,
> > size is 8, return address: 0x701aeac
> >
> > _assert: Assertion failed panic: at file: kasan/kasan.c:117 task:
> > Idle_Task process: Kernel 0x70023c0
> >
> >
> > The call stack looks like:
> >
> >
> > #0  _assert (filename=0x7060f78 "kasan/kasan.c", linenum=117, msg=0x7060ff0 "panic", regs=0x7082720 <g_last_regs>) at misc/assert.c:536
> > #1  0x0000000007010248 in __assert (filename=0x7060f78 "kasan/kasan.c", linenum=117, msg=0x7060ff0 "panic") at assert/lib_assert.c:36
> > #2  0x00000000070141d6 in kasan_report (addr=0x708fe90, size=8, is_write=false, return_address=0x701aeac <riscv_swint+114>) at kasan/kasan.c:117
> > #3  0x0000000007014412 in kasan_check_report (addr=0x708fe90, size=8, is_write=false, return_address=0x701aeac <riscv_swint+114>) at kasan/kasan.c:190
> > #4  0x000000000701468c in __asan_load8_noabort (addr=0x708fe90) at kasan/kasan.c:315
> > #5  0x000000000701aeac in riscv_swint (irq=0, context=0x708fe40, arg=0x0) at common/riscv_swint.c:133
> > #6  0x000000000701b8fe in riscv_perform_syscall (regs=0x708fe40) at common/supervisor/riscv_perform_syscall.c:45
> > #7  0x0000000007000570 in sys_call6 ()
> >
> >
> >
> > With MM_KASAN_DISABLE_READ_CHECKS=y, it reports:
> >
> >
> > _assert: Assertion failed: at file: mm_heap/mm_malloc.c:245 task: rptun
> > process: Kernel 0x704a030
> >
> >
> > The call stack is:
> >
> >
> > #0  _assert (filename=0x7056060 "mm_heap/mm_malloc.c", linenum=245, msg=0x0, regs=0x7082720 <g_last_regs>) at misc/assert.c:536
> > #1  0x000000000700df18 in __assert (filename=0x7056060 "mm_heap/mm_malloc.c", linenum=245, msg=0x0) at assert/lib_assert.c:36
> > #2  0x0000000007013082 in mm_malloc (heap=0x7089c00, size=128) at mm_heap/mm_malloc.c:245
> > #3  0x0000000007011694 in kmm_malloc (size=128) at kmm_heap/kmm_malloc.c:51
> > #4  0x000000000704efd4 in metal_allocate_memory (size=128) at .../nuttx/include/metal/system/nuttx/alloc.h:27
> > #5  0x000000000704fd8a in rproc_virtio_create_vdev (role=1, notifyid=0, rsc=0x80200050, rsc_io=0x7080408 <metal_io_region_>, priv=0x708ecd8, notify=0x704e6d2 <remoteproc_virtio_notify>, rst_cb=0x0) at open-amp/lib/remoteproc/remoteproc_virtio.c:356
> > #6  0x000000000704e956 in remoteproc_create_virtio (rproc=0x708ecd8, vdev_id=0, role=1, rst_cb=0x0) at open-amp/lib/remoteproc/remoteproc.c:957
> > #7  0x000000000704b1ee in rptun_dev_start (rproc=0x708ecd8) at rptun/rptun.c:757
> > #8  0x0000000007049ff8 in rptun_start_worker (arg=0x708eac0) at rptun/rptun.c:233
> > #9  0x000000000704a0ac in rptun_thread (argc=3, argv=0x7092010) at rptun/rptun.c:253
> > #10 0x000000000700437e in nxtask_start () at task/task_start.c:107
> >
> >
> > It looks like the heap was already corrupted by this point.
> >
> >
> >
> > I also noticed there is a "mm_checkcorruption()" function, but I am not
> > sure how to use it yet.
> >
> >
> >
> > Regards,
> > yf
> >
> >
> >
> >
> >
> > Original
> >
> >
> >
> > From: "Gregory Nutt" <spudan...@gmail.com>
> >
> > Date: 2024/3/11 1:43
> >
> > To: "dev" <dev@nuttx.apache.org>
> >
> > Subject: Re: mm/mm_heap assertion error
> >
> >
> > On 3/10/2024 4:38 AM, yfliu2008 wrote:
> > > Dear experts,
> > >
> > >
> > > When doing regression checks on K230 with a previously working Kernel
> > > mode configuration, I got an assertion error like the one below:
> > >
> > > #0  _assert (filename=0x704c598 "mm_heap/mm_malloc.c", linenum=245, msg=0x0, regs=0x7082730 <g_last_regs>)
> > > #2  0x00000000070110f0 in mm_malloc (heap=0x7089c00, size=112) at mm_heap/mm_malloc.c:245
> > > #3  0x000000000700fd74 in kmm_malloc (size=112) at kmm_heap/kmm_malloc.c:51
> > > #4  0x0000000007028d4e in elf_loadphdrs (loadinfo=0x7090550) at libelf/libelf_sections.c:207
> > > #5  0x0000000007028b0c in elf_load (loadinfo=0x7090550) at libelf/libelf_load.c:337
> > > #6  0x00000000070278aa in elf_loadbinary (binp=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at elf.c:257
> > > #7  0x00000000070293ea in load_absmodule (bin=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at binfmt_loadmodule.c:115
> > > #8  0x0000000007029504 in load_module (bin=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at binfmt_loadmodule.c:219
> > > #9  0x0000000007027674 in exec_internal (filename=0x704bca8 "/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0, actions=0x0, attr=0x7090788, spawn=true) at binfmt_exec.c:98
> > > #10 0x000000000702779c in exec_spawn (filename=0x704bca8 "/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0, actions=0x0, attr=0x7090788) at binfmt_exec.c:220
> > > #11 0x000000000700299e in nx_start_application () at init/nx_bringup.c:375
> > > #12 0x00000000070029f0 in nx_start_task (argc=1, argv=0x7090010) at init/nx_bringup.c:403
> > > #13 0x0000000007003f84 in nxtask_start () at task/task_start.c:107
> > >
> > >
> > > It looks like the mm/mm_heap data structure consistency was broken. As
> > > I am unfamiliar with these internals, I am looking forward to any hints
> > > about how to find the root cause.
> > >
> > >
> > > Regards,
> > >
> > > yf
> >
> > This does indicate heap corruption:
> >
> >      240       /* Node next must be alloced, otherwise it should be merged.
> >      241        * Its prenode(the founded node) must be free and preceding
> >      242        * should match with nodesize.
> >      243        */
> >      244
> >      245       DEBUGASSERT(MM_NODE_IS_ALLOC(next) && MM_PREVNODE_IS_FREE(next) &&
> >      246                   next->preceding == nodesize);
> >
> > Heap corruption normally occurs when there is a wild write outside of
> > the allocated memory region.  These kinds of wild writes may clobber
> > some other thread's data or directly or indirectly clobber the heap
> > metadata.  Trying to traverse the damaged heap metadata is what
> > eventually trips the assertion.
> >
> > Only a kernel thread or interrupt handler could damage the heap.
> >
> > The cause of this corruption can be really difficult to find because
> > the reported error does not occur at the moment the heap is damaged;
> > it may not manifest until sometime later.
> >
> > It is unlikely that anyone will be able to solve this by just talking
> > about it.  It might be worth increasing some kernel thread stack sizes
> > just to eliminate that common cause.
>
>
