On Mon, 25 Apr 2022 15:07:58 +0000 Pedro Miguel Justo <pm...@texair.net> wrote:
>
> On 2022/Apr/25, at 01:22, Pedro Miguel Justo <pm...@texair.net> wrote:
> >
> >> On 2022/Apr/25, at 01:14, Frank Scheiner <frank.schei...@web.de> wrote:
> >>
> >> Hi guys,
> >>
> >> On 25.04.22 10:09, John Paul Adrian Glaubitz wrote:
> >>>> From what I can understand by the information in the bugcheck, this is
> >>>> somewhat related to a violation
> >>>> in parameter copy from user to kernel during some boot-time, crypto,
> >>>> self-test. Does that sound right?
> >>>> If that is the case, how would this be related to FW?
> >>>
> >>> I'm not claiming that it must be related to the firmware, I'm just saying
> >>> that I don't see this problem
> >>> on my RX2660 at all and I have even reinstalled it recently with one of
> >>> the latest firmware images
> >>> without having to pass any parameter to the command line.
> >>
> >> A difference between Adrian's rx2660 and Pedro's rx2660 is Montecito
> >> left and Montvale right.
> >>
> >> But could still be multiple other reasons we haven't looked at yet in
> >> detail:
> >>
> >> * amount of memory installed
> >> * SMT enabled or not
> >> * number of processor modules installed
> >>
> >> It might be possible for me to check on my rx2660s (one with Montvale
> >> and one with Montecito(s)) tomorrow. I will then also look at my other
> >> Itanium gear and gather relevant information.
> >>
> >
> > Yes, this sounds more likely to me too.
> >
> > The crypto self-tests seem to be an innocent bystander here. I tried
> > booting the most recent kernel with the option “cryptomgr.notests” and it
> > went much farther. Alas it still failed with another buffer copy validation
> > for a different caller altogether:
> >
> > [    3.836466]  [<a000000101353690>] usercopy_abort+0x120/0x130
> > [    3.836466]                                 sp=e0000001000cfdf0
> > bsp=e0000001000c9388
> > [    3.836466]  [<a0000001004c5660>] __check_object_size+0x3c0/0x420
> > [    3.836466]                                 sp=e0000001000cfe00
> > bsp=e0000001000c9350
> > [    3.836466]  [<a000000100570030>] sys_getcwd+0x250/0x420
> > [    3.836466]                                 sp=e0000001000cfe00
> > bsp=e0000001000c92c8
> > [    3.836466]  [<a00000010000c860>] ia64_ret_from_syscall+0x0/0x20
> > [    3.836466]                                 sp=e0000001000cfe30
> > bsp=e0000001000c92c8
> > [    3.836466]  [<a000000000040720>] ia64_ivt+0xffffffff00040720/0x400
> > [    3.836466]                                 sp=e0000001000d0000
> > bsp=e0000001000c92c8
> >
> > This suggests the bug might be in the logic validating these buffers
> > against the allocations (heap, span, etc).
> >
> > I don’t know why hardened_usercopy=off is not being observed by the kernel.
> > As a work-around I am compiling myself a new kernel with
> > CONFIG_HARDENED_USERCOPY disabled at the source.
> >
>
> Even with kernel "Linux debian 4.19.0-5-mckinley #1 SMP Debian 4.19.37-5
> (2019-06-19) ia64 GNU/Linux"
>
> Things are still not 100%. After a few hours into building the kernel it
> started crashing also with usercopy validations but, this time, the other way
> around. And because it was the other way around, it led to process
> termination instead of full-blown bugcheck. This could be related or not.
> Could very well be a different bug that happens to manifest itself around the
> same validation.
>
>   CC [M]  drivers/net/wireless/realtek/rtw88/rtw8822be.o
>   LD [M]  drivers/net/wireless/realtek/rtw88/rtw88_8822be.o
>   CC [M]  drivers/net/wireless/realtek/rtw88/rtw8822c.o
> Segmentation fault
> make[5]: *** [scripts/Makefile.build:293:
> drivers/net/wireless/realtek/rtw88/rtw8822c.o] Error 139
> make[5]: *** Deleting file 'drivers/net/wireless/realtek/rtw88/rtw8822c.o'
> make[4]: *** [scripts/Makefile.build:555: drivers/net/wireless/realtek/rtw88]
> Error 2
> make[3]: *** [scripts/Makefile.build:555: drivers/net/wireless/realtek] Error
> 2
> make[2]: *** [scripts/Makefile.build:555: drivers/net/wireless] Error 2
> make[1]: *** [scripts/Makefile.build:555: drivers/net] Error 2
> make: *** [Makefile:1855: drivers] Error 2
> pmsjt@debian:~/linux-source-5.17$ make
>
> Message from syslogd@debian at Apr 25 07:58:08 ...
>  kernel:[23420.984012] usercopy: Kernel memory overwrite attempt detected to
> linear kernel text (offset 1916912, size 8)!
>
> Message from syslogd@debian at Apr 25 07:58:08 ...
>  kernel:[23421.268009] usercopy: Kernel memory overwrite attempt detected to
> linear kernel text (offset 1818608, size 8)!
>   HOSTCC  scripts/sign-file
>   CALL    scripts/checksyscalls.sh
> <stdin>:1517:2: warning: #warning syscall clone3 not implemented [-Wcpp]
>   CALL    scripts/atomic/check-atomics.sh
>   CHK     include/generated/compile.h
> make[2]: *** [scripts/Makefile.build:294: arch/ia64/kernel/signal.o]
> Segmentation fault
>
> Message from syslogd@debian at Apr 25 07:58:11 ...
>  kernel:[23423.626254] usercopy: Kernel memory overwrite attempt detected to
> linear kernel text (offset 1933296, size 8)!
> make[1]: *** [scripts/Makefile.build:555: arch/ia64/kernel] Error 2
> make: *** [Makefile:1855: arch/ia64] Error 2

As I understand it, hardened_usercopy=on is completely broken on ia64
today. It can't run any userspace: even the init process would not
survive machine boot. At least that's what I experienced on an rx3600.

So if your system survives for that long, I would guess that
hardened_usercopy=off is in full effect, at least at boot. I would
speculate that it's some kind of memory corruption around the
'bypass_usercopy_checks' key.

It's worth adding a few printk()s to mm/usercopy.c, in 'usercopy_abort()'
and in 'set_hardened_usercopy()', just to make sure
'bypass_usercopy_checks' has the expected 'true' setting both at boot
time and at crash time.

-- 
Sergei
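
To make the printk() idea concrete, here is a rough user-space model of
the two places where the key would be logged. It is only a sketch, not a
patch against mm/usercopy.c: the static key is modelled as a plain bool,
pr_info() is a stand-in macro around printf(), and 'enable_checks',
'parse_hardened_usercopy()' and the bool return value of
'usercopy_abort()' are assumptions made for illustration. In the kernel
the key state would presumably be read with something like
static_key_enabled(&bypass_usercopy_checks).

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the kernel's pr_info()/printk(); assumption for this model. */
#define pr_info(...) printf(__VA_ARGS__)

/* Models the 'bypass_usercopy_checks' static key as a plain bool. */
static bool bypass_usercopy_checks;

/* Hypothetical: parsed value of the "hardened_usercopy=" boot parameter. */
static bool enable_checks = true;

/* Hypothetical parser: accepts "on"/"off", returns -1 on anything else. */
static int parse_hardened_usercopy(const char *str)
{
	if (strcmp(str, "off") == 0) {
		enable_checks = false;
		return 0;
	}
	if (strcmp(str, "on") == 0) {
		enable_checks = true;
		return 0;
	}
	return -1;
}

/* Boot-time hook: flip the key if checks were disabled, then log its state. */
static int set_hardened_usercopy(void)
{
	if (!enable_checks)
		bypass_usercopy_checks = true;
	pr_info("usercopy: boot: enable_checks=%d bypass=%d\n",
		enable_checks, bypass_usercopy_checks);
	return 1;
}

/*
 * Crash-time hook: log the key state seen when an abort is reported.
 * Returns that state; 'true' here means the check ran although the key
 * said it should have been bypassed. (The real function does not return.)
 */
static bool usercopy_abort(const char *name, unsigned long offset,
			   unsigned long len)
{
	pr_info("usercopy: abort: bypass=%d (%s, offset %lu, size %lu)\n",
		bypass_usercopy_checks, name, offset, len);
	return bypass_usercopy_checks;
}
```

Calling parse_hardened_usercopy("off"), then set_hardened_usercopy(),
then usercopy_abort() with the values from one of the syslog messages
logs the key at both points. A differing value between the boot line and
the abort line would point at the key being corrupted in between.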