> On 2022/Apr/25, at 14:09, Sergei Trofimovich <sly...@gmail.com> wrote:
>
> On Mon, 25 Apr 2022 15:07:58 +0000
> Pedro Miguel Justo <pm...@texair.net> wrote:
>
>>> On 2022/Apr/25, at 01:22, Pedro Miguel Justo <pm...@texair.net> wrote:
>>>
>>>
>>>
>>>> On 2022/Apr/25, at 01:14, Frank Scheiner <frank.schei...@web.de> wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> On 25.04.22 10:09, John Paul Adrian Glaubitz wrote:
>>>>>> From what I can understand by the information in the bugcheck, this is
>>>>>> somewhat related to a violation
>>>>>> in parameter copy from user to kernel during some boot-time, crypto,
>>>>>> self-test. Does that sound right?
>>>>>> If that is the case, how would this be related to FW?
>>>>>
>>>>> I'm not claiming that it must be related to the firmware, I'm just saying
>>>>> that I don't see this problem
>>>>> on my RX2660 at all and I have even reinstalled it recently with one of
>>>>> the latest firmware images
>>>>> without having to pass any parameter to the command line.
>>>>
>>>> A difference between Adrian's rx2660 and Pedro's rx2660 is Montecito
>>>> left and Montvale right.
>>>>
>>>> But could still be multiple other reasons we haven't looked at yet in
>>>> detail:
>>>>
>>>> * amount of memory installed
>>>> * SMT enabled or not
>>>> * number of processor modules installed
>>>>
>>>> It might be possible for me to check on my rx2660s (one with Montvale
>>>> and one with Montecito(s)) tomorrow. I will then also look at my other
>>>> Itanium gear and gather relevant information.
>>>>
>>>
>>> Yes, this sounds more likely to me too.
>>>
>>> The crypto self-tests seem to be an innocent bystander here. I tried
>>> booting the most recent kernel with the option “cryptomgr.notests” and it
>>> went much farther.
>>> Alas it still failed with another buffer copy validation
>>> for a different caller altogether:
>>>
>>> [ 3.836466] [<a000000101353690>] usercopy_abort+0x120/0x130
>>> [ 3.836466] sp=e0000001000cfdf0 bsp=e0000001000c9388
>>> [ 3.836466] [<a0000001004c5660>] __check_object_size+0x3c0/0x420
>>> [ 3.836466] sp=e0000001000cfe00 bsp=e0000001000c9350
>>> [ 3.836466] [<a000000100570030>] sys_getcwd+0x250/0x420
>>> [ 3.836466] sp=e0000001000cfe00 bsp=e0000001000c92c8
>>> [ 3.836466] [<a00000010000c860>] ia64_ret_from_syscall+0x0/0x20
>>> [ 3.836466] sp=e0000001000cfe30 bsp=e0000001000c92c8
>>> [ 3.836466] [<a000000000040720>] ia64_ivt+0xffffffff00040720/0x400
>>> [ 3.836466] sp=e0000001000d0000 bsp=e0000001000c92c8
>>>
>>> This suggests the bug might be in the logic validating these buffers
>>> against the allocations (heap, span, etc).
>>>
>>> I don’t know why hardened_usercopy=off is not being observed by the kernel.
>>> As a work-around I am copying myself a new kernel with
>>> CONFIG_HARDENED_USERCOPY disabled at the source.
>>>
>>
>> Even with kernel "Linux debian 4.19.0-5-mckinley #1 SMP Debian 4.19.37-5
>> (2019-06-19) ia64 GNU/Linux"
>>
>> Things are still not 100%. After a few hours into building the kernel it
>> started crashing also with usercopy validations but, this time, the other
>> way around. And because it was the other way around, it led to process
>> termination instead of full-blown bugcheck. This could be related or not.
>> Could very well be a different bug that happens to manifest itself round the
>> same validation.
>>
>> CC [M] drivers/net/wireless/realtek/rtw88/rtw8822be.o
>> LD [M] drivers/net/wireless/realtek/rtw88/rtw88_8822be.o
>> CC [M] drivers/net/wireless/realtek/rtw88/rtw8822c.o
>> Segmentation fault
>> make[5]: *** [scripts/Makefile.build:293: drivers/net/wireless/realtek/rtw88/rtw8822c.o] Error 139
>> make[5]: *** Deleting file 'drivers/net/wireless/realtek/rtw88/rtw8822c.o'
>> make[4]: *** [scripts/Makefile.build:555: drivers/net/wireless/realtek/rtw88] Error 2
>> make[3]: *** [scripts/Makefile.build:555: drivers/net/wireless/realtek] Error 2
>> make[2]: *** [scripts/Makefile.build:555: drivers/net/wireless] Error 2
>> make[1]: *** [scripts/Makefile.build:555: drivers/net] Error 2
>> make: *** [Makefile:1855: drivers] Error 2
>> pmsjt@debian:~/linux-source-5.17$ make
>>
>> Message from syslogd@debian at Apr 25 07:58:08 ...
>> kernel:[23420.984012] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 1916912, size 8)!
>>
>> Message from syslogd@debian at Apr 25 07:58:08 ...
>> kernel:[23421.268009] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 1818608, size 8)!
>> HOSTCC scripts/sign-file
>> CALL scripts/checksyscalls.sh
>> <stdin>:1517:2: warning: #warning syscall clone3 not implemented [-Wcpp]
>> CALL scripts/atomic/check-atomics.sh
>> CHK include/generated/compile.h
>> make[2]: *** [scripts/Makefile.build:294: arch/ia64/kernel/signal.o] Segmentation fault
>>
>> Message from syslogd@debian at Apr 25 07:58:11 ...
>> kernel:[23423.626254] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 1933296, size 8)!
>> make[1]: *** [scripts/Makefile.build:555: arch/ia64/kernel] Error 2
>> make: *** [Makefile:1855: arch/ia64] Error 2
>
Hi Sergei,

> In my understanding hardened_usercopy=on is completely broken on ia64
> today. It can't run any userspace. Even init process would not survive
> machine boot. At least that's what I experienced on rx3600.
>
> Thus I think if your system survives that much time I would guess
> that you have hardened_usercopy=off in full effect at least at boot.
>

I want to make sure there is no confusion here. My system only 'survives' this long when I am using the 4.19 kernel (even when hardened_usercopy=off is not present). With kernels more recent than that, the system bugchecks very early in boot even if hardened_usercopy=off is present.

> I would speculate it's some kind of memory corruption around
> 'bypass_usercopy_checks' key.
>
> Worth adding a few printk()s to mm/usercopy.c into 'usercopy_abort()'
> and into 'set_hardened_usercopy()' just to make sure 'bypass_usercopy_checks'
> has expected 'true' setting at boot time and at crash time.

Right - we definitely need more context about the root cause and characteristics of the bug. When the failure happens, is the (pointer, range) of the copy really out of whack, or is the validation code misreading the boundaries and failing over-eagerly?

>
> --
>
> Sergei
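
P.S. To frame what the extra printk()s should capture, here is a rough user-space sketch of the interval test. It is modeled on my reading of the overlaps() helper in mm/usercopy.c and is not copied from any particular kernel version. The names range_overlaps and report_offset are hypothetical, and the region bounds mentioned below are placeholders, not values read from a real system.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Half-open interval test: does [ptr, ptr + n) intersect [low, high)?
 * Assumption: this mirrors the kernel's overlaps() helper as I understand it.
 */
bool range_overlaps(uint64_t ptr, uint64_t n, uint64_t low, uint64_t high)
{
    uint64_t check_low = ptr;
    uint64_t check_high = check_low + n;

    /* No overlap if the copy sits entirely above or entirely below. */
    if (check_low >= high || check_high <= low)
        return false;
    return true;
}

/*
 * Hypothetical helper: distance of the copy from the start of the region,
 * the kind of number printed as "offset N" in the usercopy messages.
 * Only meaningful when ptr >= low.
 */
uint64_t report_offset(uint64_t ptr, uint64_t low)
{
    return ptr - low;
}
```

With a placeholder region starting at 0xa000000100000000 and 0x2000000 bytes long, an 8-byte copy at base + 1916912 overlaps and would be rejected, while an 8-byte copy that ends exactly at base would not. Logging the pointer, the size and both bounds at abort time should show which side of that test is misbehaving.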