> On 2022/Apr/25, at 14:27, Pedro Miguel Justo <pm...@texair.net> wrote:
>
>> On 2022/Apr/25, at 14:09, Sergei Trofimovich <sly...@gmail.com> wrote:
>>
>> On Mon, 25 Apr 2022 15:07:58 +0000
>> Pedro Miguel Justo <pm...@texair.net> wrote:
>>
>>>> On 2022/Apr/25, at 01:22, Pedro Miguel Justo <pm...@texair.net> wrote:
>>>>
>>>>> On 2022/Apr/25, at 01:14, Frank Scheiner <frank.schei...@web.de> wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> On 25.04.22 10:09, John Paul Adrian Glaubitz wrote:
>>>>>>> From what I can understand by the information in the bugcheck, this is
>>>>>>> somewhat related to a violation
>>>>>>> in parameter copy from user to kernel during some boot-time, crypto,
>>>>>>> self-test. Does that sound right?
>>>>>>> If that is the case, how would this be related to FW?
>>>>>>
>>>>>> I'm not claiming that it must be related to the firmware, I'm just
>>>>>> saying that I don't see this problem
>>>>>> on my RX2660 at all and I have even reinstalled it recently with one of
>>>>>> the latest firmware images
>>>>>> without having to pass any parameter to the command line.
>>>>>
>>>>> A difference between Adrian's rx2660 and Pedro's rx2660 is Montecito
>>>>> left and Montvale right.
>>>>>
>>>>> But could still be multiple other reasons we haven't looked at yet in
>>>>> detail:
>>>>>
>>>>> * amount of memory installed
>>>>> * SMT enabled or not
>>>>> * number of processor modules installed
>>>>>
>>>>> It might be possible for me to check on my rx2660s (one with Montvale
>>>>> and one with Montecito(s)) tomorrow. I will then also look at my other
>>>>> Itanium gear and gather relevant information.
>>>>>
>>>>
>>>> Yes, this sounds more likely to me too.
>>>>
>>>> The crypto self-tests seem to be an innocent bystander here. I tried
>>>> booting the most recent kernel with the option “cryptomgr.notests” and it
>>>> went much farther. Alas it still failed with another buffer copy
>>>> validation for a different caller altogether:
>>>>
>>>> [ 3.836466] [<a000000101353690>] usercopy_abort+0x120/0x130
>>>> [ 3.836466] sp=e0000001000cfdf0
>>>> bsp=e0000001000c9388
>>>> [ 3.836466] [<a0000001004c5660>] __check_object_size+0x3c0/0x420
>>>> [ 3.836466] sp=e0000001000cfe00
>>>> bsp=e0000001000c9350
>>>> [ 3.836466] [<a000000100570030>] sys_getcwd+0x250/0x420
>>>> [ 3.836466] sp=e0000001000cfe00
>>>> bsp=e0000001000c92c8
>>>> [ 3.836466] [<a00000010000c860>] ia64_ret_from_syscall+0x0/0x20
>>>> [ 3.836466] sp=e0000001000cfe30
>>>> bsp=e0000001000c92c8
>>>> [ 3.836466] [<a000000000040720>] ia64_ivt+0xffffffff00040720/0x400
>>>> [ 3.836466] sp=e0000001000d0000
>>>> bsp=e0000001000c92c8
>>>>
>>>> This suggests the bug might be in the logic validating these buffers
>>>> against the allocations (heap, span, etc).
>>>>
>>>> I don’t know why hardened_usercopy=off is not being observed by the
>>>> kernel. As a work-around I am compiling myself a new kernel with
>>>> CONFIG_HARDENED_USERCOPY disabled at the source.
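One cheap thing to rule out first is that the option never reached the running kernel at all. Below is a minimal Python sketch for that, assuming /proc/cmdline is readable on the box. The helper name `usercopy_param` is made up for illustration and is not anything from the kernel tree. It only reports what was passed on the command line; it says nothing about whether the kernel honoured it.

```python
from typing import Optional


def usercopy_param(cmdline: str) -> Optional[str]:
    """Return the value of the last hardened_usercopy= token, or None.

    Hypothetical helper: it inspects the command-line string only and
    does not tell us what the kernel did with the option.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("hardened_usercopy="):
            # In this sketch a later token overwrites an earlier one.
            value = token.split("=", 1)[1]
    return value


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(usercopy_param(f.read()))
    except OSError as exc:
        print(f"could not read /proc/cmdline: {exc}")
```

If this prints "off" on a kernel that still bugchecks, the option was delivered and the problem is further down, in how it is parsed or applied.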
So, I finished compiling my kernel with CONFIG_HARDENED_USERCOPY disabled. Guess what:

pmsjt@debian:~$ uname -a
Linux debian 5.17.3-rt17 #2 SMP Mon Apr 25 16:55:00 PDT 2022 ia64 GNU/Linux

Yup, the system starts just fine with the most recent kernel.

So, two things we can infer from this:

- Yes, usercopy validation appears to be broken. How broken is not yet known; we’ll have to investigate to see which part of the validation is failing.
- hardened_usercopy=off seems to be ignored by current kernels. When passing this option the system was still failing just the same.

>>>>
>>>
>>> Even with kernel "Linux debian 4.19.0-5-mckinley #1 SMP Debian 4.19.37-5
>>> (2019-06-19) ia64 GNU/Linux"
>>>
>>> Things are still not 100%. After a few hours into building the kernel it
>>> started crashing also with usercopy validations but, this time, the other
>>> way around. And because it was the other way around, it led to process
>>> termination instead of full-blown bugcheck. This could be related or not.
>>> Could very well be a different bug that happens to manifest itself around
>>> the same validation.
>>>
>>> CC [M] drivers/net/wireless/realtek/rtw88/rtw8822be.o
>>> LD [M] drivers/net/wireless/realtek/rtw88/rtw88_8822be.o
>>> CC [M] drivers/net/wireless/realtek/rtw88/rtw8822c.o
>>> Segmentation fault
>>> make[5]: *** [scripts/Makefile.build:293:
>>> drivers/net/wireless/realtek/rtw88/rtw8822c.o] Error 139
>>> make[5]: *** Deleting file 'drivers/net/wireless/realtek/rtw88/rtw8822c.o'
>>> make[4]: *** [scripts/Makefile.build:555:
>>> drivers/net/wireless/realtek/rtw88] Error 2
>>> make[3]: *** [scripts/Makefile.build:555: drivers/net/wireless/realtek]
>>> Error 2
>>> make[2]: *** [scripts/Makefile.build:555: drivers/net/wireless] Error 2
>>> make[1]: *** [scripts/Makefile.build:555: drivers/net] Error 2
>>> make: *** [Makefile:1855: drivers] Error 2
>>> pmsjt@debian:~/linux-source-5.17$ make
>>>
>>> Message from syslogd@debian at Apr 25 07:58:08 ...
>>> kernel:[23420.984012] usercopy: Kernel memory overwrite attempt detected to
>>> linear kernel text (offset 1916912, size 8)!
>>>
>>> Message from syslogd@debian at Apr 25 07:58:08 ...
>>> kernel:[23421.268009] usercopy: Kernel memory overwrite attempt detected to
>>> linear kernel text (offset 1818608, size 8)!
>>> HOSTCC scripts/sign-file
>>> CALL scripts/checksyscalls.sh
>>> <stdin>:1517:2: warning: #warning syscall clone3 not implemented [-Wcpp]
>>> CALL scripts/atomic/check-atomics.sh
>>> CHK include/generated/compile.h
>>> make[2]: *** [scripts/Makefile.build:294: arch/ia64/kernel/signal.o]
>>> Segmentation fault
>>>
>>> Message from syslogd@debian at Apr 25 07:58:11 ...
>>> kernel:[23423.626254] usercopy: Kernel memory overwrite attempt detected to
>>> linear kernel text (offset 1933296, size 8)!
>>> make[1]: *** [scripts/Makefile.build:555: arch/ia64/kernel] Error 2
>>> make: *** [Makefile:1855: arch/ia64] Error 2
>>
>
> Hi Sergei
>
>> In my understanding hardened_usercopy=on is completely broken on ia64
>> today. It can't run any userspace. Even init process would not survive
>> machine boot. At least that's what I experienced on rx3600.
>>
>> Thus I think if your system survives that much time I would guess
>> that you have hardened_usercopy=off in full effect at least at boot.
>>
>
> I want to make sure there is no confusion here. My system only 'survives'
> this much when I am using the 4.19 kernel (even when the
> hardened_usercopy=off is not present). With kernels more recent than that the
> system will bugcheck very early on boot even if hardened_usercopy=off is
> present.
>
>> I would speculate it's some kind of memory corruption around
>> 'bypass_usercopy_checks' key.
>>
>> Worth adding a few printk()s to mm/usercopy.c into 'usercopy_abort()'
>> and into 'set_hardened_usercopy()' just to make sure 'bypass_usercopy_checks'
>> has expected 'true' setting at boot time and at crash time.
>
> Right - we definitely need more context about the root cause and
> characteristics of the bug. When the failure happens, is the (pointer, range)
> of the copy really out-of-whack, or is the validation code not making sense
> of the boundaries and over-actively failing?
>
>>
>> --
>>
>> Sergei
>
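To make those two hypotheses concrete, here is a toy model of the kind of range test in question. It is a Python sketch and not the code in mm/usercopy.c: the function names and the region bounds are invented for illustration, and the half-open interval convention is an assumption. Only the offset and size are real; they are the ones from the first syslog message above.

```python
def overlaps(ptr: int, n: int, low: int, high: int) -> bool:
    """True if the copy [ptr, ptr + n) intersects the region [low, high)."""
    return not (ptr >= high or ptr + n <= low)


def classify(ptr: int, n: int, low: int, high: int) -> str:
    """Label a copy relative to a protected region (illustrative only)."""
    if overlaps(ptr, n, low, high):
        return f"overlap at offset {ptr - low}, size {n}"
    return "no overlap"


# Made-up region bounds; only the offset and size come from the log.
TEXT_LOW = 0x1000_0000
TEXT_HIGH = 0x1100_0000

print(classify(TEXT_LOW + 1916912, 8, TEXT_LOW, TEXT_HIGH))
print(classify(TEXT_HIGH, 8, TEXT_LOW, TEXT_HIGH))
```

If the pointers logged at crash time really fall inside the protected region, the check is doing its job and the caller is at fault. If they do not, the bounds the check is comparing against are the suspect.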