> On 2022/Apr/25, at 14:27, Pedro Miguel Justo <pm...@texair.net> wrote:
> 
> 
> 
>> On 2022/Apr/25, at 14:09, Sergei Trofimovich <sly...@gmail.com> wrote:
>> 
>> On Mon, 25 Apr 2022 15:07:58 +0000
>> Pedro Miguel Justo <pm...@texair.net> wrote:
>> 
>>>> On 2022/Apr/25, at 01:22, Pedro Miguel Justo <pm...@texair.net> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On 2022/Apr/25, at 01:14, Frank Scheiner <frank.schei...@web.de> wrote:
>>>>> 
>>>>> Hi guys,
>>>>> 
>>>>> On 25.04.22 10:09, John Paul Adrian Glaubitz wrote:  
>>>>>>> From what I can understand by the information in the bugcheck, this is 
>>>>>>> somewhat related to a violation
>>>>>>> in parameter copy from user to kernel during some boot-time, crypto, 
>>>>>>> self-test. Does that sound right?
>>>>>>> If that is the case, how would this be related to FW?  
>>>>>> 
>>>>>> I'm not claiming that it must be related to the firmware, I'm just 
>>>>>> saying that I don't see this problem
>>>>>> on my RX2660 at all and I have even reinstalled it recently with one of 
>>>>>> the latest firmware images
>>>>>> without having to pass any parameter to the command line.  
>>>>> 
>>>>> A difference between Adrian's rx2660 and Pedro's rx2660 is Montecito
>>>>> left and Montvale right.
>>>>> 
>>>>> But could still be multiple other reasons we haven't looked at yet in
>>>>> detail:
>>>>> 
>>>>> * amount of memory installed
>>>>> * SMT enabled or not
>>>>> * number of processor modules installed
>>>>> 
>>>>> It might be possible for me to check on my rx2660s (one with Montvale
>>>>> and one with Montecito(s)) tomorrow. I will then also look at my other
>>>>> Itanium gear and gather relevant information.
>>>>> 
>>>> 
>>>> Yes, this sounds more likely to me too.
>>>> 
>>>> The crypto self-tests seem to be an innocent bystander here. I tried 
>>>> booting the most recent kernel with the option “cryptomgr.notests” and it 
>>>> went much farther. Alas it still failed with another buffer copy 
>>>> validation for a different caller altogether:
>>>> 
>>>> [    3.836466]  [<a000000101353690>] usercopy_abort+0x120/0x130
>>>> [    3.836466]                                 sp=e0000001000cfdf0 bsp=e0000001000c9388
>>>> [    3.836466]  [<a0000001004c5660>] __check_object_size+0x3c0/0x420
>>>> [    3.836466]                                 sp=e0000001000cfe00 bsp=e0000001000c9350
>>>> [    3.836466]  [<a000000100570030>] sys_getcwd+0x250/0x420
>>>> [    3.836466]                                 sp=e0000001000cfe00 bsp=e0000001000c92c8
>>>> [    3.836466]  [<a00000010000c860>] ia64_ret_from_syscall+0x0/0x20
>>>> [    3.836466]                                 sp=e0000001000cfe30 bsp=e0000001000c92c8
>>>> [    3.836466]  [<a000000000040720>] ia64_ivt+0xffffffff00040720/0x400
>>>> [    3.836466]                                 sp=e0000001000d0000 bsp=e0000001000c92c8
>>>> 
>>>> This suggests the bug might be in the logic validating these buffers 
>>>> against the allocations (heap, span, etc).
>>>> 
>>>> I don’t know why hardened_usercopy=off is not being observed by the 
>>>> kernel. As a work-around I am compiling myself a new kernel with 
>>>> CONFIG_HARDENED_USERCOPY disabled at the source. 

So, I finished compiling my kernel with CONFIG_HARDENED_USERCOPY disabled. 
Guess what:

pmsjt@debian:~$ uname -a
Linux debian 5.17.3-rt17 #2 SMP Mon Apr 25 16:55:00 PDT 2022 ia64 GNU/Linux

Yup, the system starts just fine with the most recent kernel. So, two things we 
can infer from this:
- Yes, usercopy validation appears to be broken. How broken is still unknown; 
we’ll have to investigate to see which part of the validation is failing.
- hardened_usercopy=off seems to be ignored by current kernels. When passing 
this option the system was still failing just the same.


>>>> 
>>> 
>>> Even with kernel "Linux debian 4.19.0-5-mckinley #1 SMP Debian 4.19.37-5 
>>> (2019-06-19) ia64 GNU/Linux"
>>> 
>>> Things are still not 100%. After a few hours into building the kernel it 
>>> started crashing also with usercopy validations but, this time, the other 
>>> way around. And because it was the other way around, it led to process 
>>> termination instead of a full-blown bugcheck. This could be related or not. 
>>> It could very well be a different bug that happens to manifest itself around 
>>> the same validation.
>>> 
>>> CC [M]  drivers/net/wireless/realtek/rtw88/rtw8822be.o
>>> LD [M]  drivers/net/wireless/realtek/rtw88/rtw88_8822be.o
>>> CC [M]  drivers/net/wireless/realtek/rtw88/rtw8822c.o
>>> Segmentation fault
>>> make[5]: *** [scripts/Makefile.build:293: 
>>> drivers/net/wireless/realtek/rtw88/rtw8822c.o] Error 139
>>> make[5]: *** Deleting file 'drivers/net/wireless/realtek/rtw88/rtw8822c.o'
>>> make[4]: *** [scripts/Makefile.build:555: 
>>> drivers/net/wireless/realtek/rtw88] Error 2
>>> make[3]: *** [scripts/Makefile.build:555: drivers/net/wireless/realtek] 
>>> Error 2
>>> make[2]: *** [scripts/Makefile.build:555: drivers/net/wireless] Error 2
>>> make[1]: *** [scripts/Makefile.build:555: drivers/net] Error 2
>>> make: *** [Makefile:1855: drivers] Error 2
>>> pmsjt@debian:~/linux-source-5.17$ make
>>> 
>>> Message from syslogd@debian at Apr 25 07:58:08 ...
>>> kernel:[23420.984012] usercopy: Kernel memory overwrite attempt detected to 
>>> linear kernel text (offset 1916912, size 8)!
>>> 
>>> Message from syslogd@debian at Apr 25 07:58:08 ...
>>> kernel:[23421.268009] usercopy: Kernel memory overwrite attempt detected to 
>>> linear kernel text (offset 1818608, size 8)!
>>> HOSTCC  scripts/sign-file
>>> CALL    scripts/checksyscalls.sh
>>> <stdin>:1517:2: warning: #warning syscall clone3 not implemented [-Wcpp]
>>> CALL    scripts/atomic/check-atomics.sh
>>> CHK     include/generated/compile.h
>>> make[2]: *** [scripts/Makefile.build:294: arch/ia64/kernel/signal.o] 
>>> Segmentation fault
>>> 
>>> Message from syslogd@debian at Apr 25 07:58:11 ...
>>> kernel:[23423.626254] usercopy: Kernel memory overwrite attempt detected to 
>>> linear kernel text (offset 1933296, size 8)!
>>> make[1]: *** [scripts/Makefile.build:555: arch/ia64/kernel] Error 2
>>> make: *** [Makefile:1855: arch/ia64] Error 2
>> 
> 
> Hi Sergei
> 
>> In my understanding hardened_usercopy=on is completely broken on ia64
>> today. It can't run any userspace. Even init process would not survive
>> machine boot. At least that's what I experienced on rx3600.
>> 
>> Thus I think if your system survives that much time I would guess
>> that you have hardened_usercopy=off in full effect at least at boot.
>> 
> 
> I want to make sure there is no confusion here. My system only ‘survives’ 
> this much when I am using the 4.19 kernel (even when 
> hardened_usercopy=off is not present). With kernels more recent than that the 
> system will bugcheck very early on boot even if hardened_usercopy=off is 
> present.
> 
>> I would speculate it's some kind of memory corruption around
>> 'bypass_usercopy_checks' key.
>> 
>> Worth adding a few printk()s to mm/usercopy.c into 'usercopy_abort()'
>> and into 'set_hardened_usercopy()' just to make sure 'bypass_usercopy_checks'
>> has expected 'true' setting at boot time and at crash time.
> 
> Right - we definitely need more context about the root cause and 
> characteristics of the bug. When the failure happens, is the (pointer, range) 
> of the copy really out of whack, or is the validation code misreading the 
> boundaries and failing overzealously?
> 
>> 
>> -- 
>> 
>> Sergei
> 
