John Ogness <john.ogn...@linutronix.de> writes:

> On 2021-01-22, Sven Schnelle <sv...@linux.ibm.com> wrote:
>>
>> So somehow the pointer for console_drivers changes.
>>
>> I can't provide the normal kernel crash output as printk is no longer
>> working,
>
> I don't understand what you mean here. The crash tool can dump the
> printk buffer.
>
> I would be curious to see what the messages look like. It would also be
> helpful to know the last message you saw on the console.
>

The last message is:

[ 1845.640466] User process fault: interruption code 0007 ilc:2 in libm.so[7d78c000+a3000]
[ 1845.640474] CPU: 9 PID: 859915 Comm: ld.so.1 Not tainted 5.11.0-20210122.rc4.git0.9f29bd8b2e71.300.fc33.s390x+git #1
[ 1845.640476] Hardware name: IBM 8561 T01 703 (LPAR)
[ 1845.640478] User PSW : 0705300080000000 000000007d7983b2
[ 1845.640480]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:3 PM:0 RI:0 EA:1
[ 1845.640483] User GPRS: 0000000000000000 000000007d798360 0000000000000010 000003ff00000004
[ 1845.640485]            000000007f854dd0 000000007d7e9788 000003ff00401b30 000000007f854ff8
[ 1845.640487]            000003ff00000000 000003ff7d5b4da0 000003ff00000000 000003ff00000010
[ 1845.640489]            000003ff004022c6 000003ff004024b0 0000000080401796 000000007f854de8

That's also what's written to the console. Please note that the output
above is expected. It's only reporting that a user space program
performed an invalid FPU operation. The kernel crash is written neither
to the console nor to the dmesg buffer.

>> kmsg_dump_rewind+290 is:
>>
>> if (!(con->flags & CON_ENABLED))
>>         continue;
>>
>> in kernel/printk/printk.c:1845
>>
>> I haven't dived into whether our show_code() is doing something wrong
>> which was covered in the past or whether that's because of the patch
>> above but wanted to make you aware of that in case you have an idea.
>> Otherwise i have to dig into the code.
>
> Unless we are dealing with very long printk messages that normally get
> truncated (800+ characters) this patch simply adds a string
> terminator. I do not see how that could possibly cause this kind of
> damage.
>
> When this triggers, there is nothing happening with consoles registering
> or deregistering, right?

That's correct. No registering/unregistering taking place.

> The string terminator (kernel/printk/printk.c:1402) is only added for
> paranoia. If you comment that out, this patch will have no effect (for
> "normal" length lines).
> I would be curious if that somehow makes a
> difference for you.

I was able to reproduce it in a virtual machine where I have a few more
ways to debug. What I got was:

01: -> 00000000001B8814" MVI 92001000 >> 000000000163F1CD CC 2

That's a watchpoint telling me that the code at 0x1b8814 wants to store
one byte to 0x163f1cd, which is the second byte of console_drivers.

gdb tells me about 0x1b8814:

(gdb) list *(0x1b8814)
0x1b8814 is in record_print_text (/home/svens/ibmgit/linux/kernel/printk/printk.c:1402).
1397	 * If a buffer was provided, it will be terminated. Space for the
1398	 * string terminator is guaranteed to be available. The terminator is
1399	 * not counted in the return value.
1400	 */
1401	if (buf_size > 0)
1402		text[len] = 0;
1403
1404	return len;
1405	}
1406

In s390 assembly, this is the store:

   0x00000000001b8810 <+0x130>: la      %r1,0(%r7,%r9)
   0x00000000001b8814 <+0x134>: mvi     0(%r1),0

The CPU registers at the time of the write:

01: GRG  0 =  0000000000000020  000000000163F1CD
01: GRG  2 =  0000000000000042  000003E000623A98
01: GRG  4 =  000000000000000E  000000000087BB70
01: GRG  6 =  0000000000000400 *0000000000000224*
01: GRG  8 =  000000000000000F *000000000163EFA9*
01: GRG 10 =  0000000000000033  0000000000000000
01: GRG 12 =  00000000809AE000  000003E000623A98
01: GRG 14 =  00000000001B884C  000003E0006239E8

So r9 is 000000000163EFA9, which seems to be the start of the current
message:

0163EFA9: *[ 24.069514] 000000007d7af3c2: b30d0002..debr.%f0,%f2.debr.%f0,%f2

while r7 is the offset, and that one is way too big: 0000000000000224.

If you add that to 0163EFA9, you get 0163F1CD, the address from the
watchpoint. So we're outside of the buffer and overwriting the memory
after it, which is console_drivers.
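To make the arithmetic explicit, here is a small userspace sketch in C.
It is not kernel code: store_target() just mirrors what the la/mvi pair
computes from r9 and r7, and terminate_checked() is a hypothetical guard
(the name and the len >= buf_size check are my assumptions, not something
in printk.c) showing what a bounds check on the terminator store could
look like.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Address hit by "text[len] = 0": base of the text plus the offset.
 * Mirrors "la %r1,0(%r7,%r9)" with text in r9 and len in r7.
 */
static uintptr_t store_target(uintptr_t text, size_t len)
{
	return text + len;
}

/*
 * Hypothetical guard (assumption, not taken from the kernel source):
 * only write the terminator if the index is inside the buffer.
 * Returns true if the terminator was written.
 */
static bool terminate_checked(char *text, size_t buf_size, size_t len)
{
	if (len >= buf_size)
		return false;	/* would write past the end of the buffer */

	text[len] = '\0';
	return true;
}
```

With the values from the register dump, store_target(0x163EFA9, 0x224)
gives 0x163F1CD, which is the address the watchpoint reported.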
The message it's trying to print is rather long, so the 0x224 could be
the size of the whole line. This is not from the same crash, but this is
what the message it's trying to print looks like:

[ 23.960773] User Code: 000000007d7af3ba: 78005000      le   %f0,0(%r5)
[ 23.960773]            000000007d7af3be: 78205004      le   %f2,4(%r5)
[ 23.960773]           #000000007d7af3c2: b30d0002      debr %f0,%f2
[ 23.960773]           >000000007d7af3c6: a7280000      lhi  %r2,0
[ 23.960773]            000000007d7af3ca: 07fe          bcr  15,%r14
[ 23.960773]            000000007d7af3cc: 0707          bcr  0,%r7
[ 23.960773]            000000007d7af3ce: 0707          bcr  0,%r7
[ 23.960773]            000000007d7af3d0: c050000289ec  larl %r5,000000007d8007a8

I'm pretty sure it's related to the way the s390 disassembler prints
the buffer: it uses several pr_cont() calls, and it's crashing in the
last pr_cont("\n") (in arch/s390/kernel/dis.c, show_code()).

I'm trying to get the content of the log buffer and other state. If
there's anything else I can provide, let me know.