I've been chasing down the following flaky splat, introduced by recent
changes in BTF generation [1]:

  ------------[ cut here ]------------
  BUG: unable to handle page fault for address: ffa000000233d828
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 100000067 P4D 100253067 PUD 100258067 PMD 0
  Oops: Oops: 0000 [#1] SMP NOPTI
  CPU: 1 UID: 0 PID: 390 Comm: test_progs Tainted: G        W  OE       6.19.0-rc1-gf785a31395d9 #331 PREEMPT(full)
  Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-4.el9 04/01/2014
  RIP: 0010:simplify_symbols+0x2b2/0x480
  [    9.737179] Code: 85 f6 4d 89 f7 b8 01 00 00 00 4c 0f 44 f8 49 83 fd f0 4d 0f 44 fe 75 5b 4d 85 ff 0f 85 76 ff ff ff eb 50 49 8b 4e 20 c1 e0 06 <48> 8b 44 01 10 9 cf fd ff ff 49 89 c5 eb 36 49 c7 c5
  RSP: 0018:ffa00000017afc40 EFLAGS: 00010216
  RAX: 00000000003fffc0 RBX: 0000000000000002 RCX: ffa0000001f3d858
  RDX: ffffffffc0218038 RSI: ffffffffc0218008 RDI: aaaaaaaaaaaaaaab
  RBP: ffa00000017afd18 R08: 0000000000000072 R09: 0000000000000069
  R10: ffffffff8160d6ca R11: 0000000000000000 R12: ffa0000001f3d577
  R13: ffffffffc0214058 R14: ffa00000017afdc0 R15: ffa0000001f3e518
  FS:  00007f1c638654c0(0000) GS:ff1100089b7bc000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffa000000233d828 CR3: 000000010ba1f001 CR4: 0000000000771ef0
  PKRU: 55555554
  Call Trace:
   <TASK>
   ? __kmalloc_node_track_caller_noprof+0x37f/0x740
   ? __pfx_setup_modinfo_srcversion+0x10/0x10
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? kstrdup+0x4a/0x70
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? setup_modinfo_srcversion+0x1a/0x30
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? setup_modinfo+0x12b/0x1e0
   load_module+0x133a/0x1610
   __x64_sys_finit_module+0x31b/0x450
   ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
   do_syscall_64+0x80/0x2d0
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? exc_page_fault+0x95/0xc0
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f1c63a2582d
  [    9.794028] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff 8 8b 0d bb 15 0f 00 f7 d8 64 89 01 48
  RSP: 002b:00007ffe513df128 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1c63a2582d
  RDX: 0000000000000000 RSI: 0000000000ee83f9 RDI: 0000000000000016
  RBP: 00007ffe513df150 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffe513e3588
  R13: 000000000088fad0 R14: 00000000014bddb0 R15: 00007f1c63ba7000
   </TASK>
  Modules linked in: bpf_testmod(OE)
  CR2: ffa000000233d828
  ---[ end trace 0000000000000000 ]---
  RIP: 0010:simplify_symbols+0x2b2/0x480
  [    9.821595] Code: 85 f6 4d 89 f7 b8 01 00 00 00 4c 0f 44 f8 49 83 fd f0 4d 0f 44 fe 75 5b 4d 85 ff 0f 85 76 ff ff ff eb 50 49 8b 4e 20 c1 e0 06 <48> 8b 44 01 10 9 cf fd ff ff 49 89 c5 eb 36 49 c7 c5
  RSP: 0018:ffa00000017afc40 EFLAGS: 00010216
  RAX: 00000000003fffc0 RBX: 0000000000000002 RCX: ffa0000001f3d858
  RDX: ffffffffc0218038 RSI: ffffffffc0218008 RDI: aaaaaaaaaaaaaaab
  RBP: ffa00000017afd18 R08: 0000000000000072 R09: 0000000000000069
  R10: ffffffff8160d6ca R11: 0000000000000000 R12: ffa0000001f3d577
  R13: ffffffffc0214058 R14: ffa00000017afdc0 R15: ffa0000001f3e518
  FS:  00007f1c638654c0(0000) GS:ff1100089b7bc000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffa000000233d828 CR3: 000000010ba1f001 CR4: 0000000000771ef0
  PKRU: 55555554
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled

This has not shown up on BPF CI so far, but I was able to reproduce
it on a particular x86_64 machine using a kernel built with LLVM 20.

The crash happens on an attempt to load one of the BPF selftest modules
(tools/testing/selftests/bpf/test_kmods/bpf_test_modorder_x.ko), which
is used by the kfunc_module_order test.

The reason for the crash is that simplify_symbols() doesn't check for
bounds of the ELF section index:

       for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) {
                const char *name = info->strtab + sym[i].st_name;

                switch (sym[i].st_shndx) {
                case SHN_COMMON:

                [...]

                default:
                        /* Divert to percpu allocation if a percpu var. */
                        if (sym[i].st_shndx == info->index.pcpu)
                                secbase = (unsigned long)mod_percpu(mod);
                        else
  /** HERE --> **/              secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
                        sym[i].st_value += secbase;
                        break;
                }
        }

And in the case I was able to reproduce, the value 0xffff
(SHN_HIRESERVE aka SHN_XINDEX [2]) fell through here.
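
As a sanity check on the splat (not part of the patch), the faulting
address can be recomputed from the register dump. This is a user-space
sketch under my own reading of the registers: I assume RCX holds
info->sechdrs and RAX holds st_shndx << 6, based on the "shl $6" and
"mov 0x10(%rcx,%rax,1)" bytes around the marked instruction. The
helper name sh_addr_slot() is made up for illustration.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* ELF64 layout facts that the index scaling and displacement rely on. */
static_assert(sizeof(Elf64_Shdr) == 64, "index is scaled by 64 (shl $6)");
static_assert(offsetof(Elf64_Shdr, sh_addr) == 0x10, "sh_addr sits at +0x10");

/*
 * Hypothetical helper: the address that gets read when evaluating
 * sechdrs[shndx].sh_addr, given the base address of the section
 * header array.
 */
static uint64_t sh_addr_slot(uint64_t sechdrs, uint16_t shndx)
{
	return sechdrs + (uint64_t)shndx * sizeof(Elf64_Shdr)
		       + offsetof(Elf64_Shdr, sh_addr);
}
```

With sechdrs = 0xffa0000001f3d858 (RCX) and shndx = 0xffff, the index
term is 0x3fffc0 (RAX) and the result is 0xffa000000233d828, which
matches CR2 in the splat.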

Now this code fragment is between 15 and 20 years old, so obviously
it's not expected for a kmodule symbol to have such st_shndx
value. Even so, the kernel probably should fail loading the module
instead of crashing, which is what this patch attempts to fix.

Investigating further, I discovered that the module binary was
corrupted by the `${OBJCOPY} --update-section` operation that updates
the .BTF_ids section data in scripts/gen-btf.sh. This explains how the
bug surfaced after gen-btf.sh was introduced:

  $ llvm-readelf -s --wide bpf_test_modorder_x.ko | grep 'BTF_ID'
  llvm-readelf: warning: 'bpf_test_modorder_x.ko': found an extended symbol index (2), but unable to locate the extended symbol index table
  llvm-readelf: warning: 'bpf_test_modorder_x.ko': found an extended symbol index (3), but unable to locate the extended symbol index table
  llvm-readelf: warning: 'bpf_test_modorder_x.ko': found an extended symbol index (4), but unable to locate the extended symbol index table
       3: 0000000000000000    16 NOTYPE  LOCAL  DEFAULT   RSV[0xffff] __BTF_ID__set8__bpf_test_modorder_kfunc_x_ids
  llvm-readelf: warning: 'bpf_test_modorder_x.ko': found an extended symbol index (16), but unable to locate the extended symbol index table
       4: 0000000000000008     4 OBJECT  LOCAL  DEFAULT   RSV[0xffff] __BTF_ID__func__bpf_test_modorder_retx__44417

vs expected

  $ llvm-readelf -s --wide bpf_test_modorder_x.ko | grep 'BTF_ID'
       3: 0000000000000000    16 NOTYPE  LOCAL  DEFAULT     6 __BTF_ID__set8__bpf_test_modorder_kfunc_x_ids
       4: 0000000000000008     4 OBJECT  LOCAL  DEFAULT     6 __BTF_ID__func__bpf_test_modorder_retx__44417

But why? Updating section data without changing its size is not
supposed to affect section indices, right?

With a bit more testing I confirmed that this is an LLVM-specific
issue (it doesn't reproduce with a GCC kbuild), and that it isn't
deterministic, because in link-vmlinux.sh we also do:

    ${OBJCOPY} --update-section .BTF_ids=${btfids_vmlinux} ${VMLINUX}

However:

  $ llvm-readelf -s --wide ~/workspace/prog-aux/linux/vmlinux | grep 0xffff
  # no output, which is good

So the suspect is the implementation of llvm-objcopy. As it turns out
there is a relevant known bug that explains the flakiness and isn't
fixed yet [3].

[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://man7.org/linux/man-pages/man5/elf.5.html
[3] https://github.com/llvm/llvm-project/issues/168060#issuecomment-3533552952

Signed-off-by: Ihor Solodrai <[email protected]>

---

RFC

Until this llvm-objcopy bug is fixed, we cannot trust it in the
kernel build pipeline. In the short term we have to come up with a
workaround for the .BTF_ids section update and replace the calls to
${OBJCOPY} --update-section with something else.

One potential workaround is to force the use of GNU objcopy (from
binutils) instead of llvm-objcopy when updating the .BTF_ids section.

Alternatively, we could just dd the .BTF_ids data computed by
resolve_btfids at the right offset in the target ELF file.
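
To make the idea concrete, here is a rough C equivalent of such a dd
step. It is only a sketch: overwrite_range() is a made-up name, and
the offset and size are assumed to have been obtained elsewhere. It
rewrites bytes in place and refuses any write that would extend the
file, since the section size must not change.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: overwrite `size` bytes at `offset` in `path`
 * without truncating or growing the file.
 * Returns 0 on success, -1 on any error.
 */
static int overwrite_range(const char *path, off_t offset,
			   const void *data, size_t size)
{
	const unsigned char *p = data;
	struct stat st;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;

	/* The target range must lie entirely inside the existing file. */
	if (fstat(fd, &st) < 0 || offset < 0 || offset > st.st_size ||
	    (off_t)size > st.st_size - offset) {
		close(fd);
		return -1;
	}

	/* pwrite() may write less than requested, so loop until done. */
	while (size > 0) {
		ssize_t n = pwrite(fd, p, size, offset);

		if (n <= 0) {
			close(fd);
			return -1;
		}
		p += n;
		offset += n;
		size -= (size_t)n;
	}
	return close(fd);
}
```

Section headers and symbol tables are never touched this way, so the
section indices stay as the linker wrote them.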

Surprisingly, I couldn't find a good command-line way to read a
section's offset and size from an ELF file in a well-defined output
format. Both readelf and {llvm-}objdump produce human-readable output,
and it appears we can't rely on the column order, for example.

We could still try parsing readelf output with awk/grep, covering
output variants that appear in the kernel build.

We can also do:

   llvm-readobj --elf-output-style=JSON --sections "$elf" | \
        jq -r --arg name .BTF_ids '
            .[0].Sections[] |
            select(.Section.Name.Name == $name) |
            "\(.Section.Offset) \(.Section.Size)"'

...but idk man, doesn't feel right.

The most reliable way to determine the size and offset of the .BTF_ids
section is probably to read them in a C program using libelf, such as
resolve_btfids. Which is quite ironic, given the recent
changes. Setting the irony aside, we could add something like:
         resolve_btfids --section-info=.BTF_ids $elf
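
For illustration, the lookup behind such an option could be as small
as the sketch below. Note the assumptions: it uses the plain <elf.h>
structures instead of libelf to stay dependency-free, it handles only
ELF64 in host byte order with fewer than SHN_LORESERVE sections, and
find_section() is a made-up name, not existing resolve_btfids code.

```c
#include <assert.h>
#include <elf.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: look up a section by name in an ELF64 image
 * held in memory. Returns 0 and fills *off / *size on success,
 * -1 if the image is malformed or the section is absent.
 */
static int find_section(const unsigned char *buf, size_t len, const char *name,
			uint64_t *off, uint64_t *size)
{
	Elf64_Ehdr eh;
	Elf64_Shdr strsec, sh;
	size_t want = strlen(name) + 1;	/* compare the NUL as well */

	if (len < sizeof(eh))
		return -1;
	memcpy(&eh, buf, sizeof(eh));
	if (memcmp(eh.e_ident, ELFMAG, SELFMAG) ||
	    eh.e_ident[EI_CLASS] != ELFCLASS64)
		return -1;
	if (eh.e_shentsize != sizeof(sh) || eh.e_shstrndx >= eh.e_shnum)
		return -1;
	/* The whole section header table must lie inside the buffer. */
	if (eh.e_shoff > len || (len - eh.e_shoff) / sizeof(sh) < eh.e_shnum)
		return -1;

	/* Section names live in the string table at index e_shstrndx. */
	memcpy(&strsec, buf + eh.e_shoff + eh.e_shstrndx * sizeof(sh),
	       sizeof(sh));
	if (strsec.sh_offset > len || strsec.sh_size > len - strsec.sh_offset)
		return -1;

	for (unsigned int i = 1; i < eh.e_shnum; i++) {
		memcpy(&sh, buf + eh.e_shoff + i * sizeof(sh), sizeof(sh));
		/* Skip names that would run past the string table. */
		if (sh.sh_name >= strsec.sh_size ||
		    strsec.sh_size - sh.sh_name < want)
			continue;
		if (memcmp(buf + strsec.sh_offset + sh.sh_name, name, want))
			continue;
		*off = sh.sh_offset;
		*size = sh.sh_size;
		return 0;
	}
	return -1;
}
```

A caller would mmap or read the .ko, call find_section(buf, len,
".BTF_ids", &off, &size), and print the two numbers for the build
script to consume.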

Reverting the gen-btf.sh patch is also a possible workaround, but I'd
really like to avoid it, given that BPF features/optimizations in
development depend on it.

I'd appreciate comments and suggestions on this issue. Thank you!
---
 kernel/module/main.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 710ee30b3bea..5bf456fad63e 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1568,6 +1568,13 @@ static int simplify_symbols(struct module *mod, const struct load_info *info)
                        break;
 
                default:
+                       if (sym[i].st_shndx >= info->hdr->e_shnum) {
+                               pr_err("%s: Symbol %s has an invalid section index %u (max %u)\n",
+                                      mod->name, name, sym[i].st_shndx, info->hdr->e_shnum - 1);
+                               ret = -ENOEXEC;
+                               break;
+                       }
+
                        /* Divert to percpu allocation if a percpu var. */
                        if (sym[i].st_shndx == info->index.pcpu)
                                secbase = (unsigned long)mod_percpu(mod);
-- 
2.52.0

