We received a report of boot failure on stable/4.14.y using Clang with CONFIG_UNWINDER_FRAME_POINTER. Turns out, this was a cascaded failure with at least 4 different things going wrong. Working backwards from the failure:
4) There was no fixup handler in backtrace-clang.S for a specific address calculation. If an indirect access to that address triggers a page fault for which no corresponding fixup exists, then a panic is triggered. Panicking triggers another unwind, and all this repeats in an infinite loop. If the unwind was started from within start_kernel(), this results in a kernel that does not boot. We can install a fixup handler to fix the infinite loop, but why was the unwinder accessing an address that would trigger a fault? 3) The unwinder has multiple conditions to know when to stop unwinding, but checking for a valid address in the link register was not one of them. If there was a value for lr that we could check for before using it, then we could add that as another stopping condition to terminate unwinding. But the garbage value in lr in the case of save_stack() wasn't particularly noteworthy from any other address; it was ambiguous whether we had more frames to continue unwinding through or not, but what value would we check for? 2) When following a frame chain, we can generally follow the addresses pushed onto the stack from the link register, lr. The callee generally pushes this value. For our particular failure, the value in the link register upon entry to save_stack() was garbage. The caller, __mmap_switched, does a tail call into save_stack() since we don't plan to return control flow back to __mmap_switched. It uses a `b` (branch) instruction rather than a `bl` (branch+link) which is correct, since there are no instructions after the `b save_stack` in __mmap_switched. If we interpret the value of lr that was pushed on the stack in save_stack(), then it appears that we have further frames to unwind. When observing an unwind on mainline though, lr upon entry to save_stack() was 0x00! It turns out that this exact ambiguity was found+fixed already by upstream commit 59b6359dd92d ("ARM: 8702/1: head-common.S: Clear lr before jumping to start_kernel()") which landed in 4.15-rc1 but was not yet backported to stable/4.14.y. Sent to stable in: https://lore.kernel.org/stable/20200730180340.1724137-1-ndesaulni...@google.com/T/#u That gives us a value in lr upon entry to save_stack() that's noteworthy as a terminal condition during unwinding. But why did we start unwinding in start_kernel() in the first place? 1) A simple WARN_ON_ONCE was being triggered during start_kernel() due to another patch that landed in v4.15-rc9 but wasn't backported to stable/4.14.y. Sent to stable in: https://lore.kernel.org/stable/20200727191746.3644844-1-ndesaulni...@google.com/T/#u Read (or unwound; pun intended) in the order 1), 2), 3), 4) explains the cascading chain of failures that lead to a kernel not booting. Patch 0001 fixes 3) by adding a test for NULL to the conditions to stop unwinding. This prevents the cascade from going further. Patch 0002 fixes 4) by adding a fixup handler. It's not strictly necessary right now, but I get the feeling that we might not be able to trust the value of the link register pushed on the stack. I'm guessing a stack buffer overflow could overwrite this value. Either way, triggering an exception during unwind should be prevented at all costs to avoid infinite loops. Patches 0003/0004 are cleanup/bikeshed, feel free to NACK them and I don't mind dropping them. They're just minor touchups I felt helped readability from when I was debugging these. 0001 (and slightly so 0002) are the only patches I really care about. Nick Desaulniers (4): ARM: backtrace-clang: check for NULL lr ARM: backtrace-clang: add fixup for lr dereference ARM: backtrace-clang: give labels more descriptive names ARM: backtrace: use more descriptive labels arch/arm/lib/backtrace-clang.S | 34 +++++++++++++++++++++------------- arch/arm/lib/backtrace.S | 30 +++++++++++++++--------------- 2 files changed, 36 insertions(+), 28 deletions(-) -- 2.28.0.163.g6104cc2f0b6-goog