Issue 173269
Summary [MLIR][ExecutionEngine]
Labels mlir
Assignees
Reporter tqchen
    
This is a summary issue describing a problem we recently encountered when using the MLIR execution engine for JIT. The conditions under which it fails are quite intricate, so it is hard to produce a minimal reproduction. This issue summarizes what we found in order to share it with the community.
 
## High-level summary of the issue

On AArch64, when we JIT-execute a program whose generated code uses a [GOT](https://en.wikipedia.org/wiki/Global_Offset_Table) (Global Offset Table), the resulting program may segfault. So far we have seen such errors when an LLVM dialect program passes a pointer to a function foo as an argument to another function, so that foo's address is stored in the GOT.
 
## Detailed Explanation 

Specifically, when a binary contains a GOT that is used to hold function pointers, the generated code uses the AArch64 ADRP instruction to compute the GOT location relative to the program counter, in units of pages. The following dump comes from a real segfault case in which three function pointers are looked up.

```
    0xfffe8d681074: adrp   x1, -235137
    0xfffe8d681078: adrp   x2, -235137
    0xfffe8d68107c: adrp   x3, -235137
```

These three instructions form the page addresses used to read function pointers from GOT entries, and the GOT usually resides in a section other than the text section. However, ADRP has a limitation: the PC-relative offset it encodes must be within ±4GB. This is reasonable for normal dynamic shared libraries. If we build the code into a foo.so, the GOT and the code section are loaded together, so as long as foo.so does not exceed 4GB, we won't have an issue.

In the case of JIT execution, the JIT engine can use mmap to request separate pages for the GOT and the text section. A bug can therefore occur when the GOT and text sections are allocated more than 4GB apart in virtual address space. In such cases, the resulting ADRP appears to contain a truncated offset and resolves to the wrong address, resulting in a segfault.
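
The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration, not code from the execution engine. It assumes 4KB pages, a signed 21-bit ADRP page immediate, and that the operand shown in the dump above is a signed page count. The GOT address 5GB below the code is a made-up value chosen to show the wrap-around, and the helper names are hypothetical.

```python
PAGE_SHIFT = 12          # assumption: 4KB pages
PAGE_SIZE = 1 << PAGE_SHIFT
ADRP_IMM_BITS = 21       # ADRP encodes a signed 21-bit page offset (about ±4GB)


def sign_extend(value, bits):
    """Keep the low `bits` bits of value and interpret them as signed."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def adrp_result(pc, imm_pages):
    """Address an ADRP at `pc` produces for a given page immediate."""
    return (pc & ~(PAGE_SIZE - 1)) + imm_pages * PAGE_SIZE


def truncated_adrp_imm(pc, target):
    """Page immediate that results if the page delta is silently truncated."""
    delta_pages = (target >> PAGE_SHIFT) - (pc >> PAGE_SHIFT)
    return sign_extend(delta_pages, ADRP_IMM_BITS)


# First instruction of the dump: adrp x1, -235137 at 0xfffe8d681074.
pc = 0xFFFE8D681074
print(hex(adrp_result(pc, -235137)))   # 0xfffe54000000

# Hypothetical GOT page placed 5GB below the code page (out of ADRP range).
got = (pc & ~(PAGE_SIZE - 1)) - 5 * (1 << 30)
imm = truncated_adrp_imm(pc, got)
print(hex(got))                        # 0xfffd4d681000 (intended target)
print(hex(adrp_result(pc, imm)))       # 0xffff4d681000 (what ADRP computes)
```

In this made-up case the truncated immediate points 3GB above the code instead of 5GB below it, so the subsequent load reads from an unrelated address.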

Usually this also won't be an issue if the mmap requests return contiguous memory. However, a modern OS can apply ASLR ([address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization)), which increases fragmentation and therefore the chance that the GOT and text end up more than 4GB apart in virtual address space.

Because of the combination of ASLR and ADRP, we observe such segfaults quite frequently in real-world machine learning system use cases.

## Possible Actions and Mitigations

This section contains notes about possible actions and mitigations that we are aware of. They are not mutually exclusive. As of now, we have mitigated the issue via A0. A1 and A2 are listed as recommendations to the community.

**A0: Work around the issue on the compiler side.** Because the execution engine comes with this limitation, DSL compilers that target it can work around the issue themselves. Specifically, here is one approach that worked for our case:

1. For each function pointer, create a new constant global variable that is initialized with the function's address, and mark its section as `.text` so that it is placed in the code section.
2. When a function pointer needs to be passed, emit a volatile load from the corresponding global (so that the load does not get optimized away).

This approach eliminates the ADRP into the GOT. The function pointers are instead resolved through an absolute 64-bit relocation (`R_AARCH64_ABS64`). It can be adopted by compilers that control their LLVM codegen.

**A1: Migrate MLIRExecutionEngine from RuntimeDyld to JITLink.** It is still desirable to fix the execution engine directly, as the GOT is a pretty common thing in ELF. I am not very familiar with this area, but generative AI tools seem to suggest that JITLink would help resolve some of these issues thanks to a smarter placement strategy. There is also an ongoing tracking request here: https://github.com/llvm/llvm-project/issues/170647

**A2: More robust checks in ADRP linking.** When an object file contains ADRP, it is desirable to check the address offset calculation to ensure that it does not go out of range, and to produce a clear error at link time.
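
As a rough illustration of the kind of check A2 asks for, here is a Python sketch. It is not RuntimeDyld or JITLink code. The function and exception names are hypothetical, and it assumes 4KB pages and a signed 21-bit ADRP page immediate.

```python
PAGE_SHIFT = 12                    # assumption: 4KB pages
ADRP_MIN_PAGES = -(1 << 20)        # lowest value of a signed 21-bit immediate
ADRP_MAX_PAGES = (1 << 20) - 1     # highest value of a signed 21-bit immediate


class AdrpOutOfRangeError(Exception):
    """Raised when an ADRP target page cannot be encoded (hypothetical name)."""


def checked_adrp_page_delta(pc, target):
    """Return the ADRP page immediate, or fail loudly instead of truncating."""
    delta = (target >> PAGE_SHIFT) - (pc >> PAGE_SHIFT)
    if not ADRP_MIN_PAGES <= delta <= ADRP_MAX_PAGES:
        raise AdrpOutOfRangeError(
            f"ADRP at {pc:#x} cannot reach {target:#x}: "
            f"page delta {delta} is outside "
            f"[{ADRP_MIN_PAGES}, {ADRP_MAX_PAGES}]"
        )
    return delta


# In range: returns the immediate to encode.
print(checked_adrp_page_delta(0xFFFE8D681074, 0xFFFE54000000))

# Out of range (made-up target 5GB away): reports an error at link time
# instead of producing an instruction that points to the wrong page.
try:
    checked_adrp_page_delta(0xFFFE8D681074, 0xFFFD4D681000)
except AdrpOutOfRangeError as err:
    print(err)
```

With a check like this, the failure would surface when the object is linked, with both addresses in the message, rather than as a segfault at run time.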
