This is an automated email from the ASF dual-hosted git repository.
tqchen pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm-ffi.git
The following commit(s) were added to refs/heads/main by this push:
new 3c35034 [OrcJIT] Arena JITLinkMemoryManager with GOTPCRELX fix (Linux) (#527)
3c35034 is described below
commit 3c35034fd1026011736e19a4e0e1ed0f22058c42
Author: Yaxing Cai <[email protected]>
AuthorDate: Mon Apr 27 03:31:02 2026 +0800
[OrcJIT] Arena JITLinkMemoryManager with GOTPCRELX fix (Linux) (#527)
## Summary
Adds an arena-based `JITLinkMemoryManager` that eliminates
scattered-mmap relocation overflow in LLVM ORC JIT under ASLR / VA
pressure ([LLVM
#173269](https://github.com/llvm/llvm-project/issues/173269)), plus a
workaround for an x86_64 JITLink GOTPCRELX relaxation bug. Linux only;
other platforms fall back to the default `InProcessMemoryManager`.
### Arena memory manager (`orcjit_memory_manager.{h,cc}`)
- Pre-reserves one contiguous VA region via `mmap` (`PROT_NONE`,
`MAP_NORESERVE`) at session startup and bump-allocates from it,
guaranteeing all JIT allocations stay within PC-relative range (±2 GB
x86_64, ±4 GB AArch64).
- Default capacity: 4 GB (x86_64) / 8 GB (AArch64). On reservation
failure (RLIMIT_AS, containers) the constructor repeatedly halves the
request down to a 256 MB floor (or to the requested size, if smaller).
- **Dual-pool split.** Arena is partitioned at a 2 MB-aligned midpoint
into a non-exec pool (`r--`/`rw-`) and an exec pool (`r-x`). Exec
segments pack tightly into whole 2 MB pages for contiguous r-x layout
and TLB-friendly huge-page promotion. Both pools are capped so
cross-pool Delta32 fixups always resolve inside ±2 GB.
- **Slab commit with THP.** Physical pages are committed in 2 MB slabs,
matching Linux huge page size. `madvise(MADV_HUGEPAGE)` on the full
reservation lets the kernel promote fully-faulted slabs to single TLB
entries.
- **Overflow sections.** Known large absolute-only sections
(`.nv_fatbin`) are routed to separate `mmap()` allocations outside the
arena. Guarded by a two-phase check: name-based candidate selection,
then edge validation that disqualifies any section targeted by a
PC-relative reference.
- **Segment-lifetime handling.** `Finalize`-lifetime pages are freed at
the end of `finalize()`; `Standard`-lifetime pages remain until
`deallocate()`. Free list coalesces adjacent blocks for reuse.
- Decommit is deliberately a no-op: `ELFNixPlatform` deinitializers can
still reference freed allocations during teardown. Freed regions stay
committed and return to the free list for reuse; all memory is reclaimed
by `munmap` in the arena destructor.
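
The pool split and free-list behaviour described above can be modelled in a few lines of Python. This is a toy sketch, not the C++ implementation: `PoolModel`, `split_midpoint`, and the `non_exec_fraction` argument are hypothetical names introduced for illustration, and the fraction used by the real arena is not stated in this message.

```python
import bisect

SLAB = 2 * 1024 * 1024  # 2 MB slab size, as described in the summary


def split_midpoint(exec_limit: int, non_exec_fraction: float) -> int:
    """Pick a slab-aligned boundary between the non-exec and exec pools.

    non_exec_fraction is an assumed tuning parameter (hypothetical value).
    """
    mid = int(exec_limit * non_exec_fraction) // SLAB * SLAB
    if mid == 0:
        mid = SLAB
    if mid >= exec_limit:
        mid = exec_limit - SLAB
    return mid


class PoolModel:
    """Toy model of one pool: best-fit free list with a bump-pointer fallback."""

    def __init__(self, start: int, limit: int) -> None:
        self.bump = start   # next never-used offset
        self.limit = limit  # exclusive upper bound of this pool
        self.free = []      # sorted list of [offset, size] blocks

    def allocate(self, size: int) -> int:
        # Best-fit scan over the free list; the first smallest block wins.
        best = None
        for i, (_, blk_size) in enumerate(self.free):
            if blk_size >= size and (best is None or blk_size < self.free[best][1]):
                best = i
        if best is not None:
            offset, blk_size = self.free[best]
            if blk_size == size:
                del self.free[best]
            else:
                self.free[best] = [offset + size, blk_size - size]
            return offset
        # No reusable block: bump-allocate within the pool limit.
        if self.bump + size > self.limit:
            raise MemoryError("pool exhausted")
        offset = self.bump
        self.bump += size
        return offset

    def release(self, offset: int, size: int) -> None:
        if size == 0:
            return
        i = bisect.bisect_left(self.free, [offset, 0])
        self.free.insert(i, [offset, size])
        # Coalesce with the following block if it is adjacent.
        if i + 1 < len(self.free) and offset + size == self.free[i + 1][0]:
            self.free[i][1] += self.free[i + 1][1]
            del self.free[i + 1]
        # Coalesce with the preceding block if it is adjacent.
        if i > 0 and self.free[i - 1][0] + self.free[i - 1][1] == self.free[i][0]:
            self.free[i - 1][1] += self.free[i][1]
            del self.free[i]
```

Releasing three adjacent blocks in any order leaves a single merged block in the model, which is the property that lets later allocations reuse freed space.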
### GOTPCRELX fix plugin (`orcjit_session.cc`)
- Works around LLVM JITLink's `optimizeGOTAndStubAccesses()` relaxing
`call *foo@GOTPCREL(%rip)` → `addr32 call foo` but tagging the edge as
absolute `Pointer32`. On non-PIE executables with symbols in the low 4
GB, this produces a garbage displacement → SIGSEGV during ORC-runtime
teardown.
- `GOTPCRELXFixPlugin` runs as a `PreFixupPass` after relaxation and
either converts to `BranchPCRel32` when the displacement fits, or
reverts the relaxation (restores `ff 15`/`ff 25` opcodes, retargets the
edge to the GOT entry with `PCRel32`).
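
The choice the plugin makes per relaxed call site can be sketched as a range check. This is a simplified model, assuming the 32-bit displacement is measured from the end of the 4-byte fixup field and ignoring addends; `choose_fixup` and its return labels are hypothetical names, not JITLink API.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def choose_fixup(fixup_addr: int, target_addr: int) -> str:
    """Decide how a relaxed call/jmp site would be repaired.

    fixup_addr is the address of the 4-byte displacement field (assumed
    convention); target_addr is the resolved address of the callee.
    """
    # rel32 branches are relative to the byte after the displacement field.
    disp = target_addr - (fixup_addr + 4)
    if INT32_MIN <= disp <= INT32_MAX:
        return "BranchPCRel32"  # keep the direct branch, fix the edge kind
    return "revert-to-GOT"      # restore the indirect branch through the GOT


# JIT code high in the address space calling a symbol in the low 4 GB:
# the displacement cannot fit in 32 bits, so the relaxation is reverted.
print(choose_fixup(0x7F0000001000, 0x401000))
```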
### Configuration
`ExecutionSession(arena_size=...)` / `arena_size_bytes` C++ arg: `0` =
arch default, `>0` = custom size, `<0` = disable arena. Linux-only;
ignored on macOS/Windows where the arena is compiled out.
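
The meaning of the three `arena_size` ranges can be sketched as a small resolver. This is an illustration only: `resolve_arena_size` and `DEFAULT_ARENA` are hypothetical names, and the real decision is made in C++ at session construction.

```python
GB = 1024**3

# Per-architecture defaults taken from the summary above.
DEFAULT_ARENA = {"x86_64": 4 * GB, "aarch64": 8 * GB}


def resolve_arena_size(arena_size: int, machine: str, is_linux: bool):
    """Return the arena capacity to request, or None when no arena is used."""
    if not is_linux:
        return None  # arena is compiled out on macOS/Windows
    if arena_size < 0:
        return None  # explicitly disabled
    if arena_size == 0:
        # Architecture default; None for machines not listed (assumption).
        return DEFAULT_ARENA.get(machine)
    return arena_size  # custom size in bytes
```

On the Python side, the tests in this commit correspond to a call such as `ExecutionSession(arena_size=16 * 1024 * 1024)` for a 16 MB arena.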
### Tests (`tests/test_arena.py`)
8 arena tests across C/C++/GCC/PIE variants:
- `test_arena_colocation` — objects stay within a small window.
- `test_arena_keeps_objects_close` — scatter baseline under VA blocker
with arena enabled.
- `test_arena_hidden_symbol_with_blocker` — ADRP/PC32 cross-object calls
resolve under VA pressure.
- `test_large_data_section` — 4 MB `.nv_fatbin` loads inside arena when
references are absolute.
- `test_overflow_section_outside_arena` — `.nv_fatbin` routed to
separate mmap, confirmed via address gap.
- `test_dso_handle_relocation_after_failed_materialization` —
`__dso_handle` resolves after prior sessions leaked slabs.
- `test_dso_handle_delta32_with_arena` / `_overflow_without_arena` —
`-fpie` GCC objects under 3 GB VA blocker: with arena → passes; without
arena → Delta32 overflow.
All tests use a 16 MB arena and 256 MB–3 GB VA blockers, safe for CI.
## Test plan
- [x] All orcjit tests pass locally on Linux x86_64 and aarch64
- [ ] CI green on Linux x86_64, Linux aarch64, macOS arm64, Windows
AMD64
- [x] Non-Linux platforms unaffected (arena compiled out under `#ifdef
__linux__`)
---------
Co-authored-by: Yaxing Cai <[email protected]>
---
addons/tvm_ffi_orcjit/CMakeLists.txt | 5 +-
.../python/tvm_ffi_orcjit/session.py | 10 +-
addons/tvm_ffi_orcjit/src/ffi/orcjit_dylib.cc | 4 +-
.../src/ffi/orcjit_memory_manager.cc | 698 +++++++++++++++++++++
.../tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.h | 229 +++++++
addons/tvm_ffi_orcjit/src/ffi/orcjit_session.cc | 204 +++++-
addons/tvm_ffi_orcjit/src/ffi/orcjit_session.h | 11 +-
.../tvm_ffi_orcjit/tests/sources/c/fake_fatbin.c | 64 ++
addons/tvm_ffi_orcjit/tests/sources/c/test_addr.c | 33 +
.../tests/sources/c/test_hidden_caller.c | 61 ++
.../tests/sources/c/test_hidden_helper.c | 53 ++
.../tvm_ffi_orcjit/tests/sources/cc/fake_fatbin.cc | 64 ++
.../tvm_ffi_orcjit/tests/sources/cc/test_addr.cc | 26 +
.../tests/sources/cc/test_hidden_caller.cc | 52 ++
.../tests/sources/cc/test_hidden_helper.cc | 44 ++
addons/tvm_ffi_orcjit/tests/test_arena.py | 674 ++++++++++++++++++++
addons/tvm_ffi_orcjit/tests/utils.py | 28 +-
17 files changed, 2235 insertions(+), 25 deletions(-)
diff --git a/addons/tvm_ffi_orcjit/CMakeLists.txt b/addons/tvm_ffi_orcjit/CMakeLists.txt
index 5281238..9a86bac 100644
--- a/addons/tvm_ffi_orcjit/CMakeLists.txt
+++ b/addons/tvm_ffi_orcjit/CMakeLists.txt
@@ -37,7 +37,10 @@ execute_process(
find_package(tvm_ffi CONFIG REQUIRED)
# ---- Build shared library ----
-add_library(tvm_ffi_orcjit SHARED src/ffi/orcjit_session.cc src/ffi/orcjit_dylib.cc)
+add_library(
+ tvm_ffi_orcjit SHARED src/ffi/orcjit_session.cc src/ffi/orcjit_dylib.cc
+ src/ffi/orcjit_memory_manager.cc
+)
set_target_properties(
tvm_ffi_orcjit PROPERTIES CXX_VISIBILITY_PRESET hidden
VISIBILITY_INLINES_HIDDEN ON
)
diff --git a/addons/tvm_ffi_orcjit/python/tvm_ffi_orcjit/session.py b/addons/tvm_ffi_orcjit/python/tvm_ffi_orcjit/session.py
index 66fb1e4..02fccaf 100644
--- a/addons/tvm_ffi_orcjit/python/tvm_ffi_orcjit/session.py
+++ b/addons/tvm_ffi_orcjit/python/tvm_ffi_orcjit/session.py
@@ -60,19 +60,25 @@ class ExecutionSession(Object):
"""
- def __init__(self, orc_rt_path: str | None = None) -> None:
+ def __init__(self, orc_rt_path: str | None = None, arena_size: int = 0) -> None:
"""Initialize ExecutionSession.
Args:
orc_rt_path: Optional path to the liborc_rt library. If not provided,
it will be automatically discovered using clang.
+ arena_size: Arena size in bytes for the JIT memory manager.
+ Linux only — ignored on macOS and Windows, where the
+ arena is compiled out.
+ 0 = arch default (4GB x86_64, 8GB AArch64; falls back to
+ smaller sizes under RLIMIT_AS), >0 = custom size,
+ <0 = disable arena.
"""
if orc_rt_path is None:
orc_rt_path = _find_orc_rt_library()
if orc_rt_path is None:
orc_rt_path = ""
- self.__init_handle_by_constructor__(_ffi_api.ExecutionSession, orc_rt_path) # type: ignore
+ self.__init_handle_by_constructor__(_ffi_api.ExecutionSession, orc_rt_path, arena_size) # type: ignore
def create_library(self, name: str = "") -> DynamicLibrary:
"""Create a new dynamic library associated with this execution session.
diff --git a/addons/tvm_ffi_orcjit/src/ffi/orcjit_dylib.cc b/addons/tvm_ffi_orcjit/src/ffi/orcjit_dylib.cc
index 82c5290..11a6d52 100644
--- a/addons/tvm_ffi_orcjit/src/ffi/orcjit_dylib.cc
+++ b/addons/tvm_ffi_orcjit/src/ffi/orcjit_dylib.cc
@@ -188,7 +188,9 @@ static void RegisterOrcJITFunctions() {
refl::GlobalDef()
.def("orcjit.ExecutionSession",
- [](const std::string& orc_rt_path) { return ORCJITExecutionSession(orc_rt_path); })
+ [](const std::string& orc_rt_path, int64_t arena_size_bytes) {
+ return ORCJITExecutionSession(orc_rt_path, arena_size_bytes);
+ })
.def("orcjit.ExecutionSessionCreateDynamicLibrary",
[](const ORCJITExecutionSession& session, const String& name) -> Module {
return session->CreateDynamicLibrary(name);
diff --git a/addons/tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.cc b/addons/tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.cc
new file mode 100644
index 0000000..26be4b8
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.cc
@@ -0,0 +1,698 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * \file orcjit_memory_manager.cc
+ * \brief Arena-based JITLinkMemoryManager implementation.
+ *
+ * Follows the InProcessMemoryManager pattern from LLVM but replaces
+ * per-object mmap with bump allocation from a pre-reserved arena.
+ * Pages are committed in 2 MB slabs to enable Transparent Huge Page
+ * (THP) promotion — see the class docstring in orcjit_memory_manager.h.
+ */
+#include "orcjit_memory_manager.h"
+
+#ifdef __linux__
+
+#include <llvm/ADT/DenseSet.h>
+#include <llvm/ADT/SmallVector.h>
+#include <llvm/ExecutionEngine/JITLink/JITLink.h>
+#include <llvm/ExecutionEngine/JITLink/aarch64.h>
+#include <llvm/ExecutionEngine/JITLink/x86_64.h>
+#include <llvm/ExecutionEngine/Orc/Shared/AllocationActions.h>
+#include <llvm/Support/Alignment.h>
+#include <llvm/Support/FormatVariadic.h>
+#include <llvm/Support/Memory.h>
+#include <sys/mman.h>
+
+#include <algorithm>
+#include <cerrno>
+#include <cstdio>
+#include <cstring>
+
+namespace tvm {
+namespace ffi {
+namespace orcjit {
+
+using namespace llvm;
+using namespace llvm::jitlink;
+using namespace llvm::orc;
+
+// ── Overflow section edge classification ───────────────────────────
+//
+// Conservative whitelist: only known absolute relocation kinds return true.
+// Unknown or future edge kinds default to PC-relative → sections stay in
+// the arena (safe: never breaks relocations, just forgoes the overflow
+// optimization for unknown kinds).
+
+namespace {
+
+bool isAbsoluteEdge(const Triple& TT, Edge::Kind K) {
+ if (K < Edge::FirstRelocation) return true; // KeepAlive, Invalid — not a relocation constraint
+ if (TT.isAArch64()) {
+ using namespace llvm::jitlink::aarch64;
+ switch (K) {
+ case Pointer64:
+ case Pointer32:
+ case Pointer64Authenticated:
+ case MoveWide16:
+ return true;
+ default:
+ return false;
+ }
+ }
+ if (TT.isX86()) {
+ using namespace llvm::jitlink::x86_64;
+ switch (K) {
+ case Pointer64:
+ case Pointer32:
+ case Pointer32Signed:
+ case Pointer16:
+ case Pointer8:
+ case Size64:
+ case Size32:
+ return true;
+ default:
+ return false;
+ }
+ }
+ return false; // Unknown arch — treat as PC-relative (safe)
+}
+
+} // namespace
+
+// ── Platform abstraction ────────────────────────────────────────────
+
+void* ArenaJITLinkMemoryManager::reserveVA(size_t size) {
+ void* p = ::mmap(nullptr, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
+ if (p == MAP_FAILED) return nullptr;
+ return p;
+}
+
+void ArenaJITLinkMemoryManager::releaseVA(void* addr, size_t size) {
+ int rc = ::munmap(addr, size);
+ assert(rc == 0 && "munmap failed in arena destructor");
+ (void)rc;
+}
+
+Error ArenaJITLinkMemoryManager::commitPages(void* addr, size_t size) {
+ if (size == 0) return Error::success();
+ // Commit at slab (2 MB) granularity for THP promotion.
+ size_t offset = static_cast<char*>(addr) - arena_base_;
+ size_t first_slab = offset / kSlabSize;
+ size_t last_slab = (offset + size - 1) / kSlabSize;
+
+ for (size_t i = first_slab; i <= last_slab; ++i) {
+ if (slab_committed_[i].load(std::memory_order_acquire) != 0) continue;
+ size_t slab_offset = i * kSlabSize;
+ size_t slab_len = std::min(kSlabSize, arena_capacity_ - slab_offset);
+ // mprotect is idempotent, so a concurrent racer calling it on the same slab
+ // is harmless. Only flip the flag after success — otherwise a failed commit
+ // followed by freeRegion() would leave slab_committed_[i] == 1, causing a
+ // later allocation to skip mprotect and write into PROT_NONE memory.
+ if (::mprotect(arena_base_ + slab_offset, slab_len, PROT_READ | PROT_WRITE) != 0) {
+ return make_error<StringError>(
+ "ArenaJITLinkMemoryManager: mprotect(RW) failed for slab at offset " +
+ formatv("{0:x}", slab_offset) + ": " + std::strerror(errno),
+ inconvertibleErrorCode());
+ }
+ slab_committed_[i].store(1, std::memory_order_release);
+ }
+ return Error::success();
+}
+
+void ArenaJITLinkMemoryManager::decommitPages(void* addr, size_t size) {
+ // Intentionally a no-op for arena pages. The ORC runtime may still reference
+ // deallocated JIT memory during session teardown (e.g., ELFNixPlatform
+ // deinitializers run after some allocations are freed). Decommitting
+ // (MADV_DONTNEED or mprotect PROT_NONE) would cause segfaults or illegal
+ // instructions during shutdown.
+ //
+ // Physical pages stay committed but are returned to the free list for reuse.
+ // The arena destructor releases all VA and physical memory via munmap.
+ (void)addr;
+ (void)size;
+}
+
+Error ArenaJITLinkMemoryManager::protectPages(void* addr, size_t size, MemProt Prot) {
+ int prot = PROT_NONE;
+ if ((Prot & MemProt::Read) != MemProt::None) prot |= PROT_READ;
+ if ((Prot & MemProt::Write) != MemProt::None) prot |= PROT_WRITE;
+ if ((Prot & MemProt::Exec) != MemProt::None) prot |= PROT_EXEC;
+ if (::mprotect(addr, size, prot) != 0) {
+ return make_error<StringError>("ArenaJITLinkMemoryManager: mprotect failed at " +
+ formatv("{0:x}", addr) + " size " + formatv("{0:x}", size) +
+ ": " + std::strerror(errno),
+ inconvertibleErrorCode());
+ }
+ if ((Prot & MemProt::Exec) != MemProt::None) {
+ sys::Memory::InvalidateInstructionCache(addr, size);
+ }
+ return Error::success();
+}
+
+// ── ArenaInFlightAlloc ──────────────────────────────────────────────
+
+class ArenaJITLinkMemoryManager::ArenaInFlightAlloc : public JITLinkMemoryManager::InFlightAlloc {
+ public:
+ // A contiguous region within one pool: [offset, offset + standard_size + finalize_size).
+ // Standard-lifetime bytes come first; Finalize-lifetime bytes follow and are freed
+ // at the end of finalize(). Any field may be 0 to indicate no allocation from
+ // that pool on this call.
+ struct PoolRegion {
+ size_t offset;
+ size_t standard_size;
+ size_t finalize_size;
+ };
+
+ ArenaInFlightAlloc(ArenaJITLinkMemoryManager& MM, LinkGraph& G, BasicLayout BL,
+ PoolRegion non_exec, PoolRegion exec,
+ std::vector<OverflowBlock> overflow_blocks)
+ : MM(MM),
+ G(&G),
+ BL(std::move(BL)),
+ non_exec_(non_exec),
+ exec_(exec),
+ overflow_blocks_(std::move(overflow_blocks)) {}
+
+ ~ArenaInFlightAlloc() override {
+ assert(!G && "ArenaInFlightAlloc destroyed without finalize or abandon");
+ }
+
+ void finalize(OnFinalizedFunction OnFinalized) override {
+ // Apply target protections for each arena segment.
+ if (auto Err = applyProtections()) {
+ OnFinalized(std::move(Err));
+ return;
+ }
+
+ // Apply target protections for overflow blocks.
+ for (auto& ob : overflow_blocks_) {
+ if (auto Err = MM.protectPages(ob.addr, ob.size, ob.prot)) {
+ OnFinalized(std::move(Err));
+ return;
+ }
+ }
+
+ // Run finalization actions (e.g., register EH frames).
+ auto DeallocActions = shared::runFinalizeActions(BL.graphAllocActions());
+ if (!DeallocActions) {
+ OnFinalized(DeallocActions.takeError());
+ return;
+ }
+
+ // Decommit finalize-lifetime pages in each pool — they're no longer needed.
+ for (auto& R : {non_exec_, exec_}) {
+ if (R.finalize_size > 0) {
+ MM.decommitPages(MM.arena_base_ + R.offset + R.standard_size, R.finalize_size);
+ MM.freeRegion(R.offset + R.standard_size, R.finalize_size);
+ }
+ }
+
+#ifndef NDEBUG
+ G = nullptr;
+#endif
+
+ // Create finalized allocation handle. LLVM's FinalizedAlloc stores an
+ // opaque ExecutorAddr (integer), so we must use raw new here. Ownership
+ // transfers to deallocate(), which LLVM guarantees is called for every
+ // finalized allocation.
+ auto* FA = new FinalizedAllocInfo{
+ non_exec_.offset, non_exec_.standard_size, exec_.offset,
+ exec_.standard_size, std::move(*DeallocActions), std::move(overflow_blocks_)};
+ OnFinalized(FinalizedAlloc(ExecutorAddr::fromPtr(FA)));
+ }
+
+ void abandon(OnAbandonedFunction OnAbandoned) override {
+ // Decommit and return each pool's full region to the appropriate free list.
+ for (auto& R : {non_exec_, exec_}) {
+ size_t total = R.standard_size + R.finalize_size;
+ if (total > 0) {
+ MM.decommitPages(MM.arena_base_ + R.offset, total);
+ MM.freeRegion(R.offset, total);
+ }
+ }
+
+ // Release overflow blocks.
+ for (auto& ob : overflow_blocks_) {
+ ::munmap(ob.addr, ob.size);
+ }
+
+#ifndef NDEBUG
+ G = nullptr;
+#endif
+
+ OnAbandoned(Error::success());
+ }
+
+ private:
+ Error applyProtections() {
+ for (auto& KV : BL.segments()) {
+ const auto& AG = KV.first;
+ auto& Seg = KV.second;
+
+ auto SegSize = alignTo(Seg.ContentSize + Seg.ZeroFillSize, MM.page_size_);
+ if (auto Err = MM.protectPages(Seg.WorkingMem, SegSize, AG.getMemProt())) return Err;
+ }
+ return Error::success();
+ }
+
+ ArenaJITLinkMemoryManager& MM;
+ LinkGraph* G;
+ BasicLayout BL;
+ PoolRegion non_exec_;
+ PoolRegion exec_;
+ std::vector<OverflowBlock> overflow_blocks_;
+};
+
+// ── ArenaJITLinkMemoryManager ───────────────────────────────────────
+
+ArenaJITLinkMemoryManager::ArenaJITLinkMemoryManager(size_t page_size, size_t arena_capacity)
+ : arena_base_(nullptr),
+ arena_capacity_(arena_capacity),
+ page_size_(page_size),
+ midpoint_(0),
+ exec_bump_limit_(0),
+ non_exec_bump_(0),
+ exec_bump_(0) {
+ // Try requested capacity, halve on failure down to a minimum floor.
+ // The floor is the smaller of kMinArenaCapacity and the requested size,
+ // so explicit small arenas (e.g. 16 MB for tests) are honoured.
+ // mmap(PROT_NONE | MAP_NORESERVE) can still fail under RLIMIT_AS or
+ // extreme VA fragmentation.
+ size_t floor = std::min(arena_capacity_, kMinArenaCapacity);
+ size_t cap = arena_capacity_;
+ while (cap >= floor) {
+ arena_base_ = static_cast<char*>(reserveVA(cap));
+ if (arena_base_) {
+ arena_capacity_ = cap;
+ // Partition the arena into two pools at a 2 MB-aligned midpoint.
+ // The exec pool starts at midpoint_, which is therefore on a 2 MB
+ // boundary — r-x segments pack into a minimum number of 2 MB pages.
+ //
+ // Constraint: cross-pool displacements (e.g. .text → .rodata via
+ // ADRP+ADD on aarch64) must fit in ±kPCRelReach. The farthest pair
+ // of bytes is (end of exec, start of non-exec), separated by at most
+ // `exec_bump_limit_`, so we cap the exec pool's upper bound at
+ // kPCRelReach even when the VA reservation is larger.
+ exec_bump_limit_ = std::min(cap, kPCRelReach);
+ size_t raw_midpoint = static_cast<size_t>(exec_bump_limit_ * kDefaultNonExecFraction);
+ midpoint_ = (raw_midpoint / kSlabSize) * kSlabSize;
+ if (midpoint_ == 0) midpoint_ = kSlabSize;
+ if (midpoint_ >= exec_bump_limit_) midpoint_ = exec_bump_limit_ - kSlabSize;
+ non_exec_bump_ = 0;
+ exec_bump_ = midpoint_;
+ // Initialize slab commit tracking. make_unique<T[]>(n) value-initializes
+ // the array to zero in C++17.
+ num_slabs_ = (cap + kSlabSize - 1) / kSlabSize;
+ slab_committed_ = std::make_unique<std::atomic<uint8_t>[]>(num_slabs_);
+ // Hint THP promotion for the entire arena. Intentionally unchecked —
+ // MADV_HUGEPAGE is advisory and may fail if THP is disabled system-wide.
+ (void)::madvise(arena_base_, cap, MADV_HUGEPAGE);
+ return;
+ }
+ cap /= 2;
+ }
+ llvm::report_fatal_error("ArenaJITLinkMemoryManager: failed to reserve at least " +
+ Twine(floor / (1024 * 1024)) + " MB of virtual address space");
+}
+
+ArenaJITLinkMemoryManager::~ArenaJITLinkMemoryManager() {
+ if (arena_base_) {
+ releaseVA(arena_base_, arena_capacity_);
+ }
+}
+
+Expected<size_t> ArenaJITLinkMemoryManager::bumpAllocate(size_t size, bool is_exec) {
+ std::lock_guard<std::mutex> Lock(mu_);
+
+ auto& free_list = is_exec ? free_list_exec_ : free_list_non_exec_;
+ auto& bump = is_exec ? exec_bump_ : non_exec_bump_;
+ size_t limit = is_exec ? exec_bump_limit_ : midpoint_;
+
+ // Try free list first (best-fit). O(n) scan — acceptable for the expected
+ // workload of tens of JIT allocations, not thousands.
+ size_t best_idx = free_list.size();
+ size_t best_waste = std::numeric_limits<size_t>::max();
+ for (size_t i = 0; i < free_list.size(); ++i) {
+ if (free_list[i].size >= size && free_list[i].size - size < best_waste) {
+ best_idx = i;
+ best_waste = free_list[i].size - size;
+ if (best_waste == 0) break;
+ }
+ }
+
+ if (best_idx < free_list.size()) {
+ size_t offset = free_list[best_idx].offset;
+ if (free_list[best_idx].size == size) {
+ free_list.erase(free_list.begin() + best_idx);
+ } else {
+ free_list[best_idx].offset += size;
+ free_list[best_idx].size -= size;
+ }
+ return offset;
+ }
+
+ // Bump allocate within the pool's limit.
+ if (bump + size > limit) {
+ return make_error<StringError>(
+ std::string("ArenaJITLinkMemoryManager: ") + (is_exec ? "exec" : "non-exec") +
+ " pool exhausted (used " + formatv("{0:x}", bump).str() + " + requested " +
+ formatv("{0:x}", size).str() + " > limit " + formatv("{0:x}", limit).str() + ")",
+ inconvertibleErrorCode());
+ }
+
+ size_t offset = bump;
+ bump += size;
+ return offset;
+}
+
+void ArenaJITLinkMemoryManager::freeRegion(size_t offset, size_t size) {
+ if (size == 0) return;
+ std::lock_guard<std::mutex> Lock(mu_);
+
+ // Route to the correct pool's free list based on offset.
+ auto& free_list = (offset >= midpoint_) ? free_list_exec_ : free_list_non_exec_;
+
+ // Insert into free list in sorted order.
+ auto it = std::lower_bound(free_list.begin(), free_list.end(), offset,
+ [](const FreeBlock& fb, size_t off) { return fb.offset < off; });
+ it = free_list.insert(it, FreeBlock{offset, size});
+
+ // Coalesce with next.
+ auto next = it + 1;
+ if (next != free_list.end() && it->offset + it->size == next->offset) {
+ it->size += next->size;
+ free_list.erase(next);
+ }
+
+ // Coalesce with previous.
+ if (it != free_list.begin()) {
+ auto prev = it - 1;
+ if (prev->offset + prev->size == it->offset) {
+ prev->size += it->size;
+ free_list.erase(it);
+ }
+ }
+}
+
+void ArenaJITLinkMemoryManager::allocate(const JITLinkDylib* JD, LinkGraph& G,
+ OnAllocatedFunction OnAllocated) {
+ // ── Overflow section classification ──
+ //
+ // Sections matching known overflow names (e.g. .nv_fatbin — large GPU
+ // device blobs referenced only by absolute relocations) are allocated
+ // outside the arena via separate mmap(), keeping the arena compact for
+ // code + small rodata.
+ //
+ // Two-phase check:
+ // Phase 1 — Name-based candidate selection (.nv_fatbin).
+ // Phase 2 — Edge validation: any PC-relative cross-section edge
+ // targeting a candidate section disqualifies it (the
+ // section stays in the arena). This handles cases where
+ // the compiler generates ADRP/RIP-relative refs even for
+ // data sections.
+ //
+ // Validated candidates are temporarily set to NoAlloc so BasicLayout
+ // skips them, then immediately restored before returning. By the time
+ // JITLink's fixUpBlocks runs, sections are back to Standard — avoiding
+ // the debug assert that prohibits edges from allocated sections to
+ // NoAlloc sections.
+ DenseSet<Section*> overflow_candidates;
+ for (auto& Sec : G.sections()) {
+ if (Sec.getMemLifetime() == MemLifetime::NoAlloc) continue;
+ StringRef Name = Sec.getName();
+ if (Name.starts_with(".nv_fatbin")) {
+ overflow_candidates.insert(&Sec);
+ }
+ }
+
+ // Phase 2: edge validation — disqualify candidates with incoming PC-relative edges.
+ if (!overflow_candidates.empty()) {
+ const auto& TT = G.getTargetTriple();
+ for (auto& Sec : G.sections()) {
+ for (auto* B : Sec.blocks()) {
+ for (auto& E : B->edges()) {
+ if (!E.isRelocation()) continue;
+ if (isAbsoluteEdge(TT, E.getKind())) continue;
+ // PC-relative edge — if it targets a candidate, disqualify.
+ if (!E.getTarget().isDefined()) continue;
+ auto* TargetSec = &E.getTarget().getBlock().getSection();
+ overflow_candidates.erase(TargetSec);
+ }
+ }
+ if (overflow_candidates.empty()) break;
+ }
+ }
+
+ // Apply: temporarily hide validated overflow sections from BasicLayout.
+ SmallVector<std::pair<Section*, MemLifetime>, 4> overflow_sections;
+ for (auto* Sec : overflow_candidates) {
+ overflow_sections.push_back({Sec, Sec->getMemLifetime()});
+ Sec->setMemLifetime(MemLifetime::NoAlloc);
+ }
+
+ BasicLayout BL(G);
+
+ // Restore overflow sections to their original lifetime immediately.
+ // BasicLayout has already captured its segment list; subsequent LLVM
+ // passes (fixUpBlocks) will see the sections as normal Standard sections.
+ for (auto& [Sec, OrigLifetime] : overflow_sections) {
+ Sec->setMemLifetime(OrigLifetime);
+ }
+
+ // Compute total sizes grouped by lifetime.
+ auto SegsSizes = BL.getContiguousPageBasedLayoutSizes(page_size_);
+ if (!SegsSizes) {
+ OnAllocated(SegsSizes.takeError());
+ return;
+ }
+
+ if (SegsSizes->total() > std::numeric_limits<size_t>::max()) {
+ OnAllocated(make_error<llvm::jitlink::JITLinkError>(
+ "Total requested size " + formatv("{0:x}", SegsSizes->total()) + " for graph " +
+ G.getName() + " exceeds address space"));
+ return;
+ }
+
+ auto TotalSize = static_cast<size_t>(SegsSizes->total());
+ if (TotalSize == 0 && overflow_sections.empty()) {
+ // Empty graph — return a no-op allocation.
+ OnAllocated(std::make_unique<ArenaInFlightAlloc>(
+ *this, G, std::move(BL), ArenaInFlightAlloc::PoolRegion{0, 0, 0},
+ ArenaInFlightAlloc::PoolRegion{midpoint_, 0, 0}, std::vector<OverflowBlock>{}));
+ return;
+ }
+
+ // ── Dual-pool split ──
+ //
+ // Partition each segment into one of four buckets based on (Prot, Lifetime):
+ // non-exec × Standard / Finalize → non-exec pool (below midpoint_)
+ // exec × Standard / Finalize → exec pool (at/above midpoint_)
+ //
+ // Within each pool, Standard segments come first and Finalize segments
+ // second, so the Finalize tail of each pool can be freed after finalize().
+ size_t ne_std_size = 0, ne_fin_size = 0;
+ size_t e_std_size = 0, e_fin_size = 0;
+ for (auto& KV : BL.segments()) {
+ auto& AG = KV.first;
+ auto& Seg = KV.second;
+ auto SegSize = alignTo(Seg.ContentSize + Seg.ZeroFillSize, page_size_);
+ bool is_exec = (AG.getMemProt() & MemProt::Exec) != MemProt::None;
+ bool is_finalize = AG.getMemLifetime() == MemLifetime::Finalize;
+ if (is_exec) {
+ (is_finalize ? e_fin_size : e_std_size) += SegSize;
+ } else {
+ (is_finalize ? ne_fin_size : ne_std_size) += SegSize;
+ }
+ }
+ size_t ne_total = ne_std_size + ne_fin_size;
+ size_t e_total = e_std_size + e_fin_size;
+
+ ArenaInFlightAlloc::PoolRegion ne_region{0, 0, 0};
+ ArenaInFlightAlloc::PoolRegion e_region{midpoint_, 0, 0};
+
+ auto allocPool = [&](size_t req, bool is_exec) -> Expected<size_t> {
+ if (req == 0) return size_t{0};
+ auto off = bumpAllocate(req, is_exec);
+ if (!off) return off.takeError();
+ if (auto Err = commitPages(arena_base_ + *off, req)) {
+ freeRegion(*off, req);
+ return std::move(Err);
+ }
+ std::memset(arena_base_ + *off, 0, req);
+ return *off;
+ };
+
+ if (ne_total > 0) {
+ auto off = allocPool(ne_total, /*is_exec=*/false);
+ if (!off) {
+ OnAllocated(off.takeError());
+ return;
+ }
+ ne_region = {*off, ne_std_size, ne_fin_size};
+ }
+ if (e_total > 0) {
+ auto off = allocPool(e_total, /*is_exec=*/true);
+ if (!off) {
+ // Unwind non-exec allocation on failure to keep the pools consistent.
+ if (ne_total > 0) {
+ decommitPages(arena_base_ + ne_region.offset, ne_total);
+ freeRegion(ne_region.offset, ne_total);
+ }
+ OnAllocated(off.takeError());
+ return;
+ }
+ e_region = {*off, e_std_size, e_fin_size};
+ }
+
+ // Assign addresses to segments from four cursors. Standard comes first in
+ // each pool, then Finalize.
+ auto NeStdCursor = ExecutorAddr::fromPtr(arena_base_ + ne_region.offset);
+ auto NeFinCursor = ExecutorAddr::fromPtr(arena_base_ + ne_region.offset + ne_std_size);
+ auto EStdCursor = ExecutorAddr::fromPtr(arena_base_ + e_region.offset);
+ auto EFinCursor = ExecutorAddr::fromPtr(arena_base_ + e_region.offset + e_std_size);
+
+ for (auto& KV : BL.segments()) {
+ auto& AG = KV.first;
+ auto& Seg = KV.second;
+ bool is_exec = (AG.getMemProt() & MemProt::Exec) != MemProt::None;
+ bool is_finalize = AG.getMemLifetime() == MemLifetime::Finalize;
+ auto& Cursor = is_exec ? (is_finalize ? EFinCursor : EStdCursor)
+ : (is_finalize ? NeFinCursor : NeStdCursor);
+ Seg.WorkingMem = Cursor.toPtr<char*>();
+ Seg.Addr = Cursor;
+ auto SegSize = alignTo(Seg.ContentSize + Seg.ZeroFillSize, page_size_);
+ Cursor += SegSize;
+ }
+
+ // Apply layout — copies content and assigns block addresses for arena segments.
+ if (auto Err = BL.apply()) {
+ // On error: decommit and free both pool regions.
+ if (ne_total > 0) {
+ decommitPages(arena_base_ + ne_region.offset, ne_total);
+ freeRegion(ne_region.offset, ne_total);
+ }
+ if (e_total > 0) {
+ decommitPages(arena_base_ + e_region.offset, e_total);
+ freeRegion(e_region.offset, e_total);
+ }
+ OnAllocated(std::move(Err));
+ return;
+ }
+
+ // ── Allocate overflow sections via mmap() outside the arena ──
+ std::vector<OverflowBlock> overflow_allocs;
+
+ for (auto& [Sec, _] : overflow_sections) {
+ // Compute total size for this section's blocks.
+ size_t total_sec_size = 0;
+ for (auto* B : Sec->blocks()) {
+ total_sec_size = alignTo(total_sec_size, B->getAlignment());
+ total_sec_size += B->getSize();
+ }
+ if (total_sec_size == 0) continue;
+ total_sec_size = alignTo(total_sec_size, page_size_);
+
+ // mmap outside the arena.
+ void* addr =
+ ::mmap(nullptr, total_sec_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (addr == MAP_FAILED) {
+ // Clean up prior overflow allocs, free both pool regions, report error.
+ for (auto& ob : overflow_allocs) ::munmap(ob.addr, ob.size);
+ if (ne_total > 0) {
+ decommitPages(arena_base_ + ne_region.offset, ne_total);
+ freeRegion(ne_region.offset, ne_total);
+ }
+ if (e_total > 0) {
+ decommitPages(arena_base_ + e_region.offset, e_total);
+ freeRegion(e_region.offset, e_total);
+ }
+ OnAllocated(
+ make_error<StringError>("ArenaJITLinkMemoryManager: overflow mmap failed for section " +
+ Sec->getName() + ": " + std::strerror(errno),
+ inconvertibleErrorCode()));
+ return;
+ }
+
+ // Layout blocks within the mmap'd region.
+ char* ptr = static_cast<char*>(addr);
+ for (auto* B : Sec->blocks()) {
+ uint64_t align = B->getAlignment();
+ ptr = reinterpret_cast<char*>(alignTo(reinterpret_cast<uintptr_t>(ptr), align));
+ size_t bsize = B->getSize();
+ // Copy content and redirect block's mutable content pointer.
+ if (!B->isZeroFill()) {
+ auto content = B->getContent();
+ std::memcpy(ptr, content.data(), content.size());
+ B->setMutableContent(MutableArrayRef<char>(ptr, bsize));
+ }
+ // Assign block address (working mem == executor addr for in-process JIT).
+ B->setAddress(ExecutorAddr::fromPtr(ptr));
+ ptr += bsize;
+ }
+
+ overflow_allocs.push_back({addr, total_sec_size, Sec->getMemProt()});
+ }
+
+ OnAllocated(std::make_unique<ArenaInFlightAlloc>(*this, G, std::move(BL), ne_region, e_region,
+ std::move(overflow_allocs)));
+}
+
+void ArenaJITLinkMemoryManager::deallocate(std::vector<FinalizedAlloc> Allocs,
+ OnDeallocatedFunction OnDeallocated) {
+ Error DeallocErr = Error::success();
+
+ for (auto& Alloc : Allocs) {
+ // Reclaim ownership of the FinalizedAllocInfo created in finalize().
+ auto* FA = Alloc.release().toPtr<FinalizedAllocInfo*>();
+
+ // Run deallocation actions in reverse order.
+ while (!FA->DeallocActions.empty()) {
+ if (auto Err = FA->DeallocActions.back().runWithSPSRetErrorMerged())
+ DeallocErr = joinErrors(std::move(DeallocErr), std::move(Err));
+ FA->DeallocActions.pop_back();
+ }
+
+ // Decommit and free each pool's Standard region.
+ if (FA->non_exec_standard_size > 0) {
+ decommitPages(arena_base_ + FA->non_exec_offset,
FA->non_exec_standard_size);
+ freeRegion(FA->non_exec_offset, FA->non_exec_standard_size);
+ }
+ if (FA->exec_standard_size > 0) {
+ decommitPages(arena_base_ + FA->exec_offset, FA->exec_standard_size);
+ freeRegion(FA->exec_offset, FA->exec_standard_size);
+ }
+
+ // Release overflow blocks.
+ for (auto& ob : FA->overflow_blocks) {
+ ::munmap(ob.addr, ob.size);
+ }
+
+ delete FA;
+ }
+
+ OnDeallocated(std::move(DeallocErr));
+}
+
+} // namespace orcjit
+} // namespace ffi
+} // namespace tvm
+
+#endif // __linux__
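
Reviewer note: the overflow path above sizes each section in one pass and places its blocks in a second pass. The sketch below reproduces that two-pass layout in isolation. `BlockDesc`, `AlignUp`, `SectionMapSize`, and `BlockOffsets` are hypothetical names used only for illustration; they are not part of the patch or of LLVM. It works in offsets from the mapping base, which matches the address-based alignment in the patch as long as block alignments do not exceed the page size (the `mmap` base is page-aligned).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a JITLink block: only size and alignment matter here.
struct BlockDesc {
  std::size_t size;
  std::size_t alignment;  // must be a power of two
};

// Round value up to the next multiple of align (a power of two).
inline std::size_t AlignUp(std::size_t value, std::size_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return (value + align - 1) & ~(align - 1);
}

// Pass 1: bytes needed for all blocks, rounded up to whole pages.
// Returns 0 for an empty section, which the caller would skip.
inline std::size_t SectionMapSize(const std::vector<BlockDesc>& blocks, std::size_t page_size) {
  std::size_t total = 0;
  for (const BlockDesc& b : blocks) {
    total = AlignUp(total, b.alignment);
    total += b.size;
  }
  return total == 0 ? 0 : AlignUp(total, page_size);
}

// Pass 2: offset of each block from the page-aligned mapping base.
inline std::vector<std::size_t> BlockOffsets(const std::vector<BlockDesc>& blocks) {
  std::vector<std::size_t> offsets;
  std::size_t cursor = 0;
  for (const BlockDesc& b : blocks) {
    cursor = AlignUp(cursor, b.alignment);
    offsets.push_back(cursor);
    cursor += b.size;
  }
  return offsets;
}
```

Because both passes apply the same align-then-advance rule, the last block always ends inside the size computed by the first pass.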
diff --git a/addons/tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.h
b/addons/tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.h
new file mode 100644
index 0000000..8f83c38
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.h
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * \file orcjit_memory_manager.h
+ * \brief Arena-based JITLinkMemoryManager for LLVM ORC JIT.
+ *
+ * Pre-reserves a contiguous virtual address region and bump-allocates
+ * from it, keeping all JIT allocations within range of PC-relative
+ * relocations (±2GB on x86_64, ±4GB on AArch64).
+ *
+ * This eliminates relocation overflow caused by scattered mmap
+ * allocations under ASLR (LLVM issue #173269).
+ *
+ * ## GOTPCRELX relaxation workaround
+ *
+ * The arena triggers a latent bug in LLVM JITLink's
+ * `optimizeGOTAndStubAccesses()` (x86_64.cpp). That pass relaxes
+ * `call *foo@GOTPCREL(%rip)` (ff 15) → `addr32 call foo` (67 e8) and
+ * sets the edge kind to `Pointer32` (absolute 32-bit address). However
+ * the `call rel32` instruction is always **PC-relative** — the `67`
+ * prefix is just padding — so the fixup should be PC-relative too
+ * (matching the static linker's `R_X86_64_PC32`).
+ *
+ * The bug is latent because the relaxation only fires when the target
+ * address fits in 32 bits (`isUInt<32>`). On PIE executables every
+ * resolved symbol is at a high address, so the guard is never true and
+ * the relaxation never runs. On **non-PIE** executables the PLT
+ * entries for libc functions (malloc, free, …) live near 0x400000, the
+ * guard passes, and the wrong fixup produces a garbage displacement →
+ * SIGSEGV during ORC-runtime teardown.
+ *
+ * `GOTPCRELXFixPlugin` in orcjit_session.cc works around this: a
+ * PreFixupPass that runs *after* `optimizeGOTAndStubAccesses` detects
+ * `Pointer32` edges on `67 e8` / `e9` instructions and either
+ * (a) converts to `BranchPCRel32` when the PC-relative displacement
+ * fits in int32, or
+ * (b) reverts the relaxation entirely — restores the `ff 15` /
+ * `ff 25` opcode bytes and retargets the edge to the GOT entry
+ * with `PCRel32` + addend 0.
+ */
+#ifndef TVM_FFI_ORCJIT_ORCJIT_MEMORY_MANAGER_H_
+#define TVM_FFI_ORCJIT_ORCJIT_MEMORY_MANAGER_H_
+
+#include <llvm/ExecutionEngine/JITLink/JITLinkMemoryManager.h>
+#include <llvm/ExecutionEngine/Orc/Shared/MemoryFlags.h>
+
+#include <atomic>
+#include <memory>
+#include <mutex>
+#include <vector>
+
+namespace tvm {
+namespace ffi {
+namespace orcjit {
+
+/*! \brief Arena-based memory manager for JITLink.
+ *
+ * Reserves a large contiguous VA region at construction time using
+ * PROT_NONE (zero physical memory cost). Each allocate() call
+ * bump-allocates from this region, commits pages as RW, and assigns
+ * addresses. On finalization, pages receive their target protections.
+ * On deallocation, pages are decommitted and returned to a free list.
+ *
+ * The default arena size (4 GB on x86_64, 8 GB on AArch64) is twice the
+ * architecture's PC-relative reach (±2 GB / ±4 GB), so the arena is
+ * never the bottleneck: JITLink's own relocation overflow checker
+ * fires first, matching dlopen/ld.so failure semantics. If the initial
+ * reservation fails (RLIMIT_AS, container limits), the constructor
+ * repeatedly halves the capacity, down to kMinArenaCapacity (256 MB).
+ *
+ * ## Slab-based commit with Transparent Huge Page (THP) support
+ *
+ * Arena pages are committed in 2 MB slabs (kSlabSize) rather than
+ * per-allocation. Each slab is committed exactly once via an atomic
+ * flag (lock-free, no contention with the allocator mutex).
+ *
+ * Benefits:
+ * - Batches up to 512 page faults into a single sequential mprotect
+ * per slab, reducing kernel trap overhead.
+ * - 2 MB matches the Linux huge page size on both x86_64 and AArch64.
+ * Combined with madvise(MADV_HUGEPAGE) applied at construction, the
+ * kernel can promote each fully-faulted slab into a single TLB
+ * entry (replacing 512 x 4 KB entries), reducing TLB misses during
+ * JIT code execution.
+ * - Worst-case waste is <2 MB in the last partially-used slab —
+ * negligible for typical ML workloads.
+ */
+class ArenaJITLinkMemoryManager : public llvm::jitlink::JITLinkMemoryManager {
+ public:
+ // Default arena: strictly larger than the relocation limit so the arena
+ // is never the bottleneck. JITLink's own overflow check fires first,
+ // matching dlopen/ld.so failure semantics.
+ //
+ // x86_64 PC32: ±2GB → 4GB default (2× headroom)
+ // AArch64 ADRP: ±4GB → 8GB default (2× headroom)
+ static constexpr size_t kDefaultArenaCapacity_x86_64 = size_t{4} << 30; // 4 GB
+ static constexpr size_t kDefaultArenaCapacity_AArch64 = size_t{8} << 30; // 8 GB
+ static constexpr size_t kMinArenaCapacity = size_t{256} << 20; // 256 MB floor
+ // Slab commit granularity. Matches Linux huge page size (2 MB) on both
+ // x86_64 and AArch64, enabling THP promotion via madvise(MADV_HUGEPAGE).
+ static constexpr size_t kSlabSize = size_t{2} << 20; // 2 MB
+ // PC-relative relocation reach (tightest binding fixup). Cross-pool
+ // references (.text → .rodata, .eh_frame → .text, etc.) must fit within
+ // this signed displacement. The binding constraint on both x86_64 and
+ // aarch64 is the signed 32-bit Delta32 used in .eh_frame unwind records
+ // (±2 GB), not the wider ADRP+ADD / RIP-rel reach. The dual-pool allocator
+ // keeps both pools inside kPCRelReach bytes of each other even when the VA
+ // reservation is larger, so cross-pool Delta32 fixups always resolve.
+ static constexpr size_t kPCRelReach = (size_t{1} << 31) - kSlabSize; // ~2 GB
+
+ // Default fraction of the arena reserved for non-exec segments (r--, rw-).
+ // The remainder holds exec segments (r-x). Picked by splitting the arena
+ // at a 2 MB-aligned boundary (midpoint_); the exec pool thus starts on a
+ // 2 MB page boundary, maximizing r-x page packing.
+ // Typical CUDA binding objects: ~2 parts rodata+data to 1 part text.
+ static constexpr double kDefaultNonExecFraction = 2.0 / 3.0;
+
+ explicit ArenaJITLinkMemoryManager(size_t page_size, size_t arena_capacity);
+ ~ArenaJITLinkMemoryManager() override;
+
+ ArenaJITLinkMemoryManager(const ArenaJITLinkMemoryManager&) = delete;
+ ArenaJITLinkMemoryManager& operator=(const ArenaJITLinkMemoryManager&) =
delete;
+ ArenaJITLinkMemoryManager(ArenaJITLinkMemoryManager&&) = delete;
+ ArenaJITLinkMemoryManager& operator=(ArenaJITLinkMemoryManager&&) = delete;
+
+ void allocate(const llvm::jitlink::JITLinkDylib* JD,
llvm::jitlink::LinkGraph& G,
+ OnAllocatedFunction OnAllocated) override;
+
+ void deallocate(std::vector<FinalizedAlloc> Allocs, OnDeallocatedFunction
OnDeallocated) override;
+
+ private:
+ class ArenaInFlightAlloc;
+
+ /*! \brief A section allocated outside the arena via separate mmap().
+ *
+ * Sections whose only cross-section references use absolute relocations
+ * (e.g. .nv_fatbin) are placed here to keep the arena compact. */
+ struct OverflowBlock {
+ void* addr; // mmap'd base address
+ size_t size; // mmap'd size (page-aligned)
+ llvm::orc::MemProt prot; // target protection for finalize
+ };
+
+ /*! \brief Metadata for a finalized allocation, stored via FinalizedAlloc handle.
+ *
+ * The arena is split into two pools at midpoint_. Each allocate() call may
+ * consume a region from either or both pools. Standard-lifetime pages remain
+ * committed after finalize(); Finalize-lifetime pages are decommitted at the
+ * end of finalize(). Zero-sized sub-regions indicate no allocation from that
+ * pool. */
+ struct FinalizedAllocInfo {
+ size_t non_exec_offset; // offset of non-exec Standard region (or 0 if unused)
+ size_t non_exec_standard_size; // bytes retained in non-exec pool after finalize
+ size_t exec_offset; // offset of exec Standard region (or midpoint_ if unused)
+ size_t exec_standard_size; // bytes retained in exec pool after finalize
+ std::vector<llvm::orc::shared::WrapperFunctionCall> DeallocActions;
+ std::vector<OverflowBlock> overflow_blocks;
+ };
+
+ /*! \brief Bump-allocate from the selected pool. Returns offset within
arena. */
+ llvm::Expected<size_t> bumpAllocate(size_t size, bool is_exec);
+
+ /*! \brief Return a region to the appropriate free list (coalesces adjacent
blocks).
+ * Pool is identified by comparing offset against midpoint_. */
+ void freeRegion(size_t offset, size_t size);
+
+ // ── Platform abstraction ──
+ static void* reserveVA(size_t size);
+ static void releaseVA(void* addr, size_t size);
+ llvm::Error commitPages(void* addr, size_t size);
+ static void decommitPages(void* addr, size_t size);
+ static llvm::Error protectPages(void* addr, size_t size, llvm::orc::MemProt
Prot);
+
+ char* arena_base_;
+ size_t arena_capacity_;
+ size_t page_size_;
+
+ // ── Dual-pool split ──
+ // The arena is partitioned at midpoint_ (a 2 MB-aligned offset) into:
+ // non-exec pool = [arena_base_, arena_base_ + midpoint_)
+ // exec pool = [arena_base_ + midpoint_, arena_base_ + exec_bump_limit_)
+ // Both pools grow upward from their base. The exec pool starts on a 2 MB
+ // boundary so r-x segments can pack as tightly as possible into 2 MB pages.
+ //
+ // exec_bump_limit_ = min(arena_capacity_, kPCRelReach). Bytes beyond this
+ // limit stay reserved (VA only, no commit) but are not used for allocation
+ // so cross-pool references always fit within the PC-relative reach.
+ size_t midpoint_;
+ size_t exec_bump_limit_;
+
+ std::mutex mu_;
+ size_t non_exec_bump_; // next free offset in non-exec pool ∈ [0, midpoint_]
+ size_t exec_bump_; // next free offset in exec pool ∈ [midpoint_, exec_bump_limit_]
+
+ struct FreeBlock {
+ size_t offset;
+ size_t size;
+ };
+ std::vector<FreeBlock> free_list_non_exec_;
+ std::vector<FreeBlock> free_list_exec_;
+
+ /*! \brief Per-slab commit flags (0 = uncommitted, 1 = committed).
+ * Lock-free: each slab is committed exactly once via compare_exchange. */
+ std::unique_ptr<std::atomic<uint8_t>[]> slab_committed_;
+ size_t num_slabs_ = 0;
+};
+
+} // namespace orcjit
+} // namespace ffi
+} // namespace tvm
+
+#endif // TVM_FFI_ORCJIT_ORCJIT_MEMORY_MANAGER_H_
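
Reviewer note: the header describes per-slab commit flags that are claimed once via compare-exchange. The sketch below shows only that bookkeeping, under the assumption that a slab is "committed" by whichever caller flips its flag from 0 to 1. `SlabTracker` and `Claim` are hypothetical names, and the real manager would also change page protections at the point marked in the comment.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

constexpr std::size_t kSlab = std::size_t{2} << 20;  // 2 MB, same value as kSlabSize

class SlabTracker {
 public:
  explicit SlabTracker(std::size_t capacity)
      : num_slabs_((capacity + kSlab - 1) / kSlab),
        committed_(new std::atomic<std::uint8_t>[num_slabs_]) {
    // Explicitly zero the flags; default-initialized atomics are not zeroed before C++20.
    for (std::size_t i = 0; i < num_slabs_; ++i) committed_[i].store(0, std::memory_order_relaxed);
  }

  // Marks every slab overlapping [offset, offset + size) as committed.
  // Returns how many slabs this call was the first to claim.
  std::size_t Claim(std::size_t offset, std::size_t size) {
    if (size == 0) return 0;
    assert(offset + size <= num_slabs_ * kSlab);
    const std::size_t first = offset / kSlab;
    const std::size_t last = (offset + size - 1) / kSlab;
    std::size_t claimed = 0;
    for (std::size_t i = first; i <= last; ++i) {
      std::uint8_t expected = 0;
      if (committed_[i].compare_exchange_strong(expected, 1, std::memory_order_acq_rel)) {
        ++claimed;  // a real manager would make the slab's pages accessible here
      }
    }
    return claimed;
  }

 private:
  std::size_t num_slabs_;  // declared before committed_ so it is initialized first
  std::unique_ptr<std::atomic<std::uint8_t>[]> committed_;
};
```

The flag alone only decides which caller performs the commit. Whether other threads must wait for that commit to finish is a separate question this sketch does not address.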
diff --git a/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.cc
b/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.cc
index ed2290e..0bca531 100644
--- a/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.cc
+++ b/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.cc
@@ -24,6 +24,7 @@
#include "orcjit_session.h"
+#include <llvm/ADT/DenseMap.h>
#include <llvm/ExecutionEngine/JITLink/JITLink.h>
#include <llvm/ExecutionEngine/JITLink/x86_64.h>
#include <llvm/ExecutionEngine/Orc/AbsoluteSymbols.h>
@@ -34,6 +35,7 @@
#include <llvm/ExecutionEngine/Orc/Shared/ExecutorSymbolDef.h>
#include <llvm/Support/DynamicLibrary.h>
#include <llvm/Support/Error.h>
+#include <llvm/Support/Process.h>
#include <llvm/Support/TargetSelect.h>
#include <llvm/TargetParser/SubtargetFeature.h>
#include <tvm/ffi/cast.h>
@@ -44,6 +46,8 @@
#include <cstddef>
#include <cstring>
+#include "orcjit_memory_manager.h"
+
#ifdef _WIN32
#ifndef NOMINMAX
#define NOMINMAX
@@ -441,26 +445,188 @@ class DLLImportDefinitionGenerator : public
llvm::orc::DefinitionGenerator {
};
#endif // _WIN32
-ORCJITExecutionSessionObj::ORCJITExecutionSessionObj(const std::string&
orc_rt_path)
+#if defined(__linux__) && (defined(__x86_64__) || defined(_M_X64))
+/*! \brief Fix LLVM JITLink GOTPCRELX relaxation bug (x86_64).
+ *
+ * optimizeGOTAndStubAccesses() relaxes `call *foo@GOTPCREL(%rip)` (ff 15)
+ * to `addr32 call foo` (67 e8) and sets the edge to Pointer32 (absolute
+ * 32-bit). But `call rel32` is always PC-relative — the CPU computes
+ * target = RIP + imm32, not target = imm32. The Pointer32 fixup writes
+ * the absolute address, producing a wrong displacement.
+ *
+ * This only manifests when external symbols resolve to low addresses
+ * (< 4 GB, e.g. PLT entries in a non-PIE executable) while JIT code is
+ * at high addresses (the arena at 0x7f...). The optimization fires
+ * because isUInt<32>(target) is true, but the resulting fixup is wrong.
+ *
+ * The PreFixupPass converts such edges to BranchPCRel32 when rel32 reaches the target,
+ * else reverts to an indirect call/jmp via the GOT. See orcjit_memory_manager.h.
+ */
+class GOTPCRELXFixPlugin : public llvm::orc::ObjectLinkingLayer::Plugin {
+ public:
+ void modifyPassConfig(llvm::orc::MaterializationResponsibility& MR,
llvm::jitlink::LinkGraph& G,
+ llvm::jitlink::PassConfiguration& Config) override {
+ Config.PreFixupPasses.emplace_back(fixBrokenGOTPCRELXRelaxation);
+ }
+ llvm::Error notifyFailed(llvm::orc::MaterializationResponsibility& MR)
override {
+ return llvm::Error::success();
+ }
+ llvm::Error notifyRemovingResources(llvm::orc::JITDylib& JD,
llvm::orc::ResourceKey K) override {
+ return llvm::Error::success();
+ }
+ void notifyTransferringResources(llvm::orc::JITDylib& JD,
llvm::orc::ResourceKey DstKey,
+ llvm::orc::ResourceKey SrcKey) override {}
+
+ private:
+ /*! \brief Correct broken GOTPCRELX relaxations produced by
+ * optimizeGOTAndStubAccesses().
+ *
+ * Strategy:
+ * 1. Build target-symbol → GOT-entry-symbol map (O(B+S) up front).
+ * 2. For every Pointer32 edge whose preceding bytes are 67 e8
+ * (relaxed call) or e9 (relaxed jmp):
+ * - If the target is reachable via a signed 32-bit PC-relative
+ * displacement, change the edge to BranchPCRel32.
+ * - Otherwise revert the relaxation: restore the original
+ * indirect-call/jmp opcode bytes (ff 15 / ff 25), retarget
+ * the edge to the GOT entry, and use PCRel32 with addend 0
+ * (JITLink normalises GOTPCRELX addends to 0).
+ */
+ static llvm::Error fixBrokenGOTPCRELXRelaxation(llvm::jitlink::LinkGraph& G)
{
+ using namespace llvm::jitlink;
+ // Build block → first symbol at offset 0 (for GOT entry symbol lookup).
+ llvm::DenseMap<Block*, Symbol*> BlockToSym;
+ for (auto* Sym : G.defined_symbols()) {
+ if (Sym->getOffset() == 0 && !BlockToSym.count(&Sym->getBlock())) {
+ BlockToSym[&Sym->getBlock()] = Sym;
+ }
+ }
+
+ // Build target symbol → GOT entry symbol map.
+ // GOT entries are pointer-sized blocks with exactly one Pointer64 edge.
+ llvm::DenseMap<Symbol*, Symbol*> SymToGOTSym;
+ for (auto* B : G.blocks()) {
+ if (B->getSize() != G.getPointerSize()) continue;
+ if (B->edges_size() != 1) continue;
+ auto& E = *B->edges().begin();
+ if (E.getKind() == x86_64::Pointer64) {
+ auto It = BlockToSym.find(B);
+ if (It != BlockToSym.end()) {
+ SymToGOTSym[&E.getTarget()] = It->second;
+ }
+ }
+ }
+
+ for (auto* B : G.blocks()) {
+ for (auto& E : B->edges()) {
+ if (E.getKind() != x86_64::Pointer32) continue;
+ if (E.getOffset() < 2) continue;
+
+ auto MutableContent = B->getMutableContent(G);
+ auto* FixupData = reinterpret_cast<uint8_t*>(MutableContent.data()) +
E.getOffset();
+ uint8_t Prev2 = FixupData[-2];
+ uint8_t Prev1 = FixupData[-1];
+
+ bool isRelaxedCall = (Prev2 == 0x67 && Prev1 == 0xe8);
+ bool isRelaxedJmp = (Prev1 == 0xe9);
+ if (!isRelaxedCall && !isRelaxedJmp) continue;
+
+ // Check if PC-relative displacement would fit.
+ auto TargetAddr = E.getTarget().getAddress();
+ auto FixupAddr = B->getFixupAddress(E);
+ int64_t Displacement = TargetAddr.getValue() - (FixupAddr.getValue() +
4) + E.getAddend();
+ if (llvm::isInt<32>(Displacement)) {
+ E.setKind(x86_64::BranchPCRel32);
+ continue;
+ }
+
+ // Distance doesn't fit — revert to indirect call/jmp through GOT.
+ auto It = SymToGOTSym.find(&E.getTarget());
+ if (It == SymToGOTSym.end()) {
+ return llvm::make_error<llvm::StringError>(
+ "Cannot revert GOTPCRELX relaxation: no GOT entry for " +
+ (E.getTarget().hasName() ?
std::string(*E.getTarget().getName())
+ : std::string("<anon>")),
+ llvm::inconvertibleErrorCode());
+ }
+
+ Symbol* GOTSym = It->second;
+ if (isRelaxedCall) {
+ // Restore: 67 e8 → ff 15 (call *[rip+disp32])
+ FixupData[-2] = 0xff;
+ FixupData[-1] = 0x15;
+ } else {
+ // Restore: e9 XX XX XX XX 90 → ff 25 XX XX XX XX
+ FixupData[-1] = 0xff;
+ FixupData[0] = 0x25;
+ // For jmp, the optimization shifted offset by -1; shift back.
+ E.setOffset(E.getOffset() + 1);
+ }
+ E.setKind(x86_64::PCRel32);
+ E.setTarget(*GOTSym);
+ E.setAddend(0);
+ }
+ }
+ return llvm::Error::success();
+ }
+};
+#endif // __linux__ && __x86_64__
+
+ORCJITExecutionSessionObj::ORCJITExecutionSessionObj(const std::string&
orc_rt_path,
+ int64_t arena_size_bytes)
: jit_(nullptr) {
- // Helper: force JITLink's ObjectLinkingLayer on platforms where
- // the default RTDyldObjectLinkingLayer won't work.
+ // Create arena memory manager: pre-reserves a contiguous VA region so all
+ // JIT allocations stay within PC-relative relocation range (±2GB x86_64,
+ // ±4GB AArch64). Eliminates scattered-mmap relocation overflow (LLVM #173269).
//
- // macOS: MachOPlatform (via ExecutorNativePlatform) requires
ObjectLinkingLayer.
+ // arena_size_bytes: 0 = arch default (4GB x86_64, 8GB AArch64, with fallback),
+ // >0 = custom size, <0 = disable arena.
+ // The parameter is Linux-only; on macOS/Windows the arena is compiled out
+ // entirely (see #ifdef below) and the value is ignored.
//
- // Windows: LLJIT defaults to RTDyldObjectLinkingLayer for COFF x86_64
- // (see LLJIT.cpp, LLJITBuilderState::prepareForConstruction). We need
- // ObjectLinkingLayer because:
- // 1. Our InitFiniPlugin inherits ObjectLinkingLayer::Plugin — RTDyld has
- // no plugin API, so the static_cast<ObjectLinkingLayer&> would crash.
- // 2. We skip the ORC runtime on Windows (COFFPlatform requires MSVC CRT
- // symbols like _CxxThrowException, RTTI vtables, iostream objects that
- // are not resolvable in the JIT), so we handle .CRT$XC*/.CRT$XT*
- // init/fini sections ourselves via the plugin.
+ // The default is strictly larger than the relocation limit so the arena is
+ // never the bottleneck — JITLink's own overflow check fires first, matching
+ // dlopen/ld.so semantics. The constructor halves capacity on mmap failure
+ // (RLIMIT_AS, containers) down to 256 MB.
//
- // Linux: LLJIT already defaults to ObjectLinkingLayer for ELF, no override
needed.
- auto setup_builder = [](llvm::orc::LLJITBuilder& builder) {
-#if defined(__APPLE__) || defined(_WIN32)
+ // LLJIT auto-configures ObjectLinkingLayer (JITLink) on x86_64 and aarch64
+ // Linux (see LLJITBuilderState::prepareForConstruction). We override
+ // the layer creator to pass our arena MM. macOS/Windows are excluded:
+ // macOS MachOPlatform teardown crashes with the arena; Windows needs
+ // further testing.
+#ifdef __linux__
+ if (arena_size_bytes >= 0) {
+ auto page_size = llvm::sys::Process::getPageSizeEstimate();
+ size_t capacity;
+ if (arena_size_bytes > 0) {
+ capacity = static_cast<size_t>(arena_size_bytes);
+ } else {
+#if defined(__aarch64__)
+ capacity = ArenaJITLinkMemoryManager::kDefaultArenaCapacity_AArch64;
+#else
+ capacity = ArenaJITLinkMemoryManager::kDefaultArenaCapacity_x86_64;
+#endif
+ }
+ memory_manager_ = std::make_unique<ArenaJITLinkMemoryManager>(page_size,
capacity);
+ }
+#endif
+
+ auto setup_builder = [this](llvm::orc::LLJITBuilder& builder) {
+#ifdef __linux__
+ if (memory_manager_) {
+ builder.setObjectLinkingLayerCreator(
+ [this](llvm::orc::ExecutionSession& ES)
+ -> llvm::Expected<std::unique_ptr<llvm::orc::ObjectLayer>> {
+ auto OLL = std::make_unique<llvm::orc::ObjectLinkingLayer>(ES,
*memory_manager_);
+#if defined(__x86_64__) || defined(_M_X64)
+ OLL->addPlugin(std::make_unique<GOTPCRELXFixPlugin>());
+#endif
+ return OLL;
+ });
+ } // if (memory_manager_)
+#elif defined(__APPLE__) || defined(_WIN32)
+ // macOS: MachOPlatform (via ExecutorNativePlatform) requires
ObjectLinkingLayer.
+ // Windows: need ObjectLinkingLayer for InitFiniPlugin and
DLLImportDefinitionGenerator.
builder.setObjectLinkingLayerCreator(
[](llvm::orc::ExecutionSession& ES)
-> llvm::Expected<std::unique_ptr<llvm::orc::ObjectLayer>> {
@@ -608,8 +774,10 @@ ORCJITExecutionSessionObj::ORCJITExecutionSessionObj(const
std::string& orc_rt_p
#endif
}
-ORCJITExecutionSession::ORCJITExecutionSession(const std::string& orc_rt_path)
{
- ObjectPtr<ORCJITExecutionSessionObj> obj =
make_object<ORCJITExecutionSessionObj>(orc_rt_path);
+ORCJITExecutionSession::ORCJITExecutionSession(const std::string& orc_rt_path,
+ int64_t arena_size_bytes) {
+ ObjectPtr<ORCJITExecutionSessionObj> obj =
+ make_object<ORCJITExecutionSessionObj>(orc_rt_path, arena_size_bytes);
data_ = std::move(obj);
}
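
Reviewer note: the arithmetic behind the GOTPCRELX fix can be checked without LLVM. The sketch below models how an x86_64 `call rel32` resolves its destination and compares an absolute 32-bit fixup with a PC-relative one. All function names are hypothetical, the addend is assumed to be 0, and the addresses in the usage note are made-up examples of a low PLT address and a high arena address.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Destination of `call rel32` / `jmp rel32`: the address just past the
// 4-byte immediate plus the sign-extended immediate.
inline std::uint64_t BranchDestination(std::uint64_t fixup_addr, std::uint32_t imm32) {
  const std::int64_t disp = static_cast<std::int32_t>(imm32);
  return fixup_addr + 4 + static_cast<std::uint64_t>(disp);
}

// What an absolute 32-bit fixup stores: the low 32 bits of the target.
inline std::uint32_t AbsoluteFixup(std::uint64_t target) {
  return static_cast<std::uint32_t>(target);
}

// What a PC-relative 32-bit branch fixup stores (addend assumed to be 0).
inline std::uint32_t PCRelFixup(std::uint64_t fixup_addr, std::uint64_t target) {
  return static_cast<std::uint32_t>(target - (fixup_addr + 4));
}

// True when the PC-relative displacement fits in a signed 32-bit immediate.
inline bool FitsRel32(std::uint64_t fixup_addr, std::uint64_t target) {
  const std::int64_t disp = static_cast<std::int64_t>(target - (fixup_addr + 4));
  return disp >= std::numeric_limits<std::int32_t>::min() &&
         disp <= std::numeric_limits<std::int32_t>::max();
}
```

With a fixup at 0x7f0000001000 and a target at 0x401030, `FitsRel32` is false, and writing `AbsoluteFixup(target)` into the immediate sends the branch to 0x7f0000402034 rather than to the target. That is the case where the pass has to fall back to the indirect call through the GOT.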
diff --git a/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.h
b/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.h
index cacffcd..e64c17b 100644
--- a/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.h
+++ b/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.h
@@ -31,9 +31,12 @@
#include <tvm/ffi/string.h>
#include <atomic>
+#include <memory>
#include <string>
#include <unordered_map>
+#include "orcjit_memory_manager.h"
+
namespace tvm {
namespace ffi {
namespace orcjit {
@@ -52,7 +55,8 @@ class ORCJITExecutionSessionObj : public Object {
/*!
* \brief Default constructor (for make_object)
*/
- explicit ORCJITExecutionSessionObj(const std::string& orc_rt_path = "");
+ explicit ORCJITExecutionSessionObj(const std::string& orc_rt_path = "",
+ int64_t arena_size_bytes = 0);
/*!
* \brief Create a new DynamicLibrary (JITDylib) in this session
@@ -95,6 +99,8 @@ class ORCJITExecutionSessionObj : public Object {
void AddPendingDeinitializer(llvm::orc::JITDylib* jd, const InitFiniEntry&
entry);
private:
+ /*! \brief Arena memory manager — must be declared before jit_ for
destruction order */
+ std::unique_ptr<ArenaJITLinkMemoryManager> memory_manager_;
/*! \brief The LLVM ORC JIT instance */
std::unique_ptr<llvm::orc::LLJIT> jit_;
@@ -116,7 +122,8 @@ class ORCJITExecutionSession : public ObjectRef {
* \brief Create a new ExecutionSession
* \return The created execution session instance
*/
- explicit ORCJITExecutionSession(const std::string& orc_rt_path = "");
+ explicit ORCJITExecutionSession(const std::string& orc_rt_path = "",
+ int64_t arena_size_bytes = 0);
// Required: define object reference methods
TVM_FFI_DEFINE_OBJECT_REF_METHODS_NOTNULLABLE(ORCJITExecutionSession,
ObjectRef,
ORCJITExecutionSessionObj);
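
Reviewer note: the comment on `memory_manager_` relies on the C++ rule that non-static members are destroyed in reverse declaration order, so a member declared before `jit_` outlives it. The sketch below demonstrates that rule with hypothetical `Tracer` and `Session` types that merely record the order of destruction; they stand in for the real members and are not part of the patch.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Appends its name to a shared log when destroyed.
struct Tracer {
  Tracer(std::string n, std::vector<std::string>* l) : name(std::move(n)), log(l) {}
  ~Tracer() { log->push_back(name); }
  std::string name;
  std::vector<std::string>* log;
};

// Mirrors the member layout in ORCJITExecutionSessionObj: the memory
// manager is declared first, so it is destroyed last.
struct Session {
  explicit Session(std::vector<std::string>* log)
      : memory_manager(std::make_unique<Tracer>("memory_manager", log)),
        jit(std::make_unique<Tracer>("jit", log)) {}
  std::unique_ptr<Tracer> memory_manager;
  std::unique_ptr<Tracer> jit;
};

// Builds and destroys a Session, returning the observed destruction order.
inline std::vector<std::string> DestructionOrder() {
  std::vector<std::string> log;
  { Session s(&log); }
  return log;
}
```

The recorded order is `jit` first, then `memory_manager`, which is what lets JIT teardown still deallocate through the memory manager.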
diff --git a/addons/tvm_ffi_orcjit/tests/sources/c/fake_fatbin.c
b/addons/tvm_ffi_orcjit/tests/sources/c/fake_fatbin.c
new file mode 100644
index 0000000..867ac73
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/c/fake_fatbin.c
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*
+ * Simulates an NVCC-compiled object with a large .nv_fatbin device blob.
+ * The fatbin data is referenced only by absolute relocations (R_X86_64_64 /
+ * R_AARCH64_ABS64), never by PC-relative relocations. This lets us test
+ * overflow-region classification without needing a real CUDA toolchain.
+ *
+ * KEY DETAIL: References go through a pointer in .data (generates
+ * R_AARCH64_ABS64 / R_X86_64_64), not via ADRP/RIP-relative. This
+ * mirrors real NVCC output where __NV_fatbin_* uses absolute relocations.
+ */
+#include <stdint.h>
+#include <tvm/ffi/c_api.h>
+
+#ifdef __APPLE__
+__attribute__((section("__DATA,.nv_fatbin"), used))
+#else
+__attribute__((section(".nv_fatbin"), used))
+#endif
+static const uint8_t fake_fatbin_data[4 * 1024 * 1024] = {0};
+
+/* Indirect reference: .data holds an absolute-relocation pointer to
+ .nv_fatbin. Code accesses .data via PC-relative (ADRP / RIP), and
+ .data→.nv_fatbin is absolute. No PC-relative edge crosses from any
+ section to .nv_fatbin, matching real NVCC objects. */
+static const void* const fatbin_ptr = fake_fatbin_data;
+static const uint64_t fatbin_size = sizeof(fake_fatbin_data);
+
+/* get_fatbin_size: return the size of the fake fatbin blob. */
+TVM_FFI_DLL_EXPORT int __tvm_ffi_get_fatbin_size(void* self, const TVMFFIAny*
args,
+ int32_t num_args, TVMFFIAny*
result) {
+ result->type_index = kTVMFFIInt;
+ result->zero_padding = 0;
+ result->v_int64 = (int64_t)fatbin_size;
+ return 0;
+}
+
+/* get_fatbin_addr: return the address of the fake fatbin data.
+ Used by tests to verify overflow sections land outside the arena. */
+TVM_FFI_DLL_EXPORT int __tvm_ffi_get_fatbin_addr(void* self, const TVMFFIAny*
args,
+ int32_t num_args, TVMFFIAny*
result) {
+ result->type_index = kTVMFFIInt;
+ result->zero_padding = 0;
+ result->v_int64 = (int64_t)(uintptr_t)fatbin_ptr;
+ return 0;
+}
diff --git a/addons/tvm_ffi_orcjit/tests/sources/c/test_addr.c
b/addons/tvm_ffi_orcjit/tests/sources/c/test_addr.c
new file mode 100644
index 0000000..c57281b
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/c/test_addr.c
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*
+ * Returns the code address of this function — for arena co-location tests.
+ * Load into multiple libraries to verify they land in the same arena region.
+ */
+#include <stdint.h>
+#include <tvm/ffi/c_api.h>
+
+TVM_FFI_DLL_EXPORT int __tvm_ffi_code_address(void* self, const TVMFFIAny*
args, int32_t num_args,
+ TVMFFIAny* result) {
+ result->type_index = kTVMFFIInt;
+ result->zero_padding = 0;
+ result->v_int64 = (int64_t)(uintptr_t)&__tvm_ffi_code_address;
+ return 0;
+}
diff --git a/addons/tvm_ffi_orcjit/tests/sources/c/test_hidden_caller.c
b/addons/tvm_ffi_orcjit/tests/sources/c/test_hidden_caller.c
new file mode 100644
index 0000000..4e33285
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/c/test_hidden_caller.c
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*
+ * Caller library for ADRP overflow test (LLVM issue #173269).
+ *
+ * Takes the ADDRESS of hidden_helper_add via ADRP+ADD (no GOT,
+ * because of hidden visibility). When this object and
+ * test_hidden_helper.o are in different mmap allocations >4GB
+ * apart, the ADRP immediate overflows — silent truncation on
+ * AArch64 causes a segfault.
+ *
+ * The arena memory manager fixes this by placing all objects
+ * in contiguous VA space (<< 4GB).
+ */
+#include <stdint.h>
+#include <tvm/ffi/c_api.h>
+
+/* Same hidden declaration — compiler uses ADRP+ADD to take address */
+__attribute__((visibility("hidden"))) extern int64_t hidden_helper_add(int64_t
a, int64_t b);
+
+typedef int64_t (*binop_t)(int64_t, int64_t);
+
+/* call_hidden_add: take address of hidden_helper_add, then call via pointer.
+ On AArch64, generates:
+ ADRP x0, hidden_helper_add (R_AARCH64_ADR_PREL_PG_HI21, ±4GB)
+ ADD x0, x0, :lo12:hidden_helper_add (R_AARCH64_ADD_ABS_LO12_NC)
+ When hidden_helper_add is in a different allocation >4GB away, ADRP
overflows. */
+TVM_FFI_DLL_EXPORT int __tvm_ffi_call_hidden_add(void* self, const TVMFFIAny*
args,
+ int32_t num_args, TVMFFIAny*
result) {
+ volatile binop_t fn = &hidden_helper_add;
+ result->type_index = kTVMFFIInt;
+ result->zero_padding = 0;
+ result->v_int64 = fn(args[0].v_int64, args[1].v_int64);
+ return 0;
+}
+
+/* Return the address of this function's code — for co-location tests */
+TVM_FFI_DLL_EXPORT int __tvm_ffi_caller_code_address(void* self, const
TVMFFIAny* args,
+ int32_t num_args,
TVMFFIAny* result) {
+ result->type_index = kTVMFFIInt;
+ result->zero_padding = 0;
+ result->v_int64 = (int64_t)(uintptr_t)&__tvm_ffi_caller_code_address;
+ return 0;
+}
diff --git a/addons/tvm_ffi_orcjit/tests/sources/c/test_hidden_helper.c
b/addons/tvm_ffi_orcjit/tests/sources/c/test_hidden_helper.c
new file mode 100644
index 0000000..811218e
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/c/test_hidden_helper.c
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*
+ * Helper library for ADRP overflow test.
+ * Defines a hidden-visibility function whose ADDRESS is taken
+ * by test_hidden_caller.c. On AArch64, the caller uses
+ * ADRP+ADD (no GOT) to compute the address — this overflows
+ * when the two objects are in different allocations >4GB apart.
+ *
+ * Reference: LLVM issue #173269
+ */
+#include <stdint.h>
+#include <tvm/ffi/c_api.h>
+
+/* Hidden visibility: caller uses ADRP+ADD instead of GOT */
+__attribute__((visibility("hidden"))) int64_t hidden_helper_add(int64_t a, int64_t b) {
+ return a + b;
+}
+
+/* Export a TVM FFI function that calls hidden_helper_add directly */
+TVM_FFI_DLL_EXPORT int __tvm_ffi_hidden_add(void* self, const TVMFFIAny* args, int32_t num_args,
+ TVMFFIAny* result) {
+ result->type_index = kTVMFFIInt;
+ result->zero_padding = 0;
+ result->v_int64 = hidden_helper_add(args[0].v_int64, args[1].v_int64);
+ return 0;
+}
+
+/* Return the address of this function's code — for co-location tests */
+TVM_FFI_DLL_EXPORT int __tvm_ffi_helper_code_address(void* self, const TVMFFIAny* args,
+ int32_t num_args, TVMFFIAny* result) {
+ result->type_index = kTVMFFIInt;
+ result->zero_padding = 0;
+ result->v_int64 = (int64_t)(uintptr_t)&__tvm_ffi_helper_code_address;
+ return 0;
+}
diff --git a/addons/tvm_ffi_orcjit/tests/sources/cc/fake_fatbin.cc b/addons/tvm_ffi_orcjit/tests/sources/cc/fake_fatbin.cc
new file mode 100644
index 0000000..8fc1636
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/cc/fake_fatbin.cc
@@ -0,0 +1,64 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Simulates an NVCC-compiled object with a large .nv_fatbin device blob.
+// The fatbin data is referenced only by absolute relocations (R_*_64 /
+// R_AARCH64_ABS64), never by PC-relative relocations. This lets us test
+// overflow-region classification without needing a real CUDA toolchain.
+//
+// KEY DETAIL: References go through a pointer in .data (generates
+// R_AARCH64_ABS64 / R_X86_64_64), not via ADRP/RIP-relative. This
+// mirrors real NVCC output where __NV_fatbin_* uses absolute relocations.
+
+#include <tvm/ffi/c_api.h>
+
+#include <cstdint>
+
+#ifdef __APPLE__
+__attribute__((section("__DATA,.nv_fatbin"), used))
+#else
+__attribute__((section(".nv_fatbin"), used))
+#endif
+static const uint8_t fake_fatbin_data[4 * 1024 * 1024] = {0};
+
+// Indirect reference: .data holds an absolute-relocation pointer to
+// .nv_fatbin. Code accesses .data via PC-relative (ADRP / RIP), and
+// .data→.nv_fatbin is absolute. No PC-relative edge crosses from any
+// section to .nv_fatbin, matching real NVCC objects.
+static const void* const fatbin_ptr = fake_fatbin_data;
+static const uint64_t fatbin_size = sizeof(fake_fatbin_data);
+
+// get_fatbin_size: return the size of the fake fatbin blob.
+extern "C" {
+TVM_FFI_DLL_EXPORT int __tvm_ffi_get_fatbin_size(void* self, const TVMFFIAny* args,
+ int32_t num_args, TVMFFIAny* result) {
+ result->type_index = kTVMFFIInt;
+ result->zero_padding = 0;
+ result->v_int64 = static_cast<int64_t>(fatbin_size);
+ return 0;
+}
+
+// get_fatbin_addr: return the address of the fake fatbin data.
+// Used by tests to verify overflow sections land outside the arena.
+TVM_FFI_DLL_EXPORT int __tvm_ffi_get_fatbin_addr(void* self, const TVMFFIAny* args,
+ int32_t num_args, TVMFFIAny* result) {
+ result->type_index = kTVMFFIInt;
+ result->zero_padding = 0;
+ result->v_int64 = reinterpret_cast<int64_t>(fatbin_ptr);
+ return 0;
+}
+}
diff --git a/addons/tvm_ffi_orcjit/tests/sources/cc/test_addr.cc b/addons/tvm_ffi_orcjit/tests/sources/cc/test_addr.cc
new file mode 100644
index 0000000..43bb504
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/cc/test_addr.cc
@@ -0,0 +1,26 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Returns the code address of this function — for arena co-location tests.
+// Load into multiple libraries to verify they land in the same arena region.
+
+#include <tvm/ffi/function.h>
+
+#include <cstdint>
+
+int64_t code_address_impl() { return reinterpret_cast<int64_t>(&code_address_impl); }
+TVM_FFI_DLL_EXPORT_TYPED_FUNC(code_address, code_address_impl);
diff --git a/addons/tvm_ffi_orcjit/tests/sources/cc/test_hidden_caller.cc b/addons/tvm_ffi_orcjit/tests/sources/cc/test_hidden_caller.cc
new file mode 100644
index 0000000..babd6d3
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/cc/test_hidden_caller.cc
@@ -0,0 +1,52 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Caller library for ADRP overflow test (LLVM issue #173269).
+//
+// Takes the ADDRESS of hidden_helper_add via ADRP+ADD (no GOT,
+// because of hidden visibility). When this object and
+// test_hidden_helper.o are in different mmap allocations >4GB
+// apart, the ADRP immediate overflows — silent truncation on
+// AArch64 causes a segfault.
+//
+// The arena memory manager fixes this by placing all objects
+// in contiguous VA space (<< 4GB).
+
+#include <tvm/ffi/c_api.h>
+
+#include <cstdint>
+
+// Same hidden declaration — compiler uses ADRP+ADD to take address
+__attribute__((visibility("hidden"))) extern int64_t hidden_helper_add(int64_t a, int64_t b);
+
+using binop_t = int64_t (*)(int64_t, int64_t);
+
+// call_hidden_add: take address of hidden_helper_add, then call via pointer.
+// On AArch64, generates:
+// ADRP x0, hidden_helper_add@PAGE (R_AARCH64_ADR_PREL_PG_HI21, ±4GB)
+// ADD x0, x0, hidden_helper_add@PAGEOFF (R_AARCH64_ADD_ABS_LO12_NC)
+// When hidden_helper_add is in a different allocation >4GB away, ADRP overflows.
+extern "C" {
+TVM_FFI_DLL_EXPORT int __tvm_ffi_call_hidden_add(void* self, const TVMFFIAny* args,
+ int32_t num_args, TVMFFIAny* result) {
+ volatile binop_t fn = &hidden_helper_add;
+ result->type_index = kTVMFFIInt;
+ result->zero_padding = 0;
+ result->v_int64 = fn(args[0].v_int64, args[1].v_int64);
+ return 0;
+}
+}
diff --git a/addons/tvm_ffi_orcjit/tests/sources/cc/test_hidden_helper.cc b/addons/tvm_ffi_orcjit/tests/sources/cc/test_hidden_helper.cc
new file mode 100644
index 0000000..8de2de1
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/cc/test_hidden_helper.cc
@@ -0,0 +1,44 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Helper library for ADRP overflow test.
+// Defines a hidden-visibility function whose ADDRESS is taken
+// by test_hidden_caller.cc. On AArch64, the caller uses
+// ADRP+ADD (no GOT) to compute the address — this overflows
+// when the two objects are in different allocations >4GB apart.
+//
+// Reference: LLVM issue #173269
+
+#include <tvm/ffi/c_api.h>
+
+#include <cstdint>
+
+// Hidden visibility: caller uses ADRP+ADD instead of GOT
+__attribute__((visibility("hidden"))) int64_t hidden_helper_add(int64_t a, int64_t b) {
+ return a + b;
+}
+
+// Export a TVM FFI function that calls hidden_helper_add directly
+extern "C" {
+TVM_FFI_DLL_EXPORT int __tvm_ffi_hidden_add(void* self, const TVMFFIAny* args, int32_t num_args,
+ TVMFFIAny* result) {
+ result->type_index = kTVMFFIInt;
+ result->zero_padding = 0;
+ result->v_int64 = hidden_helper_add(args[0].v_int64, args[1].v_int64);
+ return 0;
+}
+}
diff --git a/addons/tvm_ffi_orcjit/tests/test_arena.py b/addons/tvm_ffi_orcjit/tests/test_arena.py
new file mode 100644
index 0000000..a5ad22a
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/test_arena.py
@@ -0,0 +1,674 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Tests for JIT memory arena — verifies co-location and relocation safety.
+
+Background
+----------
+LLVM ORC JIT v2 uses ``InProcessMemoryMapper`` (backed by
+``MapperJITLinkMemoryManager``) to allocate JIT memory. Each allocation
+is a separate ``mmap(MAP_ANONYMOUS)`` call whose address the kernel picks.
+Under virtual-address (VA) pressure — leaked slabs from failed
+materializations, long-running pytest sessions holding tracebacks, or
+simply a fragmented address space — the kernel can place successive
+allocations far apart.
+
+This matters for **PC-relative relocations with limited range**:
+
+- **x86_64 R_X86_64_PC32 / Delta32**: ±2 GB range. GCC-compiled C++
+ objects reference ``__dso_handle`` (used by ``__cxa_atexit`` for DSO
+ identification) via PC32 when the symbol has hidden visibility.
+ LLVM's ``ELFNixPlatform`` defines ``__dso_handle`` per JITDylib in a
+ separate ``DSOHandleMaterializationUnit`` — a tiny ``LinkGraph``
+ allocated independently of the code that references it. If those two
+ allocations land >2 GB apart, the Delta32 fixup overflows.
+
+- **AArch64 ADRP+ADD**: ±4 GB range. Hidden-visibility cross-object
+ calls use ADRP (page-relative) which has the same scatter problem
+ at a wider threshold.
+
+The **arena memory manager** solves this by pre-reserving a contiguous
+VA region (default 4 GB x86_64 / 8 GB AArch64, with fallback to smaller
+sizes) via ``mmap(PROT_NONE)`` and bump-allocating within it, guaranteeing
+all JIT allocations stay within relocation range regardless of external
+VA pressure.
+
+Note on ``-fPIC`` vs ``-fpie``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+With ``-fPIC`` (the default for shared-library code), GCC may use
+``R_X86_64_GOTPCRELX`` (GOT-relative, load through the GOT) for
+hidden-visibility externals like ``__dso_handle``. GOT entries are
+co-located with code, so there is no ±2 GB range issue. With ``-fpie``
+(position-independent executable), GCC prefers the shorter direct
+``R_X86_64_PC32``, which *does* have the ±2 GB limit. The Delta32
+overflow tests (test 7) therefore build with ``-fpie`` to force the
+problematic relocation type.
+
+Test structure
+--------------
+1. **Co-location** (test 1): arena keeps objects within 16 MB.
+2. **Scatter baseline** (test 2): without arena, VA blocker pushes
+ objects >128 MB apart — proves arena is responsible for co-location.
+3. **Hidden-symbol calls** (test 3): ADRP/PC32 cross-object calls
+ succeed under VA pressure with arena.
+4. **Large data section** (test 4): 4 MB ``.nv_fatbin`` section loads
+ correctly within the arena.
+5. **Overflow section** (test 5): ``.nv_fatbin`` data is allocated
+ outside the arena via separate mmap.
+6. **Leaked materialization** (test 6): ``__dso_handle`` resolves after
+ prior sessions leaked mmap slabs from failed materializations.
+7. **Delta32 overflow** (test 7): ``-fpie`` GCC objects + 3 GB VA
+ blocker. With arena → PASSES; without arena → Delta32 overflow.
+
+All tests use a small arena (16 MB) and 256 MB-3 GB VA blockers -- safe
+for CI containers.
+"""
+
+from __future__ import annotations
+
+import ctypes
+import ctypes.util
+import functools
+import platform
+import sys
+from pathlib import Path
+
+import pytest
+from tvm_ffi_orcjit import ExecutionSession
+from utils import build_test_objects
+
+# ---------------------------------------------------------------------------
+# Setup
+# ---------------------------------------------------------------------------
+
+OBJ_DIR = build_test_objects()
+
+_KNOWN_SUBDIRS = [
+ "c",
+ "c-gcc",
+ "cc",
+ "cc-gcc",
+ "cc-gcc-pie",
+ "c-appleclang",
+ "cc-appleclang",
+ "c-msvc",
+ "c-clang-cl",
+]
+
+_PIE_VARIANT_MARKER = "-pie"
+
+
+def obj(name: str) -> str:
+ """Return path to a pre-built test object file, or skip if missing."""
+ path = OBJ_DIR / f"{name}.o"
+ if not path.exists():
+ pytest.skip(f"{path.name} not found (not built)")
+ return str(path)
+
+
+def _discover_c_variants() -> list[str]:
+ """Discover available C-only compiler variants."""
+ return [
+ s
+ for s in _KNOWN_SUBDIRS
+ if s.startswith("c")
+ and not s.startswith("cc")
+ and _PIE_VARIANT_MARKER not in s
+ and (OBJ_DIR / s / "test_funcs.o").exists()
+ ]
+
+
+def _discover_cpp_variants() -> list[str]:
+ """Discover available C++ compiler variants (for __dso_handle tests)."""
+ return [
+ s
+ for s in _KNOWN_SUBDIRS
+ if s.startswith("cc")
+ and _PIE_VARIANT_MARKER not in s
+ and (OBJ_DIR / s / "test_funcs.o").exists()
+ ]
+
+
+def _discover_gcc_cpp_variants() -> list[str]:
+ """Discover GCC C++ variants (emit R_X86_64_PC32 for __dso_handle)."""
+ return [v for v in _discover_cpp_variants() if "gcc" in v]
+
+
+def _discover_pie_cpp_variants() -> list[str]:
+ """Discover PIE C++ variants built with -fpie.
+
+ PIE objects force R_X86_64_PC32 (direct, ±2GB) for __dso_handle
+ instead of R_X86_64_GOTPCRELX (GOT-relative, unlimited range).
+ Used exclusively by the Delta32 overflow tests (test 7).
+ """
+ return [
+ s
+ for s in _KNOWN_SUBDIRS
+ if _PIE_VARIANT_MARKER in s and (OBJ_DIR / s / "test_funcs.o").exists()
+ ]
+
+
+_c_variants = _discover_c_variants()
+_cpp_variants = _discover_cpp_variants()
+_gcc_cpp_variants = _discover_gcc_cpp_variants()
+_pie_cpp_variants = _discover_pie_cpp_variants()
+_all_variants = _c_variants + _cpp_variants
+
+_is_linux = sys.platform == "linux"
+_is_x86_64 = platform.machine() in ("x86_64", "AMD64")
+
+# Arena test parameters
+_ARENA_SIZE = 16 * 1024 * 1024 # 16MB — small arena for testing
+_BLOCK_RADIUS = 256 * 1024 * 1024 # 256MB — safe for CI containers
+_DSO_BLOCK_RADIUS = 3 * 1024 * 1024 * 1024 # 3GB — needed to overflow PC32 (±2GB)
+
+_PROT_NONE = 0
+_MAP_PRIVATE_ANON = 0x22 # MAP_PRIVATE | MAP_ANONYMOUS
+_MAP_FIXED_NOREPLACE = 0x100000
+
+
+# ---------------------------------------------------------------------------
+# VA blocker — fills nearby free VA gaps to force distant mmap placement
+# ---------------------------------------------------------------------------
+
+
+@functools.lru_cache(maxsize=1)
+def _get_libc() -> ctypes.CDLL:
+ """Get a ctypes handle to libc with correct mmap/munmap signatures."""
+ libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)
+ libc.mmap.restype = ctypes.c_void_p
+ libc.mmap.argtypes = [
+ ctypes.c_void_p,
+ ctypes.c_size_t,
+ ctypes.c_int,
+ ctypes.c_int,
+ ctypes.c_int,
+ ctypes.c_long,
+ ]
+ libc.munmap.restype = ctypes.c_int
+ libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
+ return libc
+
+
+def _parse_maps() -> list[tuple[int, int]]:
+ """Parse /proc/self/maps into sorted list of (start, end) tuples."""
+ regions = []
+ with Path("/proc/self/maps").open() as f:
+ for line in f:
+ addrs = line.split()[0].split("-")
+ regions.append((int(addrs[0], 16), int(addrs[1], 16)))
+ return sorted(regions)
+
+
+def _find_new_mappings(
+ before: set[tuple[int, int]], after: list[tuple[int, int]]
+) -> list[tuple[int, int]]:
+ """Find mappings present in *after* but not in *before*."""
+ return [(s, e) for s, e in after if (s, e) not in before]
+
+
+def block_nearby_va(center: int, radius: int = _BLOCK_RADIUS) -> list[tuple[int, int]]:
+ """Block all free VA gaps within *radius* of *center*.
+
+ Uses MAP_FIXED_NOREPLACE to place PROT_NONE mappings in every free gap
+ within [center - radius, center + radius]. This forces subsequent
+ mmap(NULL, ...) calls to land outside the blocked region.
+
+ Returns list of (addr, size) blockers to be freed later.
+ """
+ libc = _get_libc()
+ maps = _parse_maps()
+ blockers = []
+ low = max(center - radius, 0)
+ high = center + radius
+
+ for i in range(len(maps) - 1):
+ gap_start = maps[i][1]
+ gap_end = maps[i + 1][0]
+ if gap_end <= low or gap_start >= high or gap_end <= gap_start:
+ continue
+ block_start = max(gap_start, low)
+ block_end = min(gap_end, high)
+ block_size = block_end - block_start
+ if block_size <= 0:
+ continue
+ addr = libc.mmap(
+ block_start, block_size, _PROT_NONE, _MAP_PRIVATE_ANON | _MAP_FIXED_NOREPLACE, -1, 0
+ )
+ if addr != ctypes.c_void_p(-1).value and addr is not None:
+ blockers.append((addr, block_size))
+
+ return blockers
+
+
+def free_blockers(blockers: list[tuple[int, int]]) -> None:
+ """Free all VA blockers."""
+ libc = _get_libc()
+ for addr, size in blockers:
+ libc.munmap(addr, size)
+
+
+# ---------------------------------------------------------------------------
+# Test 1: Arena co-location — objects stay within arena range
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.parametrize("variant", _all_variants)
+def test_arena_colocation(variant: str) -> None:
+ """With arena, objects in separate libraries have close code addresses.
+
+ Uses a 16MB arena and inserts a 256MB VA blocker between object loads.
+ Without the arena, the blocker would push the second object far away.
+ With the arena, both objects land within the 16MB region.
+ """
+ maps_before = set(_parse_maps())
+
+ session = ExecutionSession(arena_size=_ARENA_SIZE)
+ lib1 = session.create_library("lib1")
+ lib1.add(obj(f"{variant}/test_addr"))
+ addr1 = lib1.get_function("code_address")()
+
+ # Find where LLVM placed the first allocation and block nearby VA
+ maps_after = _parse_maps()
+ new_maps = _find_new_mappings(maps_before, maps_after)
+ jit_center = max(s for s, e in new_maps) if new_maps else addr1
+
+ blockers = block_nearby_va(jit_center)
+ try:
+ lib2 = session.create_library("lib2")
+ lib2.add(obj(f"{variant}/test_addr"))
+ addr2 = lib2.get_function("code_address")()
+ finally:
+ free_blockers(blockers)
+
+ distance = abs(addr1 - addr2)
+ assert distance < _ARENA_SIZE, (
+ f"Objects should be within {_ARENA_SIZE} bytes, "
+ f"but distance is {distance} ({distance / (1024**2):.1f} MB)"
+ )
+
+
+# ---------------------------------------------------------------------------
+# Test 2: Arena effect — compare with-arena vs without-arena under VA pressure
+# ---------------------------------------------------------------------------
+
+
+def _measure_distance_under_pressure(
+ variant: str, arena_size: int, radius: int = _BLOCK_RADIUS
+) -> tuple[int | None, bool]:
+ """Load two objects under VA pressure and return (distance, overflowed).
+
+ Returns ``(distance_bytes, False)`` when both objects load successfully,
+ or ``(None, True)`` when the second load fails with a relocation overflow
+ (Page21 on AArch64, Delta32 on x86_64).
+ """
+ maps_before = set(_parse_maps())
+
+ session = ExecutionSession(arena_size=arena_size)
+ lib1 = session.create_library("lib1")
+ lib1.add(obj(f"{variant}/test_addr"))
+ addr1 = lib1.get_function("code_address")()
+
+ maps_after = _parse_maps()
+ new_maps = _find_new_mappings(maps_before, maps_after)
+ jit_center = max(s for s, e in new_maps) if new_maps else addr1
+
+ blockers = block_nearby_va(jit_center, radius=radius)
+ try:
+ lib2 = session.create_library("lib2")
+ try:
+ lib2.add(obj(f"{variant}/test_addr"))
+ addr2 = lib2.get_function("code_address")()
+ except Exception:
+ return None, True
+ finally:
+ free_blockers(blockers)
+
+ return abs(addr1 - addr2), False
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.parametrize("variant", _all_variants)
+def test_arena_keeps_objects_close(variant: str) -> None:
+ """Arena co-locates objects that would otherwise scatter or overflow.
+
+ Runs the same workload twice under identical VA pressure — once with
+ the arena and once without — and compares the outcomes:
+
+ - **With arena**: both objects must land within the arena size (16 MB).
+ - **Without arena**: the blocker should either cause a relocation
+ overflow (proving scatter beyond relocation range) or produce a
+ measurably larger distance.
+
+ The test proves the arena is responsible for co-location by showing a
+ strictly better outcome with it enabled. If the VA blocker happens to
+ be ineffective (e.g., LLVM slab reuse on 64k-page kernels), the test
+ still passes as long as the arena keeps objects within range.
+ """
+ # Phase 1: with arena — must always succeed and be within arena range
+ arena_dist, arena_overflow = _measure_distance_under_pressure(variant, arena_size=_ARENA_SIZE)
+ assert not arena_overflow, "Arena session should not overflow"
+ assert arena_dist is not None
+ assert arena_dist < _ARENA_SIZE, (
+ f"With arena, objects should be within {_ARENA_SIZE} bytes, "
+ f"but distance is {arena_dist} ({arena_dist / (1024**2):.1f} MB)"
+ )
+
+ # Phase 2: without arena — expect scatter or overflow
+ no_arena_dist, no_arena_overflow = _measure_distance_under_pressure(variant, arena_size=-1)
+
+ if no_arena_overflow:
+ # Relocation overflow without arena proves the blocker forced
+ # scatter beyond relocation range — arena prevented this.
+ return
+
+ assert no_arena_dist is not None
+ if no_arena_dist > arena_dist:
+ # Without arena produced a larger distance — arena effect shown.
+ return
+
+ # Blocker was ineffective (both distances are small). The arena
+ # assertion above already passed, which is the key property. We
+ # cannot distinguish arena effect from lucky placement here.
+ pytest.skip(
+ f"VA blocker ineffective: arena={arena_dist / 1024:.0f} KB, "
+ f"no-arena={no_arena_dist / 1024:.0f} KB — "
+ f"cannot demonstrate arena effect on this kernel"
+ )
+
+
+# ---------------------------------------------------------------------------
+# Test 3: Hidden-symbol ADRP/PC32 relocation with arena + blocker
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.parametrize("variant", _all_variants)
+def test_arena_hidden_symbol_with_blocker(variant: str) -> None:
+ """Arena prevents hidden-visibility relocation overflow under VA pressure.
+
+ Loads two objects with hidden-visibility cross-references (ADRP+ADD
+ on AArch64, PC32 on x86_64) with a VA blocker between them.
+ Without arena, the blocker would push objects apart causing overflow.
+ With the arena, both objects are co-located and the call succeeds.
+ """
+ maps_before = set(_parse_maps())
+
+ session = ExecutionSession(arena_size=_ARENA_SIZE)
+ lib = session.create_library("hidden_test")
+
+ # Load helper and force materialization
+ lib.add(obj(f"{variant}/test_hidden_helper"))
+ assert lib.get_function("hidden_add")(1, 2) == 3
+
+ # Block nearby VA to force scatter
+ maps_after = _parse_maps()
+ new_maps = _find_new_mappings(maps_before, maps_after)
+ jit_center = max(s for s, e in new_maps) if new_maps else 0xFFFF00000000
+
+ blockers = block_nearby_va(jit_center)
+ try:
+ lib.add(obj(f"{variant}/test_hidden_caller"))
+ fn = lib.get_function("call_hidden_add")
+ assert fn(10, 20) == 30
+ finally:
+ free_blockers(blockers)
+
+
+# ---------------------------------------------------------------------------
+# Test 4: Large data section (simulated .nv_fatbin)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.parametrize("variant", _all_variants)
+def test_large_data_section(variant: str) -> None:
+ """Load object with a 4MB .nv_fatbin section — basic correctness.
+
+ The .nv_fatbin section is referenced only by absolute relocations,
+ so it can live anywhere. This test verifies the object loads and
+ the function works. The 4MB section fits in the arena.
+ """
+ session = ExecutionSession()
+ lib = session.create_library("fatbin")
+ lib.add(obj(f"{variant}/fake_fatbin"))
+ fn = lib.get_function("get_fatbin_size")
+ assert fn() == 4 * 1024 * 1024
+
+
+# ---------------------------------------------------------------------------
+# Test 5: Overflow section — .nv_fatbin lands outside the arena
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.parametrize("variant", _all_variants)
+def test_overflow_section_outside_arena(variant: str) -> None:
+ """Overflow sections (.nv_fatbin) are allocated outside the arena.
+
+ The arena memory manager detects sections named .nv_fatbin and
+ allocates them via a separate mmap() outside the arena. This keeps
+ the arena compact for code + small rodata, reducing 2MB THP region
+ count and iTLB pressure.
+
+ Verification: get the fatbin data address and the arena VA range
+ from /proc/self/maps, then assert the fatbin address is NOT within
+ the arena region.
+ """
+ session = ExecutionSession(arena_size=_ARENA_SIZE)
+ lib = session.create_library("fatbin_overflow")
+ lib.add(obj(f"{variant}/fake_fatbin"))
+
+ # Verify the function still works correctly.
+ assert lib.get_function("get_fatbin_size")() == 4 * 1024 * 1024
+
+ # Get the actual address of the fatbin data in memory.
+ fatbin_addr = lib.get_function("get_fatbin_addr")()
+
+ # Find the arena mapping: a single large region matching the arena size.
+ # The arena is reserved as PROT_NONE and then committed in slabs, so
+ # look for the contiguous region that spans _ARENA_SIZE.
+ maps = _parse_maps()
+ arena_regions = [(s, e) for s, e in maps if (e - s) >= _ARENA_SIZE]
+
+ # The fatbin address must not fall within any arena-sized region.
+ for start, end in arena_regions:
+ assert not (start <= fatbin_addr < end), (
+ f"Fatbin data at {fatbin_addr:#x} should be OUTSIDE the arena "
+ f"[{start:#x}, {end:#x}) but landed inside"
+ )
+
+
+# ---------------------------------------------------------------------------
+# Test 6: __dso_handle Delta32 overflow after leaked materialization
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _is_linux, reason="ELF/GCC-specific __dso_handle test")
+@pytest.mark.parametrize("variant", _cpp_variants)
+def test_dso_handle_relocation_after_failed_materialization(variant: str) -> None:
+ """__dso_handle resolves correctly after leaked JIT memory.
+
+ Mechanism
+ ---------
+ GCC C++ objects call ``__cxa_atexit(&destructor, &obj, __dso_handle)``
+ for static-storage-duration objects. The ``__dso_handle`` symbol is
+ emitted as ``GLOBAL HIDDEN UND`` in each object file. LLVM's
+ ``ELFNixPlatform`` defines it per JITDylib via a separate
+ ``DSOHandleMaterializationUnit`` — a self-referential pointer block
+ (``void *__dso_handle = &__dso_handle;``) allocated in its own
+ ``LinkGraph`` through ``ObjectLinkingLayer``.
+
+ When a prior ``lib.add()`` fails (e.g., duplicate symbol), LLVM's
+ ``InProcessMemoryMapper`` leaks the mmap'd slab for that failed
+ materialization. If the process holds references to the old session
+ (e.g., pytest keeping ``sys.exc_info()`` tracebacks alive), the
+ leaked slabs accumulate and push subsequent ``mmap`` allocations to
+ higher addresses.
+
+ The arena prevents overflow because all allocations — both
+ ``__dso_handle``'s ``LinkGraph`` and the code ``LinkGraph`` — land
+ within the same contiguous pre-reserved VA region.
+
+ Without arena: may FAIL on x86_64 with GCC PIE objects after
+ repeated leaked materializations push slabs >2 GB
+ apart.
+ With arena: PASSES (all allocations in same arena).
+ """
+ # Step 1: Trigger leaked materializations to consume low VA space.
+ leaked_sessions = []
+ for _ in range(3):
+ s0 = ExecutionSession()
+ lib0 = s0.create_library("warmup")
+ lib0.add(obj(f"{variant}/test_funcs"))
+ lib0.get_function("test_add")(10, 20)
+ try:
+ lib0.add(obj(f"{variant}/test_funcs_conflict"))
+ except Exception:
+ pass
+ leaked_sessions.append((s0, lib0))
+
+ # Step 2: Fresh session — cross-library resolution must still work.
+ session = ExecutionSession()
+ lib1 = session.create_library("lib1")
+ lib1.add(obj(f"{variant}/test_funcs"))
+ assert lib1.get_function("test_add")(10, 20) == 30
+
+ lib2 = session.create_library("lib2")
+ lib2.add(obj(f"{variant}/test_funcs_conflict"))
+ assert lib2.get_function("test_add")(10, 20) == 1030
+
+
+# ---------------------------------------------------------------------------
+# Test 7: __dso_handle Delta32 overflow — arena prevents it (x86_64 PIE)
+#
+# GCC -fpie objects use R_X86_64_PC32 (±2GB) for __dso_handle.
+# ELFNixPlatform's DSOHandleMaterializationUnit allocates __dso_handle
+# in a separate LinkGraph from the code. Under VA pressure, these two
+# allocations can land >2GB apart, overflowing the Delta32 fixup.
+# The arena keeps them co-located within relocation range.
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.skipif(not _is_x86_64, reason="Delta32 overflow requires x86_64")
+@pytest.mark.skipif(not _pie_cpp_variants, reason="No GCC PIE C++ variants built")
+@pytest.mark.parametrize("variant", _pie_cpp_variants or ["skip"])
+def test_dso_handle_delta32_with_arena(variant: str) -> None:
+    """Arena prevents __dso_handle Delta32 overflow under VA pressure.
+
+    Root cause
+    ----------
+    GCC C++ objects built with ``-fpie`` emit ``R_X86_64_PC32`` (Delta32,
+    ±2 GB) relocations for ``__dso_handle`` because the symbol has hidden
+    visibility and ``-fpie`` prefers direct PC-relative over GOT-relative.
+    (With ``-fPIC``, GCC uses ``R_X86_64_GOTPCRELX`` which goes through
+    the GOT — always co-located with code, so no range issue.)
+
+    ``ELFNixPlatform`` defines ``__dso_handle`` per JITDylib in a separate
+    ``DSOHandleMaterializationUnit``. This creates a tiny ``LinkGraph``
+    (a self-referential pointer: ``void *__dso_handle = &__dso_handle;``)
+    that is allocated through ``ObjectLinkingLayer`` independently of the
+    code ``LinkGraph`` from ``lib.add()``. Both allocations go through
+    ``InProcessMemoryMapper`` → ``mmap(MAP_ANONYMOUS)``, whose placement
+    the kernel decides.
+
+    Test strategy
+    -------------
+    1. Create a session with arena enabled (16 MB).
+    2. Load PIE GCC objects into lib1 — this triggers materialization of
+       both ``__dso_handle`` (via ``DSOHandleMaterializationUnit``) and
+       the code (via ``lib.add``), all within the arena.
+    3. Block 3 GB of VA around the first allocation — without arena this
+       would force the next ``mmap`` to land >2 GB away.
+    4. Load a second PIE GCC object into lib2 — with arena, this still
+       lands within the 16 MB region.
+    5. Assert the function call succeeds — proves Delta32 is in range.
+
+    See ``test_dso_handle_delta32_overflow_without_arena`` for the
+    counterpart proving the overflow occurs without arena.
+    """
+    maps_before = set(_parse_maps())
+
+    session = ExecutionSession(arena_size=_ARENA_SIZE)
+    lib1 = session.create_library("lib1")
+    lib1.add(obj(f"{variant}/test_funcs"))
+    assert lib1.get_function("test_add")(10, 20) == 30
+
+    # Block 3GB of VA around the first allocation to force scatter
+    maps_after = _parse_maps()
+    new_maps = _find_new_mappings(maps_before, maps_after)
+    jit_center = max(s for s, e in new_maps) if new_maps else 0xFFFF00000000
+
+    blockers = block_nearby_va(jit_center, radius=_DSO_BLOCK_RADIUS)
+    try:
+        lib2 = session.create_library("lib2")
+        lib2.add(obj(f"{variant}/test_funcs_conflict"))
+        assert lib2.get_function("test_add")(10, 20) == 1030
+    finally:
+        free_blockers(blockers)
+
+
[email protected](not _is_linux, reason="Arena is Linux-only")
[email protected](not _is_x86_64, reason="Delta32 overflow requires x86_64")
[email protected](not _pie_cpp_variants, reason="No GCC PIE C++ variants built")
[email protected]("variant", _pie_cpp_variants or ["skip"])
+def test_dso_handle_delta32_overflow_without_arena(variant: str) -> None:
+    """Without arena, PIE __dso_handle PC32 overflows under VA pressure.
+
+    Same setup as ``test_dso_handle_delta32_with_arena`` but with arena
+    disabled (``arena_size=-1``).
+
+    The 3 GB VA blocker fills all free gaps within ±3 GB of the first
+    session's JIT allocations. When lib2 is loaded, ``InProcessMemoryMapper``
+    calls ``mmap(MAP_ANONYMOUS)`` for a new slab, but the only free VA is
+    >3 GB away. The code ``LinkGraph`` from ``lib2.add()`` lands in that
+    distant slab, while ``__dso_handle`` was already materialized with
+    lib1's ``DSOHandleMaterializationUnit`` in the original region. The
+    ``R_X86_64_PC32`` fixup from code to ``__dso_handle`` now exceeds
+    ±2 GB → JITLink reports ``Delta32 fixup ... is out of range``.
+
+    The test accepts both outcomes:
+    - **Exception** (PC32 overflow): proves the arena is needed.
+    - **Success** (GOTPCRELX used): GCC chose GOT-relative despite
+      ``-fpie`` — no overflow possible, but the arena is still
+      beneficial for other relocation types.
+    """
+    maps_before = set(_parse_maps())
+
+    session = ExecutionSession(arena_size=-1)  # arena disabled
+    lib1 = session.create_library("lib1")
+    lib1.add(obj(f"{variant}/test_addr"))
+    lib1.get_function("code_address")()
+
+    maps_after = _parse_maps()
+    new_maps = _find_new_mappings(maps_before, maps_after)
+    jit_center = max(s for s, e in new_maps) if new_maps else 0xFFFF00000000
+
+    blockers = block_nearby_va(jit_center, radius=_DSO_BLOCK_RADIUS)
+    try:
+        lib2 = session.create_library("lib2")
+        try:
+            lib2.add(obj(f"{variant}/test_funcs_conflict"))
+            result = lib2.get_function("test_add")(10, 20)
+            # If we get here, GCC used GOTPCRELX — no overflow.
+            assert result == 1030
+        except Exception:
+            # R_X86_64_PC32 overflow as expected — proves arena is needed.
+            pass
+    finally:
+        free_blockers(blockers)
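
The range arithmetic behind the two `__dso_handle` tests above can be sketched as follows. This is a minimal illustration, assuming the standard `R_X86_64_PC32` computation (symbol + addend - fixup address, stored as a signed 32-bit value). It is not JITLink's implementation; the function names and addresses are hypothetical and chosen only to mirror the 16 MB arena and 3 GB blocker used in the tests.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1

MB = 1 << 20
GB = 1 << 30


def pc32_value(target: int, fixup: int, addend: int = 0) -> int:
    """Value a PC-relative 32-bit fixup would store: S + A - P."""
    return target + addend - fixup


def fits_delta32(target: int, fixup: int, addend: int = 0) -> bool:
    """True if the fixup value is representable as a signed 32-bit integer."""
    return INT32_MIN <= pc32_value(target, fixup, addend) <= INT32_MAX


# Hypothetical addresses, for illustration only.
arena_base = 0x7F00_0000_0000
dso_handle = arena_base + 15 * MB   # __dso_handle somewhere in a 16 MB arena
code_in_arena = arena_base + 1 * MB  # code placed in the same arena
code_far_away = arena_base + 3 * GB  # code pushed 3 GB away by VA pressure

# Typical addend for a 4-byte PC-relative field at the end of an instruction.
ADDEND = -4

print(fits_delta32(dso_handle, code_in_arena, ADDEND))  # same arena
print(fits_delta32(dso_handle, code_far_away, ADDEND))  # scattered mmap
```

Any two addresses inside one 16 MB region differ by far less than 2 GB, so the first check passes regardless of where the bump allocator places each graph. The second check fails because the distance exceeds the signed 32-bit range, which is the condition the no-arena test tries to provoke.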
diff --git a/addons/tvm_ffi_orcjit/tests/utils.py b/addons/tvm_ffi_orcjit/tests/utils.py
index e1fbc96..2a90ff2 100644
--- a/addons/tvm_ffi_orcjit/tests/utils.py
+++ b/addons/tvm_ffi_orcjit/tests/utils.py
@@ -58,12 +58,20 @@ def _extra_cflags() -> list[str]:
     return []
 
 
+def _extra_cuda_cflags() -> list[str]:
+    machine = platform.machine()
+    if machine in ("aarch64", "arm64"):
+        return ["-Xcompiler", "-mno-outline-atomics"]
+    return []
+
+
 def _build_objects(
     src_dir: Path,
     out_dir: Path,
     *,
     ext_glob: str,
     extra_cflags: list[str],
+    extra_cuda_cflags: list[str] | None = None,
 ) -> None:
     """Compile all sources in *src_dir* to object files in *out_dir*."""
     out_dir.mkdir(parents=True, exist_ok=True)
@@ -78,6 +86,7 @@ def _build_objects(
sources=[str(src)],
output=f"{src.stem}.o",
extra_cflags=extra_cflags,
+ extra_cuda_cflags=extra_cuda_cflags or [],
build_directory=str(build_dir),
)
shutil.copy2(obj_path, dest)
@@ -183,6 +192,17 @@ def build_test_objects(out_dir: Path | None = None) -> Path:
c_outdir=out_dir / "c-gcc",
cc_outdir=out_dir / "cc-gcc",
)
+ # PIE variant: -fpie forces R_X86_64_PC32 for hidden-visibility
+ # externals like __dso_handle (instead of GOTPCRELX with -fPIC).
+ # Used to reproduce __dso_handle Delta32 overflow on x86_64.
+ _build_variant(
+ "GCC (PIE)",
+ cc=None,
+ cxx="g++",
+ extra_cflags=[*extra, "-fpie"],
+ c_outdir=out_dir / "c-gcc-pie",
+ cc_outdir=out_dir / "cc-gcc-pie",
+ )
if system == "Darwin" and Path("/usr/bin/clang").exists():
_build_variant(
"Apple Clang",
@@ -227,6 +247,12 @@ def build_test_objects(out_dir: Path | None = None) -> Path:
# CUDA (platform-independent, uses nvcc)
if shutil.which("nvcc"):
- _build_objects(SOURCES_CUDA, out_dir / "cuda", ext_glob="*.cu", extra_cflags=[])
+ _build_objects(
+ SOURCES_CUDA,
+ out_dir / "cuda",
+ ext_glob="*.cu",
+ extra_cflags=[],
+ extra_cuda_cflags=_extra_cuda_cflags(),
+ )
return out_dir