(tvm-ffi) branch main updated: [OrcJIT] Arena JITLinkMemoryManager with GOTPCRELX fix (Linux) (#527)

tqchen Sun, 26 Apr 2026 12:32:46 -0700

This is an automated email from the ASF dual-hosted git repository.

tqchen pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm-ffi.git



The following commit(s) were added to refs/heads/main by this push:
     new 3c35034  [OrcJIT] Arena JITLinkMemoryManager with GOTPCRELX fix (Linux) (#527)
3c35034 is described below

commit 3c35034fd1026011736e19a4e0e1ed0f22058c42
Author: Yaxing Cai <[email protected]>
AuthorDate: Mon Apr 27 03:31:02 2026 +0800

    [OrcJIT] Arena JITLinkMemoryManager with GOTPCRELX fix (Linux) (#527)
    
    ## Summary
    
    Adds an arena-based `JITLinkMemoryManager` that eliminates
    scattered-mmap relocation overflow in LLVM ORC JIT under ASLR / VA
    pressure ([LLVM
    #173269](https://github.com/llvm/llvm-project/issues/173269)), plus a
    workaround for an x86_64 JITLink GOTPCRELX relaxation bug. Linux only;
    other platforms fall back to the default `InProcessMemoryManager`.
    
    ### Arena memory manager (`orcjit_memory_manager.{h,cc}`)
    
    - Pre-reserves one contiguous VA region at session startup via `mmap`
    with `PROT_NONE` and `MAP_NORESERVE`, then bump-allocates from it,
    guaranteeing all JIT allocations stay within PC-relative range (±2 GB
    x86_64, ±4 GB AArch64).
    - Default capacity: 4 GB (x86_64) / 8 GB (AArch64). On reservation
    failure (RLIMIT_AS, containers) the constructor halves down to a 256 MB
    floor.
    - **Dual-pool split.** Arena is partitioned at a 2 MB-aligned midpoint
    into a non-exec pool (`r--`/`rw-`) and an exec pool (`r-x`). Exec
    segments pack tightly into whole 2 MB pages for contiguous r-x layout
    and TLB-friendly huge-page promotion. Both pools are capped so
    cross-pool Delta32 fixups always resolve inside ±2 GB.
    - **Slab commit with THP.** Physical pages are committed in 2 MB slabs,
    matching Linux huge page size. `madvise(MADV_HUGEPAGE)` on the full
    reservation lets the kernel promote fully-faulted slabs to single TLB
    entries.
    - **Overflow sections.** Known large absolute-only sections
    (`.nv_fatbin`) are routed to separate `mmap()` allocations outside the
    arena. Guarded by a two-phase check: name-based candidate selection,
    then edge validation that disqualifies any section targeted by a
    PC-relative reference.
    - **Segment-lifetime handling.** `Finalize`-lifetime pages are freed at
    the end of `finalize()`; `Standard`-lifetime pages remain until
    `deallocate()`. Free list coalesces adjacent blocks for reuse.
    - Decommit is deliberately a no-op: `ELFNixPlatform` deinitializers can
    still reference freed allocations during teardown. Physical pages return
    to the free list instead; all memory is reclaimed by `munmap` in the
    arena destructor.
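
    The bullets above describe bump allocation capped by a per-pool limit
    and a free list that coalesces adjacent blocks. As a rough illustration
    only, here is a simplified Python model of one pool. It is not the C++
    implementation: `PoolModel` and its method names are hypothetical, and
    locking, slab commit, and page protection are left out.

```python
import bisect


class PoolModel:
    """Toy model of one arena pool: best-fit free list plus a bump pointer."""

    def __init__(self, start, limit):
        self.bump = start   # next never-used offset in this pool
        self.limit = limit  # one past the last usable offset
        self.free = []      # [offset, size] blocks, sorted by offset

    def allocate(self, size):
        # Best fit: choose the free block that wastes the fewest bytes.
        best = None
        for i, (_, blk_size) in enumerate(self.free):
            if blk_size >= size and (best is None or blk_size < self.free[best][1]):
                best = i
        if best is not None:
            offset, blk_size = self.free[best]
            if blk_size == size:
                del self.free[best]
            else:
                # Hand out the front of the block and keep the remainder.
                self.free[best] = [offset + size, blk_size - size]
            return offset
        # No reusable block: bump allocate, respecting the pool limit.
        if self.bump + size > self.limit:
            raise MemoryError("pool exhausted")
        offset = self.bump
        self.bump += size
        return offset

    def release(self, offset, size):
        # Insert in offset order, then merge with adjacent neighbours.
        i = bisect.bisect_left(self.free, [offset, 0])
        self.free.insert(i, [offset, size])
        if i + 1 < len(self.free) and offset + size == self.free[i + 1][0]:
            self.free[i][1] += self.free[i + 1][1]
            del self.free[i + 1]
        if i > 0 and self.free[i - 1][0] + self.free[i - 1][1] == self.free[i][0]:
            self.free[i - 1][1] += self.free[i][1]
            del self.free[i]
```

    Releasing three neighbouring blocks in any order leaves a single merged
    block in this model, so a later request for the combined size can reuse
    it instead of advancing the bump pointer.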
    
    ### GOTPCRELX fix plugin (`orcjit_session.cc`)
    
    - Works around LLVM JITLink's `optimizeGOTAndStubAccesses()` relaxing
    `call *foo@GOTPCREL(%rip)` → `addr32 call foo` but tagging the edge as
    absolute `Pointer32`. On non-PIE executables with symbols in the low 4
    GB, this produces a garbage displacement → SIGSEGV during ORC-runtime
    teardown.
    - `GOTPCRELXFixPlugin` runs as a `PreFixupPass` after relaxation and
    either converts to `BranchPCRel32` when the displacement fits, or
    reverts the relaxation (restores `ff 15`/`ff 25` opcodes, retargets the
    edge to the GOT entry with `PCRel32`).
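
    The plugin's choice between the two outcomes can be pictured as a signed
    32-bit range check. The sketch below is only an illustration:
    `fits_rel32`, `choose_fix`, and the returned labels are hypothetical
    names, the displacement is assumed to be measured from the address of
    the next instruction, and the opcode rewriting done by the real plugin
    is omitted.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def fits_rel32(next_ip: int, target: int) -> bool:
    """Return True if target is reachable with a signed 32-bit displacement."""
    disp = target - next_ip
    return INT32_MIN <= disp <= INT32_MAX


def choose_fix(next_ip: int, target: int) -> str:
    """Pick a label for the edge: direct PC-relative branch or GOT revert."""
    if fits_rel32(next_ip, target):
        return "BranchPCRel32"
    # Out of range: the relaxation cannot be kept, go back through the GOT.
    return "revert-to-GOT"
```

    With JIT code high in the address space and a target in the low 4 GB of
    a non-PIE executable, the displacement is far outside the 32-bit range,
    which corresponds to the revert case described above.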
    
    ### Configuration
    
    `ExecutionSession(arena_size=...)` / `arena_size_bytes` C++ arg: `0` =
    arch default, `>0` = custom size, `<0` = disable arena. Linux-only;
    ignored on macOS/Windows where the arena is compiled out.
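
    To make the `arena_size` convention and the halving fallback concrete,
    here is a small Python model. It does not call the real
    `ExecutionSession`; `requested_capacity`, `reserve_with_fallback`, and
    the `try_reserve` callback are hypothetical stand-ins. The default sizes
    and the 256 MB floor are taken from the summary above.

```python
MB = 1 << 20
GB = 1 << 30

# Defaults as stated in the commit summary.
ARCH_DEFAULT = {"x86_64": 4 * GB, "aarch64": 8 * GB}
MIN_ARENA = 256 * MB


def requested_capacity(arena_size, arch):
    """Map the arena_size argument to a byte count, or None if disabled."""
    if arena_size < 0:
        return None                 # <0 disables the arena
    if arena_size == 0:
        return ARCH_DEFAULT[arch]   # 0 selects the arch default
    return arena_size               # >0 is a custom size


def reserve_with_fallback(capacity, try_reserve):
    """Halve the request on failure, never going below the floor."""
    # Explicit small arenas (e.g. 16 MB in tests) lower the floor.
    floor = min(capacity, MIN_ARENA)
    cap = capacity
    while cap >= floor:
        if try_reserve(cap):
            return cap
        cap //= 2
    raise MemoryError("could not reserve at least %d MB" % (floor // MB))
```

    In this model a 4 GB request under a limit that only allows 1 GB settles
    on 1 GB after two halvings, while a 16 MB request either succeeds as-is
    or fails outright.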
    
    ### Tests (`tests/test_arena.py`)
    
    8 arena tests across C/C++/GCC/PIE variants:
    
    - `test_arena_colocation` — objects stay within a small window.
    - `test_arena_keeps_objects_close` — scatter baseline under VA blocker
    with arena enabled.
    - `test_arena_hidden_symbol_with_blocker` — ADRP/PC32 cross-object calls
    resolve under VA pressure.
    - `test_large_data_section` — 4 MB `.nv_fatbin` loads inside arena when
    references are absolute.
    - `test_overflow_section_outside_arena` — `.nv_fatbin` routed to
    separate mmap, confirmed via address gap.
    - `test_dso_handle_relocation_after_failed_materialization` —
    `__dso_handle` resolves after prior sessions leaked slabs.
    - `test_dso_handle_delta32_with_arena` / `_overflow_without_arena` —
    `-fpie` GCC objects under 3 GB VA blocker: with arena → passes; without
    arena → Delta32 overflow.
    
    All tests use a 16 MB arena and 256 MB–3 GB VA blockers, safe for CI.
    
    ## Test plan
    
    - [x] All orcjit tests pass locally on Linux x86_64 and aarch64
    - [ ] CI green on Linux x86_64, Linux aarch64, macOS arm64, Windows
    AMD64
    - [x] Non-Linux platforms unaffected (arena compiled out under `#ifdef
    __linux__`)
    
    ---------
    
    Co-authored-by: Yaxing Cai <[email protected]>
---
 addons/tvm_ffi_orcjit/CMakeLists.txt               |   5 +-
 .../python/tvm_ffi_orcjit/session.py               |  10 +-
 addons/tvm_ffi_orcjit/src/ffi/orcjit_dylib.cc      |   4 +-
 .../src/ffi/orcjit_memory_manager.cc               | 698 +++++++++++++++++++++
 .../tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.h | 229 +++++++
 addons/tvm_ffi_orcjit/src/ffi/orcjit_session.cc    | 204 +++++-
 addons/tvm_ffi_orcjit/src/ffi/orcjit_session.h     |  11 +-
 .../tvm_ffi_orcjit/tests/sources/c/fake_fatbin.c   |  64 ++
 addons/tvm_ffi_orcjit/tests/sources/c/test_addr.c  |  33 +
 .../tests/sources/c/test_hidden_caller.c           |  61 ++
 .../tests/sources/c/test_hidden_helper.c           |  53 ++
 .../tvm_ffi_orcjit/tests/sources/cc/fake_fatbin.cc |  64 ++
 .../tvm_ffi_orcjit/tests/sources/cc/test_addr.cc   |  26 +
 .../tests/sources/cc/test_hidden_caller.cc         |  52 ++
 .../tests/sources/cc/test_hidden_helper.cc         |  44 ++
 addons/tvm_ffi_orcjit/tests/test_arena.py          | 674 ++++++++++++++++++++
 addons/tvm_ffi_orcjit/tests/utils.py               |  28 +-
 17 files changed, 2235 insertions(+), 25 deletions(-)

diff --git a/addons/tvm_ffi_orcjit/CMakeLists.txt b/addons/tvm_ffi_orcjit/CMakeLists.txt
index 5281238..9a86bac 100644
--- a/addons/tvm_ffi_orcjit/CMakeLists.txt
+++ b/addons/tvm_ffi_orcjit/CMakeLists.txt
@@ -37,7 +37,10 @@ execute_process(
 find_package(tvm_ffi CONFIG REQUIRED)
 
 # ---- Build shared library ----
-add_library(tvm_ffi_orcjit SHARED src/ffi/orcjit_session.cc src/ffi/orcjit_dylib.cc)
+add_library(
+  tvm_ffi_orcjit SHARED src/ffi/orcjit_session.cc src/ffi/orcjit_dylib.cc
+                        src/ffi/orcjit_memory_manager.cc
+)
 set_target_properties(
   tvm_ffi_orcjit PROPERTIES CXX_VISIBILITY_PRESET hidden VISIBILITY_INLINES_HIDDEN ON
 )
diff --git a/addons/tvm_ffi_orcjit/python/tvm_ffi_orcjit/session.py b/addons/tvm_ffi_orcjit/python/tvm_ffi_orcjit/session.py
index 66fb1e4..02fccaf 100644
--- a/addons/tvm_ffi_orcjit/python/tvm_ffi_orcjit/session.py
+++ b/addons/tvm_ffi_orcjit/python/tvm_ffi_orcjit/session.py
@@ -60,19 +60,25 @@ class ExecutionSession(Object):
 
     """
 
-    def __init__(self, orc_rt_path: str | None = None) -> None:
+    def __init__(self, orc_rt_path: str | None = None, arena_size: int = 0) -> None:
         """Initialize ExecutionSession.
 
         Args:
             orc_rt_path: Optional path to the liborc_rt library. If not provided,
                         it will be automatically discovered using clang.
+            arena_size: Arena size in bytes for the JIT memory manager.
+                        Linux only — ignored on macOS and Windows, where the
+                        arena is compiled out.
+                        0 = arch default (4GB x86_64, 8GB AArch64; falls back to
+                        smaller sizes under RLIMIT_AS), >0 = custom size,
+                        <0 = disable arena.
 
         """
         if orc_rt_path is None:
             orc_rt_path = _find_orc_rt_library()
             if orc_rt_path is None:
                 orc_rt_path = ""
-        self.__init_handle_by_constructor__(_ffi_api.ExecutionSession, orc_rt_path)  # type: ignore
+        self.__init_handle_by_constructor__(_ffi_api.ExecutionSession, orc_rt_path, arena_size)  # type: ignore
 
     def create_library(self, name: str = "") -> DynamicLibrary:
         """Create a new dynamic library associated with this execution session.
diff --git a/addons/tvm_ffi_orcjit/src/ffi/orcjit_dylib.cc b/addons/tvm_ffi_orcjit/src/ffi/orcjit_dylib.cc
index 82c5290..11a6d52 100644
--- a/addons/tvm_ffi_orcjit/src/ffi/orcjit_dylib.cc
+++ b/addons/tvm_ffi_orcjit/src/ffi/orcjit_dylib.cc
@@ -188,7 +188,9 @@ static void RegisterOrcJITFunctions() {
 
   refl::GlobalDef()
       .def("orcjit.ExecutionSession",
-           [](const std::string& orc_rt_path) { return ORCJITExecutionSession(orc_rt_path); })
+           [](const std::string& orc_rt_path, int64_t arena_size_bytes) {
+             return ORCJITExecutionSession(orc_rt_path, arena_size_bytes);
+           })
       .def("orcjit.ExecutionSessionCreateDynamicLibrary",
            [](const ORCJITExecutionSession& session, const String& name) -> Module {
              return session->CreateDynamicLibrary(name);
diff --git a/addons/tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.cc b/addons/tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.cc
new file mode 100644
index 0000000..26be4b8
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.cc
@@ -0,0 +1,698 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * \file orcjit_memory_manager.cc
+ * \brief Arena-based JITLinkMemoryManager implementation.
+ *
+ * Follows the InProcessMemoryManager pattern from LLVM but replaces
+ * per-object mmap with bump allocation from a pre-reserved arena.
+ * Pages are committed in 2 MB slabs to enable Transparent Huge Page
+ * (THP) promotion — see the class docstring in orcjit_memory_manager.h.
+ */
+#include "orcjit_memory_manager.h"
+
+#ifdef __linux__
+
+#include <llvm/ADT/DenseSet.h>
+#include <llvm/ADT/SmallVector.h>
+#include <llvm/ExecutionEngine/JITLink/JITLink.h>
+#include <llvm/ExecutionEngine/JITLink/aarch64.h>
+#include <llvm/ExecutionEngine/JITLink/x86_64.h>
+#include <llvm/ExecutionEngine/Orc/Shared/AllocationActions.h>
+#include <llvm/Support/Alignment.h>
+#include <llvm/Support/FormatVariadic.h>
+#include <llvm/Support/Memory.h>
+#include <sys/mman.h>
+
+#include <algorithm>
+#include <cerrno>
+#include <cstdio>
+#include <cstring>
+
+namespace tvm {
+namespace ffi {
+namespace orcjit {
+
+using namespace llvm;
+using namespace llvm::jitlink;
+using namespace llvm::orc;
+
+// ── Overflow section edge classification ───────────────────────────
+//
+// Conservative whitelist: only known absolute relocation kinds return true.
+// Unknown or future edge kinds default to PC-relative → sections stay in
+// the arena (safe: never breaks relocations, just forgoes the overflow
+// optimization for unknown kinds).
+
+namespace {
+
+bool isAbsoluteEdge(const Triple& TT, Edge::Kind K) {
+  if (K < Edge::FirstRelocation) return true;  // KeepAlive, Invalid — not a relocation constraint
+  if (TT.isAArch64()) {
+    using namespace llvm::jitlink::aarch64;
+    switch (K) {
+      case Pointer64:
+      case Pointer32:
+      case Pointer64Authenticated:
+      case MoveWide16:
+        return true;
+      default:
+        return false;
+    }
+  }
+  if (TT.isX86()) {
+    using namespace llvm::jitlink::x86_64;
+    switch (K) {
+      case Pointer64:
+      case Pointer32:
+      case Pointer32Signed:
+      case Pointer16:
+      case Pointer8:
+      case Size64:
+      case Size32:
+        return true;
+      default:
+        return false;
+    }
+  }
+  return false;  // Unknown arch — treat as PC-relative (safe)
+}
+
+}  // namespace
+
+// ── Platform abstraction ────────────────────────────────────────────
+
+void* ArenaJITLinkMemoryManager::reserveVA(size_t size) {
+  void* p = ::mmap(nullptr, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
+  if (p == MAP_FAILED) return nullptr;
+  return p;
+}
+
+void ArenaJITLinkMemoryManager::releaseVA(void* addr, size_t size) {
+  int rc = ::munmap(addr, size);
+  assert(rc == 0 && "munmap failed in arena destructor");
+  (void)rc;
+}
+
+Error ArenaJITLinkMemoryManager::commitPages(void* addr, size_t size) {
+  if (size == 0) return Error::success();
+  // Commit at slab (2 MB) granularity for THP promotion.
+  size_t offset = static_cast<char*>(addr) - arena_base_;
+  size_t first_slab = offset / kSlabSize;
+  size_t last_slab = (offset + size - 1) / kSlabSize;
+
+  for (size_t i = first_slab; i <= last_slab; ++i) {
+    if (slab_committed_[i].load(std::memory_order_acquire) != 0) continue;
+    size_t slab_offset = i * kSlabSize;
+    size_t slab_len = std::min(kSlabSize, arena_capacity_ - slab_offset);
+    // mprotect is idempotent, so a concurrent racer calling it on the same slab
+    // is harmless.  Only flip the flag after success — otherwise a failed commit
+    // followed by freeRegion() would leave slab_committed_[i] == 1, causing a
+    // later allocation to skip mprotect and write into PROT_NONE memory.
+    if (::mprotect(arena_base_ + slab_offset, slab_len, PROT_READ | PROT_WRITE) != 0) {
+      return make_error<StringError>(
+          "ArenaJITLinkMemoryManager: mprotect(RW) failed for slab at offset " +
+              formatv("{0:x}", slab_offset) + ": " + std::strerror(errno),
+          inconvertibleErrorCode());
+    }
+    slab_committed_[i].store(1, std::memory_order_release);
+  }
+  return Error::success();
+}
+
+void ArenaJITLinkMemoryManager::decommitPages(void* addr, size_t size) {
+  // Intentionally a no-op for arena pages.  The ORC runtime may still reference
+  // deallocated JIT memory during session teardown (e.g., ELFNixPlatform
+  // deinitializers run after some allocations are freed).  Decommitting
+  // (MADV_DONTNEED or mprotect PROT_NONE) would cause segfaults or illegal
+  // instructions during shutdown.
+  //
+  // Physical pages stay committed but are returned to the free list for reuse.
+  // The arena destructor releases all VA and physical memory via munmap.
+  (void)addr;
+  (void)size;
+}
+
+Error ArenaJITLinkMemoryManager::protectPages(void* addr, size_t size, MemProt Prot) {
+  int prot = PROT_NONE;
+  if ((Prot & MemProt::Read) != MemProt::None) prot |= PROT_READ;
+  if ((Prot & MemProt::Write) != MemProt::None) prot |= PROT_WRITE;
+  if ((Prot & MemProt::Exec) != MemProt::None) prot |= PROT_EXEC;
+  if (::mprotect(addr, size, prot) != 0) {
+    return make_error<StringError>("ArenaJITLinkMemoryManager: mprotect failed at " +
+                                       formatv("{0:x}", addr) + " size " + formatv("{0:x}", size) +
+                                       ": " + std::strerror(errno),
+                                   inconvertibleErrorCode());
+  }
+  if ((Prot & MemProt::Exec) != MemProt::None) {
+    sys::Memory::InvalidateInstructionCache(addr, size);
+  }
+  return Error::success();
+}
+
+// ── ArenaInFlightAlloc ──────────────────────────────────────────────
+
+class ArenaJITLinkMemoryManager::ArenaInFlightAlloc : public JITLinkMemoryManager::InFlightAlloc {
+ public:
+  // A contiguous region within one pool: [offset, offset + standard_size + finalize_size).
+  // Standard-lifetime bytes come first; Finalize-lifetime bytes follow and are freed
+  // at the end of finalize().  Any field may be 0 to indicate no allocation from
+  // that pool on this call.
+  struct PoolRegion {
+    size_t offset;
+    size_t standard_size;
+    size_t finalize_size;
+  };
+
+  ArenaInFlightAlloc(ArenaJITLinkMemoryManager& MM, LinkGraph& G, BasicLayout BL,
+                     PoolRegion non_exec, PoolRegion exec,
+                     std::vector<OverflowBlock> overflow_blocks)
+      : MM(MM),
+        G(&G),
+        BL(std::move(BL)),
+        non_exec_(non_exec),
+        exec_(exec),
+        overflow_blocks_(std::move(overflow_blocks)) {}
+
+  ~ArenaInFlightAlloc() override {
+    assert(!G && "ArenaInFlightAlloc destroyed without finalize or abandon");
+  }
+
+  void finalize(OnFinalizedFunction OnFinalized) override {
+    // Apply target protections for each arena segment.
+    if (auto Err = applyProtections()) {
+      OnFinalized(std::move(Err));
+      return;
+    }
+
+    // Apply target protections for overflow blocks.
+    for (auto& ob : overflow_blocks_) {
+      if (auto Err = MM.protectPages(ob.addr, ob.size, ob.prot)) {
+        OnFinalized(std::move(Err));
+        return;
+      }
+    }
+
+    // Run finalization actions (e.g., register EH frames).
+    auto DeallocActions = shared::runFinalizeActions(BL.graphAllocActions());
+    if (!DeallocActions) {
+      OnFinalized(DeallocActions.takeError());
+      return;
+    }
+
+    // Decommit finalize-lifetime pages in each pool — they're no longer needed.
+    for (auto& R : {non_exec_, exec_}) {
+      if (R.finalize_size > 0) {
+        MM.decommitPages(MM.arena_base_ + R.offset + R.standard_size, R.finalize_size);
+        MM.freeRegion(R.offset + R.standard_size, R.finalize_size);
+      }
+    }
+
+#ifndef NDEBUG
+    G = nullptr;
+#endif
+
+    // Create finalized allocation handle.  LLVM's FinalizedAlloc stores an
+    // opaque ExecutorAddr (integer), so we must use raw new here.  Ownership
+    // transfers to deallocate(), which LLVM guarantees is called for every
+    // finalized allocation.
+    auto* FA = new FinalizedAllocInfo{
+        non_exec_.offset,    non_exec_.standard_size,    exec_.offset,
+        exec_.standard_size, std::move(*DeallocActions), std::move(overflow_blocks_)};
+    OnFinalized(FinalizedAlloc(ExecutorAddr::fromPtr(FA)));
+  }
+
+  void abandon(OnAbandonedFunction OnAbandoned) override {
+    // Decommit and return each pool's full region to the appropriate free list.
+    for (auto& R : {non_exec_, exec_}) {
+      size_t total = R.standard_size + R.finalize_size;
+      if (total > 0) {
+        MM.decommitPages(MM.arena_base_ + R.offset, total);
+        MM.freeRegion(R.offset, total);
+      }
+    }
+
+    // Release overflow blocks.
+    for (auto& ob : overflow_blocks_) {
+      ::munmap(ob.addr, ob.size);
+    }
+
+#ifndef NDEBUG
+    G = nullptr;
+#endif
+
+    OnAbandoned(Error::success());
+  }
+
+ private:
+  Error applyProtections() {
+    for (auto& KV : BL.segments()) {
+      const auto& AG = KV.first;
+      auto& Seg = KV.second;
+
+      auto SegSize = alignTo(Seg.ContentSize + Seg.ZeroFillSize, MM.page_size_);
+      if (auto Err = MM.protectPages(Seg.WorkingMem, SegSize, AG.getMemProt())) return Err;
+    }
+    return Error::success();
+  }
+
+  ArenaJITLinkMemoryManager& MM;
+  LinkGraph* G;
+  BasicLayout BL;
+  PoolRegion non_exec_;
+  PoolRegion exec_;
+  std::vector<OverflowBlock> overflow_blocks_;
+};
+
+// ── ArenaJITLinkMemoryManager ───────────────────────────────────────
+
+ArenaJITLinkMemoryManager::ArenaJITLinkMemoryManager(size_t page_size, size_t arena_capacity)
+    : arena_base_(nullptr),
+      arena_capacity_(arena_capacity),
+      page_size_(page_size),
+      midpoint_(0),
+      exec_bump_limit_(0),
+      non_exec_bump_(0),
+      exec_bump_(0) {
+  // Try requested capacity, halve on failure down to a minimum floor.
+  // The floor is the smaller of kMinArenaCapacity and the requested size,
+  // so explicit small arenas (e.g. 16 MB for tests) are honoured.
+  // mmap(PROT_NONE | MAP_NORESERVE) can still fail under RLIMIT_AS or
+  // extreme VA fragmentation.
+  size_t floor = std::min(arena_capacity_, kMinArenaCapacity);
+  size_t cap = arena_capacity_;
+  while (cap >= floor) {
+    arena_base_ = static_cast<char*>(reserveVA(cap));
+    if (arena_base_) {
+      arena_capacity_ = cap;
+      // Partition the arena into two pools at a 2 MB-aligned midpoint.
+      // The exec pool starts at midpoint_, which is therefore on a 2 MB
+      // boundary — r-x segments pack into a minimum number of 2 MB pages.
+      //
+      // Constraint: cross-pool displacements (e.g. .text → .rodata via
+      // ADRP+ADD on aarch64) must fit in ±kPCRelReach.  The farthest pair
+      // of bytes is (end of exec, start of non-exec), separated by at most
+      // `exec_bump_limit_`, so we cap the exec pool's upper bound at
+      // kPCRelReach even when the VA reservation is larger.
+      exec_bump_limit_ = std::min(cap, kPCRelReach);
+      size_t raw_midpoint = static_cast<size_t>(exec_bump_limit_ * kDefaultNonExecFraction);
+      midpoint_ = (raw_midpoint / kSlabSize) * kSlabSize;
+      if (midpoint_ == 0) midpoint_ = kSlabSize;
+      if (midpoint_ >= exec_bump_limit_) midpoint_ = exec_bump_limit_ - kSlabSize;
+      non_exec_bump_ = 0;
+      exec_bump_ = midpoint_;
+      // Initialize slab commit tracking.  make_unique<T[]>(n) value-initializes
+      // the array to zero in C++17.
+      num_slabs_ = (cap + kSlabSize - 1) / kSlabSize;
+      slab_committed_ = std::make_unique<std::atomic<uint8_t>[]>(num_slabs_);
+      // Hint THP promotion for the entire arena.  Intentionally unchecked —
+      // MADV_HUGEPAGE is advisory and may fail if THP is disabled system-wide.
+      (void)::madvise(arena_base_, cap, MADV_HUGEPAGE);
+      return;
+    }
+    cap /= 2;
+  }
+  llvm::report_fatal_error("ArenaJITLinkMemoryManager: failed to reserve at least " +
+                           Twine(floor / (1024 * 1024)) + " MB of virtual address space");
+}
+
+ArenaJITLinkMemoryManager::~ArenaJITLinkMemoryManager() {
+  if (arena_base_) {
+    releaseVA(arena_base_, arena_capacity_);
+  }
+}
+
+Expected<size_t> ArenaJITLinkMemoryManager::bumpAllocate(size_t size, bool is_exec) {
+  std::lock_guard<std::mutex> Lock(mu_);
+
+  auto& free_list = is_exec ? free_list_exec_ : free_list_non_exec_;
+  auto& bump = is_exec ? exec_bump_ : non_exec_bump_;
+  size_t limit = is_exec ? exec_bump_limit_ : midpoint_;
+
+  // Try free list first (best-fit).  O(n) scan — acceptable for the expected
+  // workload of tens of JIT allocations, not thousands.
+  size_t best_idx = free_list.size();
+  size_t best_waste = std::numeric_limits<size_t>::max();
+  for (size_t i = 0; i < free_list.size(); ++i) {
+    if (free_list[i].size >= size && free_list[i].size - size < best_waste) {
+      best_idx = i;
+      best_waste = free_list[i].size - size;
+      if (best_waste == 0) break;
+    }
+  }
+
+  if (best_idx < free_list.size()) {
+    size_t offset = free_list[best_idx].offset;
+    if (free_list[best_idx].size == size) {
+      free_list.erase(free_list.begin() + best_idx);
+    } else {
+      free_list[best_idx].offset += size;
+      free_list[best_idx].size -= size;
+    }
+    return offset;
+  }
+
+  // Bump allocate within the pool's limit.
+  if (bump + size > limit) {
+    return make_error<StringError>(
+        std::string("ArenaJITLinkMemoryManager: ") + (is_exec ? "exec" : "non-exec") +
+            " pool exhausted (used " + formatv("{0:x}", bump).str() + " + requested " +
+            formatv("{0:x}", size).str() + " > limit " + formatv("{0:x}", limit).str() + ")",
+        inconvertibleErrorCode());
+  }
+
+  size_t offset = bump;
+  bump += size;
+  return offset;
+}
+
+void ArenaJITLinkMemoryManager::freeRegion(size_t offset, size_t size) {
+  if (size == 0) return;
+  std::lock_guard<std::mutex> Lock(mu_);
+
+  // Route to the correct pool's free list based on offset.
+  auto& free_list = (offset >= midpoint_) ? free_list_exec_ : free_list_non_exec_;
+
+  // Insert into free list in sorted order.
+  auto it = std::lower_bound(free_list.begin(), free_list.end(), offset,
+                             [](const FreeBlock& fb, size_t off) { return fb.offset < off; });
+  it = free_list.insert(it, FreeBlock{offset, size});
+
+  // Coalesce with next.
+  auto next = it + 1;
+  if (next != free_list.end() && it->offset + it->size == next->offset) {
+    it->size += next->size;
+    free_list.erase(next);
+  }
+
+  // Coalesce with previous.
+  if (it != free_list.begin()) {
+    auto prev = it - 1;
+    if (prev->offset + prev->size == it->offset) {
+      prev->size += it->size;
+      free_list.erase(it);
+    }
+  }
+}
+
+void ArenaJITLinkMemoryManager::allocate(const JITLinkDylib* JD, LinkGraph& G,
+                                         OnAllocatedFunction OnAllocated) {
+  // ── Overflow section classification ──
+  //
+  // Sections matching known overflow names (e.g. .nv_fatbin — large GPU
+  // device blobs referenced only by absolute relocations) are allocated
+  // outside the arena via separate mmap(), keeping the arena compact for
+  // code + small rodata.
+  //
+  // Two-phase check:
+  //   Phase 1 — Name-based candidate selection (.nv_fatbin).
+  //   Phase 2 — Edge validation: any PC-relative cross-section edge
+  //             targeting a candidate section disqualifies it (the
+  //             section stays in the arena).  This handles cases where
+  //             the compiler generates ADRP/RIP-relative refs even for
+  //             data sections.
+  //
+  // Validated candidates are temporarily set to NoAlloc so BasicLayout
+  // skips them, then immediately restored before returning.  By the time
+  // JITLink's fixUpBlocks runs, sections are back to Standard — avoiding
+  // the debug assert that prohibits edges from allocated sections to
+  // NoAlloc sections.
+  DenseSet<Section*> overflow_candidates;
+  for (auto& Sec : G.sections()) {
+    if (Sec.getMemLifetime() == MemLifetime::NoAlloc) continue;
+    StringRef Name = Sec.getName();
+    if (Name.starts_with(".nv_fatbin")) {
+      overflow_candidates.insert(&Sec);
+    }
+  }
+
+  // Phase 2: edge validation — disqualify candidates with incoming PC-relative edges.
+  if (!overflow_candidates.empty()) {
+    const auto& TT = G.getTargetTriple();
+    for (auto& Sec : G.sections()) {
+      for (auto* B : Sec.blocks()) {
+        for (auto& E : B->edges()) {
+          if (!E.isRelocation()) continue;
+          if (isAbsoluteEdge(TT, E.getKind())) continue;
+          // PC-relative edge — if it targets a candidate, disqualify.
+          if (!E.getTarget().isDefined()) continue;
+          auto* TargetSec = &E.getTarget().getBlock().getSection();
+          overflow_candidates.erase(TargetSec);
+        }
+      }
+      if (overflow_candidates.empty()) break;
+    }
+  }
+
+  // Apply: temporarily hide validated overflow sections from BasicLayout.
+  SmallVector<std::pair<Section*, MemLifetime>, 4> overflow_sections;
+  for (auto* Sec : overflow_candidates) {
+    overflow_sections.push_back({Sec, Sec->getMemLifetime()});
+    Sec->setMemLifetime(MemLifetime::NoAlloc);
+  }
+
+  BasicLayout BL(G);
+
+  // Restore overflow sections to their original lifetime immediately.
+  // BasicLayout has already captured its segment list; subsequent LLVM
+  // passes (fixUpBlocks) will see the sections as normal Standard sections.
+  for (auto& [Sec, OrigLifetime] : overflow_sections) {
+    Sec->setMemLifetime(OrigLifetime);
+  }
+
+  // Compute total sizes grouped by lifetime.
+  auto SegsSizes = BL.getContiguousPageBasedLayoutSizes(page_size_);
+  if (!SegsSizes) {
+    OnAllocated(SegsSizes.takeError());
+    return;
+  }
+
+  if (SegsSizes->total() > std::numeric_limits<size_t>::max()) {
+    OnAllocated(make_error<llvm::jitlink::JITLinkError>(
+        "Total requested size " + formatv("{0:x}", SegsSizes->total()) + " for graph " +
+        G.getName() + " exceeds address space"));
+    return;
+  }
+
+  auto TotalSize = static_cast<size_t>(SegsSizes->total());
+  if (TotalSize == 0 && overflow_sections.empty()) {
+    // Empty graph — return a no-op allocation.
+    OnAllocated(std::make_unique<ArenaInFlightAlloc>(
+        *this, G, std::move(BL), ArenaInFlightAlloc::PoolRegion{0, 0, 0},
+        ArenaInFlightAlloc::PoolRegion{midpoint_, 0, 0}, std::vector<OverflowBlock>{}));
+    return;
+  }
+
+  // ── Dual-pool split ──
+  //
+  // Partition each segment into one of four buckets based on (Prot, Lifetime):
+  //   non-exec × Standard / Finalize   →  non-exec pool (below midpoint_)
+  //   exec     × Standard / Finalize   →  exec pool     (at/above midpoint_)
+  //
+  // Within each pool, Standard segments come first and Finalize segments
+  // second, so the Finalize tail of each pool can be freed after finalize().
+  size_t ne_std_size = 0, ne_fin_size = 0;
+  size_t e_std_size = 0, e_fin_size = 0;
+  for (auto& KV : BL.segments()) {
+    auto& AG = KV.first;
+    auto& Seg = KV.second;
+    auto SegSize = alignTo(Seg.ContentSize + Seg.ZeroFillSize, page_size_);
+    bool is_exec = (AG.getMemProt() & MemProt::Exec) != MemProt::None;
+    bool is_finalize = AG.getMemLifetime() == MemLifetime::Finalize;
+    if (is_exec) {
+      (is_finalize ? e_fin_size : e_std_size) += SegSize;
+    } else {
+      (is_finalize ? ne_fin_size : ne_std_size) += SegSize;
+    }
+  }
+  size_t ne_total = ne_std_size + ne_fin_size;
+  size_t e_total = e_std_size + e_fin_size;
+
+  ArenaInFlightAlloc::PoolRegion ne_region{0, 0, 0};
+  ArenaInFlightAlloc::PoolRegion e_region{midpoint_, 0, 0};
+
+  auto allocPool = [&](size_t req, bool is_exec) -> Expected<size_t> {
+    if (req == 0) return size_t{0};
+    auto off = bumpAllocate(req, is_exec);
+    if (!off) return off.takeError();
+    if (auto Err = commitPages(arena_base_ + *off, req)) {
+      freeRegion(*off, req);
+      return std::move(Err);
+    }
+    std::memset(arena_base_ + *off, 0, req);
+    return *off;
+  };
+
+  if (ne_total > 0) {
+    auto off = allocPool(ne_total, /*is_exec=*/false);
+    if (!off) {
+      OnAllocated(off.takeError());
+      return;
+    }
+    ne_region = {*off, ne_std_size, ne_fin_size};
+  }
+  if (e_total > 0) {
+    auto off = allocPool(e_total, /*is_exec=*/true);
+    if (!off) {
+      // Unwind non-exec allocation on failure to keep the pools consistent.
+      if (ne_total > 0) {
+        decommitPages(arena_base_ + ne_region.offset, ne_total);
+        freeRegion(ne_region.offset, ne_total);
+      }
+      OnAllocated(off.takeError());
+      return;
+    }
+    e_region = {*off, e_std_size, e_fin_size};
+  }
+
+  // Assign addresses to segments from four cursors.  Standard comes first in
+  // each pool, then Finalize.
+  auto NeStdCursor = ExecutorAddr::fromPtr(arena_base_ + ne_region.offset);
+  auto NeFinCursor = ExecutorAddr::fromPtr(arena_base_ + ne_region.offset + ne_std_size);
+  auto EStdCursor = ExecutorAddr::fromPtr(arena_base_ + e_region.offset);
+  auto EFinCursor = ExecutorAddr::fromPtr(arena_base_ + e_region.offset + e_std_size);
+
+  for (auto& KV : BL.segments()) {
+    auto& AG = KV.first;
+    auto& Seg = KV.second;
+    bool is_exec = (AG.getMemProt() & MemProt::Exec) != MemProt::None;
+    bool is_finalize = AG.getMemLifetime() == MemLifetime::Finalize;
+    auto& Cursor = is_exec ? (is_finalize ? EFinCursor : EStdCursor)
+                           : (is_finalize ? NeFinCursor : NeStdCursor);
+    Seg.WorkingMem = Cursor.toPtr<char*>();
+    Seg.Addr = Cursor;
+    auto SegSize = alignTo(Seg.ContentSize + Seg.ZeroFillSize, page_size_);
+    Cursor += SegSize;
+  }
+
+  // Apply layout — copies content and assigns block addresses for arena segments.
+  if (auto Err = BL.apply()) {
+    // On error: decommit and free both pool regions.
+    if (ne_total > 0) {
+      decommitPages(arena_base_ + ne_region.offset, ne_total);
+      freeRegion(ne_region.offset, ne_total);
+    }
+    if (e_total > 0) {
+      decommitPages(arena_base_ + e_region.offset, e_total);
+      freeRegion(e_region.offset, e_total);
+    }
+    OnAllocated(std::move(Err));
+    return;
+  }
+
+  // ── Allocate overflow sections via mmap() outside the arena ──
+  std::vector<OverflowBlock> overflow_allocs;
+
+  for (auto& [Sec, _] : overflow_sections) {
+    // Compute total size for this section's blocks.
+    size_t total_sec_size = 0;
+    for (auto* B : Sec->blocks()) {
+      total_sec_size = alignTo(total_sec_size, B->getAlignment());
+      total_sec_size += B->getSize();
+    }
+    if (total_sec_size == 0) continue;
+    total_sec_size = alignTo(total_sec_size, page_size_);
+
+    // mmap outside the arena.
+    void* addr =
+        ::mmap(nullptr, total_sec_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    if (addr == MAP_FAILED) {
+      // Clean up prior overflow allocs, free both pool regions, report error.
+      for (auto& ob : overflow_allocs) ::munmap(ob.addr, ob.size);
+      if (ne_total > 0) {
+        decommitPages(arena_base_ + ne_region.offset, ne_total);
+        freeRegion(ne_region.offset, ne_total);
+      }
+      if (e_total > 0) {
+        decommitPages(arena_base_ + e_region.offset, e_total);
+        freeRegion(e_region.offset, e_total);
+      }
+      OnAllocated(
+          make_error<StringError>("ArenaJITLinkMemoryManager: overflow mmap failed for section " +
+                                      Sec->getName() + ": " + std::strerror(errno),
+                                  inconvertibleErrorCode()));
+      return;
+    }
+
+    // Layout blocks within the mmap'd region.
+    char* ptr = static_cast<char*>(addr);
+    for (auto* B : Sec->blocks()) {
+      uint64_t align = B->getAlignment();
+      ptr = reinterpret_cast<char*>(alignTo(reinterpret_cast<uintptr_t>(ptr), align));
+      size_t bsize = B->getSize();
+      // Copy content and redirect block's mutable content pointer.
+      if (!B->isZeroFill()) {
+        auto content = B->getContent();
+        std::memcpy(ptr, content.data(), content.size());
+        B->setMutableContent(MutableArrayRef<char>(ptr, bsize));
+      }
+      // Assign block address (working mem == executor addr for in-process JIT).
+      B->setAddress(ExecutorAddr::fromPtr(ptr));
+      ptr += bsize;
+    }
+
+    overflow_allocs.push_back({addr, total_sec_size, Sec->getMemProt()});
+  }
+
+  OnAllocated(std::make_unique<ArenaInFlightAlloc>(*this, G, std::move(BL), ne_region, e_region,
+                                                   std::move(overflow_allocs)));
+}
+
+void ArenaJITLinkMemoryManager::deallocate(std::vector<FinalizedAlloc> Allocs,
+                                           OnDeallocatedFunction OnDeallocated) {
+  Error DeallocErr = Error::success();
+
+  for (auto& Alloc : Allocs) {
+    // Reclaim ownership of the FinalizedAllocInfo created in finalize().
+    auto* FA = Alloc.release().toPtr<FinalizedAllocInfo*>();
+
+    // Run deallocation actions in reverse order.
+    while (!FA->DeallocActions.empty()) {
+      if (auto Err = FA->DeallocActions.back().runWithSPSRetErrorMerged())
+        DeallocErr = joinErrors(std::move(DeallocErr), std::move(Err));
+      FA->DeallocActions.pop_back();
+    }
+
+    // Decommit and free each pool's Standard region.
+    if (FA->non_exec_standard_size > 0) {
+      decommitPages(arena_base_ + FA->non_exec_offset, FA->non_exec_standard_size);
+      freeRegion(FA->non_exec_offset, FA->non_exec_standard_size);
+    }
+    if (FA->exec_standard_size > 0) {
+      decommitPages(arena_base_ + FA->exec_offset, FA->exec_standard_size);
+      freeRegion(FA->exec_offset, FA->exec_standard_size);
+    }
+
+    // Release overflow blocks.
+    for (auto& ob : FA->overflow_blocks) {
+      ::munmap(ob.addr, ob.size);
+    }
+
+    delete FA;
+  }
+
+  OnDeallocated(std::move(DeallocErr));
+}
+
+}  // namespace orcjit
+}  // namespace ffi
+}  // namespace tvm
+
+#endif  // __linux__
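
The four-cursor address assignment in allocate() above can be sketched in isolation. This is a minimal, LLVM-free illustration under stated assumptions, not the committed implementation: `SegmentReq`, `AlignUp`, and `AssignOffsets` are hypothetical names, plain arena offsets stand in for `ExecutorAddr` values, and each `size` stands in for `ContentSize + ZeroFillSize`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one JITLink segment request.
struct SegmentReq {
  bool is_exec;      // true: exec pool (r-x), false: non-exec pool (r--/rw-)
  bool is_finalize;  // true: Finalize lifetime, false: Standard lifetime
  size_t size;       // content + zero-fill bytes
};

// Round v up to the next multiple of align.
inline size_t AlignUp(size_t v, size_t align) {
  assert(align > 0);
  return (v + align - 1) / align * align;
}

// Returns one arena offset per segment.  Within each pool, Standard segments
// are laid out first and Finalize segments follow them.
std::vector<size_t> AssignOffsets(const std::vector<SegmentReq>& segs, size_t page_size,
                                  size_t ne_base, size_t e_base) {
  // Pass 1: total page-aligned Standard bytes per pool.
  size_t ne_std = 0, e_std = 0;
  for (const auto& s : segs) {
    if (!s.is_finalize) (s.is_exec ? e_std : ne_std) += AlignUp(s.size, page_size);
  }
  // Pass 2: walk four cursors, one per (pool, lifetime) pair.
  size_t ne_std_cur = ne_base, ne_fin_cur = ne_base + ne_std;
  size_t e_std_cur = e_base, e_fin_cur = e_base + e_std;
  std::vector<size_t> offsets;
  offsets.reserve(segs.size());
  for (const auto& s : segs) {
    size_t& cur = s.is_exec ? (s.is_finalize ? e_fin_cur : e_std_cur)
                            : (s.is_finalize ? ne_fin_cur : ne_std_cur);
    offsets.push_back(cur);
    cur += AlignUp(s.size, page_size);
  }
  return offsets;
}
```

Keeping Finalize segments at the tail of each pool region means the Standard prefix can stay committed while the tail is decommitted in one contiguous range once finalization completes.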
diff --git a/addons/tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.h b/addons/tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.h
new file mode 100644
index 0000000..8f83c38
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/src/ffi/orcjit_memory_manager.h
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * \file orcjit_memory_manager.h
+ * \brief Arena-based JITLinkMemoryManager for LLVM ORC JIT.
+ *
+ * Pre-reserves a contiguous virtual address region and bump-allocates
+ * from it, keeping all JIT allocations within range of PC-relative
+ * relocations (±2GB on x86_64, ±4GB on AArch64).
+ *
+ * This eliminates relocation overflow caused by scattered mmap
+ * allocations under ASLR (LLVM issue #173269).
+ *
+ * ## GOTPCRELX relaxation workaround
+ *
+ * The arena triggers a latent bug in LLVM JITLink's
+ * `optimizeGOTAndStubAccesses()` (x86_64.cpp).  That pass relaxes
+ * `call *foo@GOTPCREL(%rip)` (ff 15) → `addr32 call foo` (67 e8) and
+ * sets the edge kind to `Pointer32` (absolute 32-bit address).  However
+ * the `call rel32` instruction is always **PC-relative** — the `67`
+ * prefix is just padding — so the fixup should be PC-relative too
+ * (matching the static linker's `R_X86_64_PC32`).
+ *
+ * The bug is latent because the relaxation only fires when the target
+ * address fits in 32 bits (`isUInt<32>`).  On PIE executables every
+ * resolved symbol is at a high address, so the guard is never true and
+ * the relaxation never runs.  On **non-PIE** executables the PLT
+ * entries for libc functions (malloc, free, …) live near 0x400000, the
+ * guard passes, and the wrong fixup produces a garbage displacement →
+ * SIGSEGV during ORC-runtime teardown.
+ *
+ * `GOTPCRELXFixPlugin` in orcjit_session.cc works around this: a
+ * PreFixupPass that runs *after* `optimizeGOTAndStubAccesses` detects
+ * `Pointer32` edges on `67 e8` / `e9` instructions and either
+ *   (a) converts to `BranchPCRel32` when the PC-relative displacement
+ *       fits in int32, or
+ *   (b) reverts the relaxation entirely — restores the `ff 15` /
+ *       `ff 25` opcode bytes and retargets the edge to the GOT entry
+ *       with `PCRel32` + addend 0.
+ */
+#ifndef TVM_FFI_ORCJIT_ORCJIT_MEMORY_MANAGER_H_
+#define TVM_FFI_ORCJIT_ORCJIT_MEMORY_MANAGER_H_
+
+#include <llvm/ExecutionEngine/JITLink/JITLinkMemoryManager.h>
+#include <llvm/ExecutionEngine/Orc/Shared/MemoryFlags.h>
+
+#include <atomic>
+#include <memory>
+#include <mutex>
+#include <vector>
+
+namespace tvm {
+namespace ffi {
+namespace orcjit {
+
+/*! \brief Arena-based memory manager for JITLink.
+ *
+ * Reserves a large contiguous VA region at construction time using
+ * PROT_NONE (zero physical memory cost).  Each allocate() call
+ * bump-allocates from this region, commits pages as RW, and assigns
+ * addresses.  On finalization, pages receive their target protections.
+ * On deallocation, pages are decommitted and returned to a free list.
+ *
+ * The default arena size (4 GB on x86_64, 8 GB on AArch64) is strictly larger
+ * than the architecture's PC-relative relocation reach, so the arena is never
+ * the bottleneck: JITLink's own relocation overflow checker fires first,
+ * matching dlopen/ld.so failure semantics.  If the
+ * initial reservation fails (RLIMIT_AS, container limits), the
+ * constructor halves the capacity down to kMinArenaCapacity (256 MB).
+ *
+ * ## Slab-based commit with Transparent Huge Page (THP) support
+ *
+ * Arena pages are committed in 2 MB slabs (kSlabSize) rather than
+ * per-allocation.  Each slab is committed exactly once via an atomic
+ * flag (lock-free, no contention with the allocator mutex).
+ *
+ * Benefits:
+ *   - Batches up to 512 page faults into a single sequential mprotect
+ *     per slab, reducing kernel trap overhead.
+ *   - 2 MB matches the Linux huge page size on both x86_64 and AArch64.
+ *     Combined with madvise(MADV_HUGEPAGE) applied at construction, the
+ *     kernel can promote each fully-faulted slab into a single TLB
+ *     entry (replacing 512 x 4 KB entries), reducing TLB misses during
+ *     JIT code execution.
+ *   - Worst-case waste is <2 MB in the last partially-used slab —
+ *     negligible for typical ML workloads.
+ */
+class ArenaJITLinkMemoryManager : public llvm::jitlink::JITLinkMemoryManager {
+ public:
+  // Default arena: strictly larger than the relocation limit so the arena
+  // is never the bottleneck.  JITLink's own overflow check fires first,
+  // matching dlopen/ld.so failure semantics.
+  //
+  // x86_64 PC32: ±2GB  →  4GB default (2× headroom)
+  // AArch64 ADRP: ±4GB →  8GB default (2× headroom)
+  static constexpr size_t kDefaultArenaCapacity_x86_64 = size_t{4} << 30;   // 4 GB
+  static constexpr size_t kDefaultArenaCapacity_AArch64 = size_t{8} << 30;  // 8 GB
+  static constexpr size_t kMinArenaCapacity = size_t{256} << 20;            // 256 MB floor
+  // Slab commit granularity.  Matches Linux huge page size (2 MB) on both
+  // x86_64 and AArch64, enabling THP promotion via madvise(MADV_HUGEPAGE).
+  static constexpr size_t kSlabSize = size_t{2} << 20;  // 2 MB
+  // PC-relative relocation reach (tightest binding fixup).  Cross-pool
+  // references (.text → .rodata, .eh_frame → .text, etc.) must fit within
+  // this signed displacement.  The binding constraint on both x86_64 and
+  // AArch64 is the signed 32-bit Delta32 used in .eh_frame unwind records
+  // (±2 GB), not the wider ADRP+ADD / RIP-rel reach.  The dual-pool allocator
+  // keeps both pools inside kPCRelReach bytes of each other even when the VA
+  // reservation is larger, so cross-pool Delta32 fixups always resolve.
+  static constexpr size_t kPCRelReach = (size_t{1} << 31) - kSlabSize;  // ~2 GB
+
+  // Default fraction of the arena reserved for non-exec segments (r--, rw-).
+  // The remainder holds exec segments (r-x).  The split point (midpoint_) is
+  // placed on a 2 MB-aligned boundary, so the exec pool starts on a 2 MB page
+  // boundary, maximizing r-x page packing.  The 2:1 ratio reflects typical
+  // CUDA binding objects: ~2 parts rodata+data to 1 part text.
+  static constexpr double kDefaultNonExecFraction = 2.0 / 3.0;
+
+  explicit ArenaJITLinkMemoryManager(size_t page_size, size_t arena_capacity);
+  ~ArenaJITLinkMemoryManager() override;
+
+  ArenaJITLinkMemoryManager(const ArenaJITLinkMemoryManager&) = delete;
+  ArenaJITLinkMemoryManager& operator=(const ArenaJITLinkMemoryManager&) = delete;
+  ArenaJITLinkMemoryManager(ArenaJITLinkMemoryManager&&) = delete;
+  ArenaJITLinkMemoryManager& operator=(ArenaJITLinkMemoryManager&&) = delete;
+
+  void allocate(const llvm::jitlink::JITLinkDylib* JD, llvm::jitlink::LinkGraph& G,
+                OnAllocatedFunction OnAllocated) override;
+
+  void deallocate(std::vector<FinalizedAlloc> Allocs, OnDeallocatedFunction OnDeallocated) override;
+
+ private:
+  class ArenaInFlightAlloc;
+
+  /*! \brief A section allocated outside the arena via separate mmap().
+   *
+   *  Sections whose only cross-section references use absolute relocations
+   *  (e.g. .nv_fatbin) are placed here to keep the arena compact. */
+  struct OverflowBlock {
+    void* addr;               // mmap'd base address
+    size_t size;              // mmap'd size (page-aligned)
+    llvm::orc::MemProt prot;  // target protection for finalize
+  };
+
+  /*! \brief Metadata for a finalized allocation, stored via FinalizedAlloc handle.
+   *
+   *  The arena is split into two pools at midpoint_.  Each allocate() call may
+   *  consume a region from either or both pools.  Standard-lifetime pages remain
+   *  committed after finalize(); Finalize-lifetime pages are decommitted at the
+   *  end of finalize().  Zero-sized sub-regions indicate no allocation from that
+   *  pool. */
+  struct FinalizedAllocInfo {
+    size_t non_exec_offset;         // offset of non-exec Standard region (or 0 if unused)
+    size_t non_exec_standard_size;  // bytes retained in non-exec pool after finalize
+    size_t exec_offset;             // offset of exec Standard region (or midpoint_ if unused)
+    size_t exec_standard_size;      // bytes retained in exec pool after finalize
+    std::vector<llvm::orc::shared::WrapperFunctionCall> DeallocActions;
+    std::vector<OverflowBlock> overflow_blocks;
+  };
+
+  /*! \brief Bump-allocate from the selected pool.  Returns offset within arena. */
+  llvm::Expected<size_t> bumpAllocate(size_t size, bool is_exec);
+
+  /*! \brief Return a region to the appropriate free list (coalesces adjacent blocks).
+   *         Pool is identified by comparing offset against midpoint_. */
+  void freeRegion(size_t offset, size_t size);
+
+  // ── Platform abstraction ──
+  static void* reserveVA(size_t size);
+  static void releaseVA(void* addr, size_t size);
+  llvm::Error commitPages(void* addr, size_t size);
+  static void decommitPages(void* addr, size_t size);
+  static llvm::Error protectPages(void* addr, size_t size, llvm::orc::MemProt Prot);
+
+  char* arena_base_;
+  size_t arena_capacity_;
+  size_t page_size_;
+
+  // ── Dual-pool split ──
+  // The arena is partitioned at midpoint_ (a 2 MB-aligned offset) into:
+  //   non-exec pool  = [arena_base_,           arena_base_ + midpoint_         )
+  //   exec pool      = [arena_base_ + midpoint_, arena_base_ + exec_bump_limit_)
+  // Both pools grow upward from their base.  The exec pool starts on a 2 MB
+  // boundary so r-x segments can pack as tightly as possible into 2 MB pages.
+  //
+  // exec_bump_limit_ = min(arena_capacity_, kPCRelReach).  Bytes beyond this
+  // limit stay reserved (VA only, no commit) but are not used for allocation
+  // so cross-pool references always fit within the PC-relative reach.
+  size_t midpoint_;
+  size_t exec_bump_limit_;
+
+  std::mutex mu_;
+  size_t non_exec_bump_;  // next free offset in non-exec pool ∈ [0, midpoint_]
+  size_t exec_bump_;      // next free offset in exec pool     ∈ [midpoint_, exec_bump_limit_]
+
+  struct FreeBlock {
+    size_t offset;
+    size_t size;
+  };
+  std::vector<FreeBlock> free_list_non_exec_;
+  std::vector<FreeBlock> free_list_exec_;
+
+  /*! \brief Per-slab commit flags (0 = uncommitted, 1 = committed).
+   *  Lock-free: each slab is committed exactly once via compare_exchange. */
+  std::unique_ptr<std::atomic<uint8_t>[]> slab_committed_;
+  size_t num_slabs_ = 0;
+};
+
+}  // namespace orcjit
+}  // namespace ffi
+}  // namespace tvm
+
+#endif  // TVM_FFI_ORCJIT_ORCJIT_MEMORY_MANAGER_H_
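
The "each slab is committed exactly once via compare_exchange" idea from the header comment can be sketched without any OS calls. The snippet below is a hedged illustration only: `SlabCommitTracker` and `Claim` are hypothetical names, and the place where a real implementation would change page protections is marked with a comment rather than implemented. A real implementation also has to make sure a slab's pages are actually usable before another thread that saw the flag already set touches them; that detail is outside this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

constexpr size_t kSlabSize = size_t{2} << 20;  // 2 MB, as in the header

// Hypothetical helper: tracks which 2 MB slabs have been claimed for commit.
class SlabCommitTracker {
 public:
  explicit SlabCommitTracker(size_t capacity)
      : num_slabs_((capacity + kSlabSize - 1) / kSlabSize),
        flags_(new std::atomic<uint8_t>[num_slabs_]) {
    assert(capacity > 0);
    for (size_t i = 0; i < num_slabs_; ++i) flags_[i].store(0, std::memory_order_relaxed);
  }

  // Claims every slab overlapping [offset, offset + size).  Returns how many
  // slabs this call flipped from 0 to 1, i.e. how many it would have to commit.
  size_t Claim(size_t offset, size_t size) {
    if (size == 0) return 0;
    size_t first = offset / kSlabSize;
    size_t last = (offset + size - 1) / kSlabSize;
    size_t claimed = 0;
    for (size_t i = first; i <= last && i < num_slabs_; ++i) {
      uint8_t expected = 0;
      // Only one caller wins the 0 -> 1 transition for a given slab.
      if (flags_[i].compare_exchange_strong(expected, 1, std::memory_order_acq_rel)) {
        // A real implementation would make the slab's pages read-write here.
        ++claimed;
      }
    }
    return claimed;
  }

 private:
  size_t num_slabs_;  // declared before flags_ so it is initialized first
  std::unique_ptr<std::atomic<uint8_t>[]> flags_;
};
```

With an 8 MB tracker, a first `Claim(0, 100)` reports one new slab, and a second small allocation inside the same 2 MB range reports zero, which is the batching effect the header describes.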
diff --git a/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.cc b/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.cc
index ed2290e..0bca531 100644
--- a/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.cc
+++ b/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.cc
@@ -24,6 +24,7 @@
 
 #include "orcjit_session.h"
 
+#include <llvm/ADT/DenseMap.h>
 #include <llvm/ExecutionEngine/JITLink/JITLink.h>
 #include <llvm/ExecutionEngine/JITLink/x86_64.h>
 #include <llvm/ExecutionEngine/Orc/AbsoluteSymbols.h>
@@ -34,6 +35,7 @@
 #include <llvm/ExecutionEngine/Orc/Shared/ExecutorSymbolDef.h>
 #include <llvm/Support/DynamicLibrary.h>
 #include <llvm/Support/Error.h>
+#include <llvm/Support/Process.h>
 #include <llvm/Support/TargetSelect.h>
 #include <llvm/TargetParser/SubtargetFeature.h>
 #include <tvm/ffi/cast.h>
@@ -44,6 +46,8 @@
 #include <cstddef>
 #include <cstring>
 
+#include "orcjit_memory_manager.h"
+
 #ifdef _WIN32
 #ifndef NOMINMAX
 #define NOMINMAX
@@ -441,26 +445,188 @@ class DLLImportDefinitionGenerator : public llvm::orc::DefinitionGenerator {
 };
 #endif  // _WIN32
 
-ORCJITExecutionSessionObj::ORCJITExecutionSessionObj(const std::string& orc_rt_path)
+#if defined(__linux__) && (defined(__x86_64__) || defined(_M_X64))
+/*! \brief Fix LLVM JITLink GOTPCRELX relaxation bug (x86_64).
+ *
+ * optimizeGOTAndStubAccesses() relaxes `call *foo@GOTPCREL(%rip)` (ff 15)
+ * to `addr32 call foo` (67 e8) and sets the edge to Pointer32 (absolute
+ * 32-bit).  But `call rel32` is always PC-relative — the CPU computes
+ * target = RIP + imm32, not target = imm32.  The Pointer32 fixup writes
+ * the absolute address, producing a wrong displacement.
+ *
+ * This only manifests when external symbols resolve to low addresses
+ * (< 4 GB, e.g. PLT entries in a non-PIE executable) while JIT code is
+ * at high addresses (the arena at 0x7f...).  The optimization fires
+ * because isUInt<32>(target) is true, but the resulting fixup is wrong.
+ *
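+ * The PreFixupPass rewrites such edges to BranchPCRel32 when the target is
+ * within signed 32-bit PC-relative range, and otherwise reverts the relaxation
+ * to an indirect call/jmp through the GOT.  See orcjit_memory_manager.h for
+ * full context.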
+ */
+class GOTPCRELXFixPlugin : public llvm::orc::ObjectLinkingLayer::Plugin {
+ public:
+  void modifyPassConfig(llvm::orc::MaterializationResponsibility& MR, llvm::jitlink::LinkGraph& G,
+                        llvm::jitlink::PassConfiguration& Config) override {
+    Config.PreFixupPasses.emplace_back(fixBrokenGOTPCRELXRelaxation);
+  }
+  llvm::Error notifyFailed(llvm::orc::MaterializationResponsibility& MR) override {
+    return llvm::Error::success();
+  }
+  llvm::Error notifyRemovingResources(llvm::orc::JITDylib& JD, llvm::orc::ResourceKey K) override {
+    return llvm::Error::success();
+  }
+  void notifyTransferringResources(llvm::orc::JITDylib& JD, llvm::orc::ResourceKey DstKey,
+                                   llvm::orc::ResourceKey SrcKey) override {}
+
+ private:
+  /*! \brief Correct broken GOTPCRELX relaxations produced by
+   *         optimizeGOTAndStubAccesses().
+   *
+   * Strategy:
+   *  1. Build target-symbol → GOT-entry-symbol map (O(B+S) up front).
+   *  2. For every Pointer32 edge whose preceding bytes are 67 e8
+   *     (relaxed call) or e9 (relaxed jmp):
+   *     - If the target is reachable via a signed 32-bit PC-relative
+   *       displacement, change the edge to BranchPCRel32.
+   *     - Otherwise revert the relaxation: restore the original
+   *       indirect-call/jmp opcode bytes (ff 15 / ff 25), retarget
+   *       the edge to the GOT entry, and use PCRel32 with addend 0
+   *       (JITLink normalises GOTPCRELX addends to 0).
+   */
+  static llvm::Error fixBrokenGOTPCRELXRelaxation(llvm::jitlink::LinkGraph& G) {
+    using namespace llvm::jitlink;
+    // Build block → first symbol at offset 0 (for GOT entry symbol lookup).
+    llvm::DenseMap<Block*, Symbol*> BlockToSym;
+    for (auto* Sym : G.defined_symbols()) {
+      if (Sym->getOffset() == 0 && !BlockToSym.count(&Sym->getBlock())) {
+        BlockToSym[&Sym->getBlock()] = Sym;
+      }
+    }
+
+    // Build target symbol → GOT entry symbol map.
+    // GOT entries are pointer-sized blocks with exactly one Pointer64 edge.
+    llvm::DenseMap<Symbol*, Symbol*> SymToGOTSym;
+    for (auto* B : G.blocks()) {
+      if (B->getSize() != G.getPointerSize()) continue;
+      if (B->edges_size() != 1) continue;
+      auto& E = *B->edges().begin();
+      if (E.getKind() == x86_64::Pointer64) {
+        auto It = BlockToSym.find(B);
+        if (It != BlockToSym.end()) {
+          SymToGOTSym[&E.getTarget()] = It->second;
+        }
+      }
+    }
+
+    for (auto* B : G.blocks()) {
+      for (auto& E : B->edges()) {
+        if (E.getKind() != x86_64::Pointer32) continue;
+        if (E.getOffset() < 2) continue;
+
+        auto MutableContent = B->getMutableContent(G);
+        auto* FixupData = reinterpret_cast<uint8_t*>(MutableContent.data()) + E.getOffset();
+        uint8_t Prev2 = FixupData[-2];
+        uint8_t Prev1 = FixupData[-1];
+
+        bool isRelaxedCall = (Prev2 == 0x67 && Prev1 == 0xe8);
+        bool isRelaxedJmp = (Prev1 == 0xe9);
+        if (!isRelaxedCall && !isRelaxedJmp) continue;
+
+        // Check if PC-relative displacement would fit.
+        auto TargetAddr = E.getTarget().getAddress();
+        auto FixupAddr = B->getFixupAddress(E);
+        int64_t Displacement = TargetAddr.getValue() - (FixupAddr.getValue() + 4) + E.getAddend();
+        if (llvm::isInt<32>(Displacement)) {
+          E.setKind(x86_64::BranchPCRel32);
+          continue;
+        }
+
+        // Distance doesn't fit — revert to indirect call/jmp through GOT.
+        auto It = SymToGOTSym.find(&E.getTarget());
+        if (It == SymToGOTSym.end()) {
+          return llvm::make_error<llvm::StringError>(
+              "Cannot revert GOTPCRELX relaxation: no GOT entry for " +
+                  (E.getTarget().hasName() ? std::string(*E.getTarget().getName())
+                                           : std::string("<anon>")),
+              llvm::inconvertibleErrorCode());
+        }
+
+        Symbol* GOTSym = It->second;
+        if (isRelaxedCall) {
+          // Restore: 67 e8 → ff 15 (call *[rip+disp32])
+          FixupData[-2] = 0xff;
+          FixupData[-1] = 0x15;
+        } else {
+          // Restore: e9 XX XX XX XX 90 → ff 25 XX XX XX XX
+          FixupData[-1] = 0xff;
+          FixupData[0] = 0x25;
+          // For jmp, the optimization shifted offset by -1; shift back.
+          E.setOffset(E.getOffset() + 1);
+        }
+        E.setKind(x86_64::PCRel32);
+        E.setTarget(*GOTSym);
+        E.setAddend(0);
+      }
+    }
+    return llvm::Error::success();
+  }
+};
+#endif  // __linux__ && __x86_64__
+
+ORCJITExecutionSessionObj::ORCJITExecutionSessionObj(const std::string& orc_rt_path,
+                                                     int64_t arena_size_bytes)
     : jit_(nullptr) {
-  // Helper: force JITLink's ObjectLinkingLayer on platforms where
-  // the default RTDyldObjectLinkingLayer won't work.
+  // Create arena memory manager — pre-reserves contiguous VA region so all
+  // JIT allocations stay within PC-relative relocation range (±2GB x86_64,
+  // ±4GB AArch64).  Eliminates scattered-mmap relocation overflow (LLVM #173269).
   //
-  // macOS: MachOPlatform (via ExecutorNativePlatform) requires ObjectLinkingLayer.
+  // arena_size_bytes: 0 = arch default (4GB x86_64, 8GB AArch64, with fallback),
+  //                   >0 = custom size, <0 = disable arena.
+  // The parameter is Linux-only; on macOS/Windows the arena is compiled out
+  // entirely (see #ifdef below) and the value is ignored.
   //
-  // Windows: LLJIT defaults to RTDyldObjectLinkingLayer for COFF x86_64
-  // (see LLJIT.cpp, LLJITBuilderState::prepareForConstruction). We need
-  // ObjectLinkingLayer because:
-  //   1. Our InitFiniPlugin inherits ObjectLinkingLayer::Plugin — RTDyld has
-  //      no plugin API, so the static_cast<ObjectLinkingLayer&> would crash.
-  //   2. We skip the ORC runtime on Windows (COFFPlatform requires MSVC CRT
-  //      symbols like _CxxThrowException, RTTI vtables, iostream objects that
-  //      are not resolvable in the JIT), so we handle .CRT$XC*/.CRT$XT*
-  //      init/fini sections ourselves via the plugin.
+  // The default is strictly larger than the relocation limit so the arena is
+  // never the bottleneck — JITLink's own overflow check fires first, matching
+  // dlopen/ld.so semantics.  The constructor halves capacity on mmap failure
+  // (RLIMIT_AS, containers) down to 256 MB.
   //
-  // Linux: LLJIT already defaults to ObjectLinkingLayer for ELF, no override needed.
-  auto setup_builder = [](llvm::orc::LLJITBuilder& builder) {
-#if defined(__APPLE__) || defined(_WIN32)
+  // LLJIT auto-configures ObjectLinkingLayer (JITLink) on x86_64 and aarch64
+  // Linux (see LLJITBuilderState::prepareForConstruction).  We override
+  // the layer creator to pass our arena MM.  macOS/Windows are excluded:
+  // macOS MachOPlatform teardown crashes with the arena; Windows needs
+  // further testing.
+#ifdef __linux__
+  if (arena_size_bytes >= 0) {
+    auto page_size = llvm::sys::Process::getPageSizeEstimate();
+    size_t capacity;
+    if (arena_size_bytes > 0) {
+      capacity = static_cast<size_t>(arena_size_bytes);
+    } else {
+#if defined(__aarch64__)
+      capacity = ArenaJITLinkMemoryManager::kDefaultArenaCapacity_AArch64;
+#else
+      capacity = ArenaJITLinkMemoryManager::kDefaultArenaCapacity_x86_64;
+#endif
+    }
+    memory_manager_ = std::make_unique<ArenaJITLinkMemoryManager>(page_size, capacity);
+  }
+#endif
+
+  auto setup_builder = [this](llvm::orc::LLJITBuilder& builder) {
+#ifdef __linux__
+    if (memory_manager_) {
+      builder.setObjectLinkingLayerCreator(
+          [this](llvm::orc::ExecutionSession& ES)
+              -> llvm::Expected<std::unique_ptr<llvm::orc::ObjectLayer>> {
+            auto OLL = std::make_unique<llvm::orc::ObjectLinkingLayer>(ES, *memory_manager_);
+#if defined(__x86_64__) || defined(_M_X64)
+            OLL->addPlugin(std::make_unique<GOTPCRELXFixPlugin>());
+#endif
+            return OLL;
+          });
+    }  // if (memory_manager_)
+#elif defined(__APPLE__) || defined(_WIN32)
+    // macOS: MachOPlatform (via ExecutorNativePlatform) requires ObjectLinkingLayer.
+    // Windows: need ObjectLinkingLayer for InitFiniPlugin and DLLImportDefinitionGenerator.
     builder.setObjectLinkingLayerCreator(
         [](llvm::orc::ExecutionSession& ES)
             -> llvm::Expected<std::unique_ptr<llvm::orc::ObjectLayer>> {
@@ -608,8 +774,10 @@ ORCJITExecutionSessionObj::ORCJITExecutionSessionObj(const std::string& orc_rt_p
 #endif
 }
 
-ORCJITExecutionSession::ORCJITExecutionSession(const std::string& orc_rt_path) {
-  ObjectPtr<ORCJITExecutionSessionObj> obj = make_object<ORCJITExecutionSessionObj>(orc_rt_path);
+ORCJITExecutionSession::ORCJITExecutionSession(const std::string& orc_rt_path,
+                                               int64_t arena_size_bytes) {
+  ObjectPtr<ORCJITExecutionSessionObj> obj =
+      make_object<ORCJITExecutionSessionObj>(orc_rt_path, arena_size_bytes);
   data_ = std::move(obj);
 }
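
The decision made by fixBrokenGOTPCRELXRelaxation() for each relaxed branch comes down to one displacement calculation. The following is a minimal sketch of that arithmetic only, using plain integers instead of JITLink types; `BranchDisplacement`, `FitsInt32`, `ClassifyRelaxedBranch`, and `FixAction` are hypothetical names, and the addresses in the usage note are illustrative values, not measurements.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical result type for the two outcomes described in the plugin comment.
enum class FixAction { kUseBranchPCRel32, kRevertToGOT };

// Displacement that `call rel32` / `jmp rel32` needs.  The CPU adds imm32 to
// the address of the next instruction, which is 4 bytes past the fixup.
inline int64_t BranchDisplacement(uint64_t target, uint64_t fixup_addr, int64_t addend) {
  assert(fixup_addr <= std::numeric_limits<uint64_t>::max() - 4);
  return static_cast<int64_t>(target - (fixup_addr + 4)) + addend;
}

// True when v is representable as a signed 32-bit immediate.
inline bool FitsInt32(int64_t v) {
  return v >= std::numeric_limits<int32_t>::min() && v <= std::numeric_limits<int32_t>::max();
}

// Keep the direct branch when the target is reachable; otherwise fall back to
// the indirect form through the GOT.
inline FixAction ClassifyRelaxedBranch(uint64_t target, uint64_t fixup_addr, int64_t addend) {
  return FitsInt32(BranchDisplacement(target, fixup_addr, addend))
             ? FixAction::kUseBranchPCRel32
             : FixAction::kRevertToGOT;
}
```

For a fixup at 0x7f0000001000, a target at 0x7f0000002000 is reachable and stays a direct branch, while a low target such as 0x401030 is far outside the signed 32-bit range and has to go back through the GOT.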
 
diff --git a/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.h b/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.h
index cacffcd..e64c17b 100644
--- a/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.h
+++ b/addons/tvm_ffi_orcjit/src/ffi/orcjit_session.h
@@ -31,9 +31,12 @@
 #include <tvm/ffi/string.h>
 
 #include <atomic>
+#include <memory>
 #include <string>
 #include <unordered_map>
 
+#include "orcjit_memory_manager.h"
+
 namespace tvm {
 namespace ffi {
 namespace orcjit {
@@ -52,7 +55,8 @@ class ORCJITExecutionSessionObj : public Object {
   /*!
    * \brief Default constructor (for make_object)
    */
-  explicit ORCJITExecutionSessionObj(const std::string& orc_rt_path = "");
+  explicit ORCJITExecutionSessionObj(const std::string& orc_rt_path = "",
+                                     int64_t arena_size_bytes = 0);
 
   /*!
    * \brief Create a new DynamicLibrary (JITDylib) in this session
@@ -95,6 +99,8 @@ class ORCJITExecutionSessionObj : public Object {
   void AddPendingDeinitializer(llvm::orc::JITDylib* jd, const InitFiniEntry& entry);
 
  private:
+  /*! \brief Arena memory manager — must be declared before jit_ for destruction order */
+  std::unique_ptr<ArenaJITLinkMemoryManager> memory_manager_;
   /*! \brief The LLVM ORC JIT instance */
   std::unique_ptr<llvm::orc::LLJIT> jit_;
 
@@ -116,7 +122,8 @@ class ORCJITExecutionSession : public ObjectRef {
    * \brief Create a new ExecutionSession
    * \return The created execution session instance
    */
-  explicit ORCJITExecutionSession(const std::string& orc_rt_path = "");
+  explicit ORCJITExecutionSession(const std::string& orc_rt_path = "",
+                                  int64_t arena_size_bytes = 0);
   // Required: define object reference methods
   TVM_FFI_DEFINE_OBJECT_REF_METHODS_NOTNULLABLE(ORCJITExecutionSession, ObjectRef,
                                                 ORCJITExecutionSessionObj);
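
The meaning of the new `arena_size_bytes` parameter (0 = architecture default, >0 = custom size, <0 = arena disabled) and the documented halving fallback can be sketched as two small functions. This is an illustration under assumptions, not the committed constructor: `RequestedCapacity` and `ReserveWithHalving` are hypothetical names, the reservation step is abstracted as a caller-supplied predicate instead of a real mmap() call, and a 64-bit `size_t` is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>

// Values copied from the header in this commit; 64-bit size_t assumed.
constexpr size_t kDefaultArenaCapacity_x86_64 = size_t{4} << 30;   // 4 GB
constexpr size_t kDefaultArenaCapacity_AArch64 = size_t{8} << 30;  // 8 GB
constexpr size_t kMinArenaCapacity = size_t{256} << 20;            // 256 MB

// Maps the user-facing parameter to a requested capacity.
// Returns std::nullopt when the arena is disabled (negative value).
std::optional<size_t> RequestedCapacity(int64_t arena_size_bytes, bool is_aarch64) {
  if (arena_size_bytes < 0) return std::nullopt;
  if (arena_size_bytes > 0) return static_cast<size_t>(arena_size_bytes);
  return is_aarch64 ? kDefaultArenaCapacity_AArch64 : kDefaultArenaCapacity_x86_64;
}

// Tries capacity, capacity/2, ... down to floor.  Returns the first size the
// predicate accepts, or 0 if every attempt fails.
size_t ReserveWithHalving(size_t capacity, size_t floor,
                          const std::function<bool(size_t)>& try_reserve) {
  assert(floor > 0);
  for (size_t cap = capacity; cap >= floor; cap /= 2) {
    if (try_reserve(cap)) return cap;
  }
  return 0;
}
```

With a predicate that only accepts reservations of 1 GB or less (roughly what a tight RLIMIT_AS would look like), a 4 GB request settles on 1 GB after two halvings.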
diff --git a/addons/tvm_ffi_orcjit/tests/sources/c/fake_fatbin.c b/addons/tvm_ffi_orcjit/tests/sources/c/fake_fatbin.c
new file mode 100644
index 0000000..867ac73
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/c/fake_fatbin.c
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*
+ * Simulates an NVCC-compiled object with a large .nv_fatbin device blob.
+ * The fatbin data is referenced only by absolute relocations (R_X86_64_64 /
+ * R_AARCH64_ABS64), never by PC-relative relocations.  This lets us test
+ * overflow-region classification without needing a real CUDA toolchain.
+ *
+ * KEY DETAIL: References go through a pointer in .data (generates
+ * R_AARCH64_ABS64 / R_X86_64_64), not via ADRP/RIP-relative.  This
+ * mirrors real NVCC output where __NV_fatbin_* uses absolute relocations.
+ */
+#include <stdint.h>
+#include <tvm/ffi/c_api.h>
+
+#ifdef __APPLE__
+__attribute__((section("__DATA,.nv_fatbin"), used))
+#else
+__attribute__((section(".nv_fatbin"), used))
+#endif
+static const uint8_t fake_fatbin_data[4 * 1024 * 1024] = {0};
+
+/* Indirect reference: .data holds an absolute-relocation pointer to
+   .nv_fatbin.  Code accesses .data via PC-relative (ADRP / RIP), and
+   .data→.nv_fatbin is absolute.  No PC-relative edge crosses from any
+   section to .nv_fatbin, matching real NVCC objects. */
+static const void* const fatbin_ptr = fake_fatbin_data;
+static const uint64_t fatbin_size = sizeof(fake_fatbin_data);
+
+/* get_fatbin_size: return the size of the fake fatbin blob. */
+TVM_FFI_DLL_EXPORT int __tvm_ffi_get_fatbin_size(void* self, const TVMFFIAny* args,
+                                                 int32_t num_args, TVMFFIAny* result) {
+  result->type_index = kTVMFFIInt;
+  result->zero_padding = 0;
+  result->v_int64 = (int64_t)fatbin_size;
+  return 0;
+}
+
+/* get_fatbin_addr: return the address of the fake fatbin data.
+   Used by tests to verify overflow sections land outside the arena. */
+TVM_FFI_DLL_EXPORT int __tvm_ffi_get_fatbin_addr(void* self, const TVMFFIAny* args,
+                                                 int32_t num_args, TVMFFIAny* result) {
+  result->type_index = kTVMFFIInt;
+  result->zero_padding = 0;
+  result->v_int64 = (int64_t)(uintptr_t)fatbin_ptr;
+  return 0;
+}
diff --git a/addons/tvm_ffi_orcjit/tests/sources/c/test_addr.c b/addons/tvm_ffi_orcjit/tests/sources/c/test_addr.c
new file mode 100644
index 0000000..c57281b
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/c/test_addr.c
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*
+ * Returns the code address of this function — for arena co-location tests.
+ * Load into multiple libraries to verify they land in the same arena region.
+ */
+#include <stdint.h>
+#include <tvm/ffi/c_api.h>
+
+TVM_FFI_DLL_EXPORT int __tvm_ffi_code_address(void* self, const TVMFFIAny* args, int32_t num_args,
+                                              TVMFFIAny* result) {
+  result->type_index = kTVMFFIInt;
+  result->zero_padding = 0;
+  result->v_int64 = (int64_t)(uintptr_t)&__tvm_ffi_code_address;
+  return 0;
+}
diff --git a/addons/tvm_ffi_orcjit/tests/sources/c/test_hidden_caller.c b/addons/tvm_ffi_orcjit/tests/sources/c/test_hidden_caller.c
new file mode 100644
index 0000000..4e33285
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/c/test_hidden_caller.c
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*
+ * Caller library for ADRP overflow test (LLVM issue #173269).
+ *
+ * Takes the ADDRESS of hidden_helper_add via ADRP+ADD (no GOT,
+ * because of hidden visibility).  When this object and
+ * test_hidden_helper.o are in different mmap allocations >4GB
+ * apart, the ADRP immediate overflows — silent truncation on
+ * AArch64 causes a segfault.
+ *
+ * The arena memory manager fixes this by placing all objects
+ * in contiguous VA space (<< 4GB).
+ */
+#include <stdint.h>
+#include <tvm/ffi/c_api.h>
+
+/* Same hidden declaration — compiler uses ADRP+ADD to take address */
+__attribute__((visibility("hidden"))) extern int64_t hidden_helper_add(int64_t a, int64_t b);
+
+typedef int64_t (*binop_t)(int64_t, int64_t);
+
+/* call_hidden_add: take address of hidden_helper_add, then call via pointer.
+   On AArch64, generates:
+     ADRP x0, hidden_helper_add@PAGE       (R_AARCH64_ADR_PREL_PG_HI21, ±4GB)
+     ADD  x0, x0, hidden_helper_add@PAGEOFF (R_AARCH64_ADD_ABS_LO12_NC)
+   When hidden_helper_add is in a different allocation >4GB away, ADRP overflows. */
+TVM_FFI_DLL_EXPORT int __tvm_ffi_call_hidden_add(void* self, const TVMFFIAny* args,
+                                                 int32_t num_args, TVMFFIAny* result) {
+  volatile binop_t fn = &hidden_helper_add;
+  result->type_index = kTVMFFIInt;
+  result->zero_padding = 0;
+  result->v_int64 = fn(args[0].v_int64, args[1].v_int64);
+  return 0;
+}
+
+/* Return the address of this function's code — for co-location tests */
+TVM_FFI_DLL_EXPORT int __tvm_ffi_caller_code_address(void* self, const TVMFFIAny* args,
+                                                     int32_t num_args, TVMFFIAny* result) {
+  result->type_index = kTVMFFIInt;
+  result->zero_padding = 0;
+  result->v_int64 = (int64_t)(uintptr_t)&__tvm_ffi_caller_code_address;
+  return 0;
+}
diff --git a/addons/tvm_ffi_orcjit/tests/sources/c/test_hidden_helper.c b/addons/tvm_ffi_orcjit/tests/sources/c/test_hidden_helper.c
new file mode 100644
index 0000000..811218e
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/c/test_hidden_helper.c
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*
+ * Helper library for ADRP overflow test.
+ * Defines a hidden-visibility function whose ADDRESS is taken
+ * by test_hidden_caller.c.  On AArch64, the caller uses
+ * ADRP+ADD (no GOT) to compute the address — this overflows
+ * when the two objects are in different allocations >4GB apart.
+ *
+ * Reference: LLVM issue #173269
+ */
+#include <stdint.h>
+#include <tvm/ffi/c_api.h>
+
+/* Hidden visibility: caller uses ADRP+ADD instead of GOT */
+__attribute__((visibility("hidden"))) int64_t hidden_helper_add(int64_t a, int64_t b) {
+  return a + b;
+}
+
+/* Export a TVM FFI function that calls hidden_helper_add directly */
+TVM_FFI_DLL_EXPORT int __tvm_ffi_hidden_add(void* self, const TVMFFIAny* args, int32_t num_args,
+                                            TVMFFIAny* result) {
+  result->type_index = kTVMFFIInt;
+  result->zero_padding = 0;
+  result->v_int64 = hidden_helper_add(args[0].v_int64, args[1].v_int64);
+  return 0;
+}
+
+/* Return the address of this function's code — for co-location tests */
+TVM_FFI_DLL_EXPORT int __tvm_ffi_helper_code_address(void* self, const TVMFFIAny* args,
+                                                     int32_t num_args, TVMFFIAny* result) {
+  result->type_index = kTVMFFIInt;
+  result->zero_padding = 0;
+  result->v_int64 = (int64_t)(uintptr_t)&__tvm_ffi_helper_code_address;
+  return 0;
+}
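Editorial note (not part of the commit): the ±4GB figure quoted in the hidden caller/helper comments follows from ADRP encoding a signed 21-bit count of 4 KiB pages. The sketch below only illustrates that arithmetic; the function names and the sample addresses are hypothetical and do not come from the patch or from LLVM.

```python
PAGE_SHIFT = 12  # ADRP addresses 4 KiB pages


def adrp_page_delta(pc: int, target: int) -> int:
    """Signed page distance that the ADRP immediate has to encode."""
    return (target >> PAGE_SHIFT) - (pc >> PAGE_SHIFT)


def adrp_in_range(pc: int, target: int) -> bool:
    """True if the page delta fits ADRP's signed 21-bit immediate."""
    delta = adrp_page_delta(pc, target)
    return -(1 << 20) <= delta < (1 << 20)


GIB = 1 << 30
caller = 0x7F00_0000_0000  # hypothetical address of the caller's ADRP

# Helper placed 16 MiB away, as inside a small arena.
print(adrp_in_range(caller, caller + 16 * 1024 * 1024))
# Helper placed 5 GiB away, as after a scattered mmap.
print(adrp_in_range(caller, caller + 5 * GIB))
```

The first call reports an in-range target and the second an out-of-range one, which is the situation the hidden caller/helper pair is meant to provoke when the two objects land in distant allocations.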
diff --git a/addons/tvm_ffi_orcjit/tests/sources/cc/fake_fatbin.cc b/addons/tvm_ffi_orcjit/tests/sources/cc/fake_fatbin.cc
new file mode 100644
index 0000000..8fc1636
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/cc/fake_fatbin.cc
@@ -0,0 +1,64 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Simulates an NVCC-compiled object with a large .nv_fatbin device blob.
+// The fatbin data is referenced only by absolute relocations (R_*_64 /
+// R_AARCH64_ABS64), never by PC-relative relocations.  This lets us test
+// overflow-region classification without needing a real CUDA toolchain.
+//
+// KEY DETAIL: References go through a pointer in .data (generates
+// R_AARCH64_ABS64 / R_X86_64_64), not via ADRP/RIP-relative.  This
+// mirrors real NVCC output where __NV_fatbin_* uses absolute relocations.
+
+#include <tvm/ffi/c_api.h>
+
+#include <cstdint>
+
+#ifdef __APPLE__
+__attribute__((section("__DATA,.nv_fatbin"), used))
+#else
+__attribute__((section(".nv_fatbin"), used))
+#endif
+static const uint8_t fake_fatbin_data[4 * 1024 * 1024] = {0};
+
+// Indirect reference: .data holds an absolute-relocation pointer to
+// .nv_fatbin.  Code accesses .data via PC-relative (ADRP / RIP), and
+// .data→.nv_fatbin is absolute.  No PC-relative edge crosses from any
+// section to .nv_fatbin, matching real NVCC objects.
+static const void* const fatbin_ptr = fake_fatbin_data;
+static const uint64_t fatbin_size = sizeof(fake_fatbin_data);
+
+// get_fatbin_size: return the size of the fake fatbin blob.
+extern "C" {
+TVM_FFI_DLL_EXPORT int __tvm_ffi_get_fatbin_size(void* self, const TVMFFIAny* args,
+                                                 int32_t num_args, TVMFFIAny* result) {
+  result->type_index = kTVMFFIInt;
+  result->zero_padding = 0;
+  result->v_int64 = static_cast<int64_t>(fatbin_size);
+  return 0;
+}
+
+// get_fatbin_addr: return the address of the fake fatbin data.
+// Used by tests to verify overflow sections land outside the arena.
+TVM_FFI_DLL_EXPORT int __tvm_ffi_get_fatbin_addr(void* self, const TVMFFIAny* args,
+                                                 int32_t num_args, TVMFFIAny* result) {
+  result->type_index = kTVMFFIInt;
+  result->zero_padding = 0;
+  result->v_int64 = reinterpret_cast<int64_t>(fatbin_ptr);
+  return 0;
+}
+}
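Editorial note (not part of the commit): the fake fatbin sources rely on the fact that an absolute 64-bit relocation can point anywhere, while a PC-relative 32-bit one cannot. A minimal sketch of the two fixup formulas, using the usual S (symbol), A (addend), P (place) notation; the helper names and addresses are hypothetical, and this is not how JITLink itself is written.

```python
def apply_abs64(symbol: int, addend: int = 0) -> int:
    """R_X86_64_64 / R_AARCH64_ABS64: store S + A as a full 64-bit value."""
    return (symbol + addend) & 0xFFFF_FFFF_FFFF_FFFF


def apply_pc32(symbol: int, addend: int, place: int) -> int:
    """R_X86_64_PC32: store S + A - P, which must fit a signed 32-bit field."""
    value = symbol + addend - place
    if not -(1 << 31) <= value < (1 << 31):
        raise OverflowError(f"PC32 out of range: {value:#x}")
    return value


data_slot = 0x7F00_0000_1000  # hypothetical .data pointer inside the arena
fatbin = 0x1000_0000_0000     # hypothetical .nv_fatbin address far outside it

# The absolute fixup always succeeds, whatever the distance.
print(hex(apply_abs64(fatbin)))

# A PC-relative fixup over the same distance would be rejected.
try:
    apply_pc32(fatbin, -4, data_slot)
except OverflowError as err:
    print("overflow:", err)
```

Because the only edge into `.nv_fatbin` is of the first kind, placing that section outside the arena does not put any relocation at risk.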
diff --git a/addons/tvm_ffi_orcjit/tests/sources/cc/test_addr.cc b/addons/tvm_ffi_orcjit/tests/sources/cc/test_addr.cc
new file mode 100644
index 0000000..43bb504
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/cc/test_addr.cc
@@ -0,0 +1,26 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Returns the code address of this function — for arena co-location tests.
+// Load into multiple libraries to verify they land in the same arena region.
+
+#include <tvm/ffi/function.h>
+
+#include <cstdint>
+
+int64_t code_address_impl() { return reinterpret_cast<int64_t>(&code_address_impl); }
+TVM_FFI_DLL_EXPORT_TYPED_FUNC(code_address, code_address_impl);
diff --git a/addons/tvm_ffi_orcjit/tests/sources/cc/test_hidden_caller.cc b/addons/tvm_ffi_orcjit/tests/sources/cc/test_hidden_caller.cc
new file mode 100644
index 0000000..babd6d3
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/cc/test_hidden_caller.cc
@@ -0,0 +1,52 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Caller library for ADRP overflow test (LLVM issue #173269).
+//
+// Takes the ADDRESS of hidden_helper_add via ADRP+ADD (no GOT,
+// because of hidden visibility).  When this object and
+// test_hidden_helper.o are in different mmap allocations >4GB
+// apart, the ADRP immediate overflows — silent truncation on
+// AArch64 causes a segfault.
+//
+// The arena memory manager fixes this by placing all objects
+// in contiguous VA space (<< 4GB).
+
+#include <tvm/ffi/c_api.h>
+
+#include <cstdint>
+
+// Same hidden declaration — compiler uses ADRP+ADD to take address
+__attribute__((visibility("hidden"))) extern int64_t hidden_helper_add(int64_t a, int64_t b);
+
+using binop_t = int64_t (*)(int64_t, int64_t);
+
+// call_hidden_add: take address of hidden_helper_add, then call via pointer.
+// On AArch64, generates:
+//   ADRP x0, hidden_helper_add@PAGE       (R_AARCH64_ADR_PREL_PG_HI21, ±4GB)
+//   ADD  x0, x0, hidden_helper_add@PAGEOFF (R_AARCH64_ADD_ABS_LO12_NC)
+// When hidden_helper_add is in a different allocation >4GB away, ADRP overflows.
+extern "C" {
+TVM_FFI_DLL_EXPORT int __tvm_ffi_call_hidden_add(void* self, const TVMFFIAny* args,
+                                                 int32_t num_args, TVMFFIAny* result) {
+  volatile binop_t fn = &hidden_helper_add;
+  result->type_index = kTVMFFIInt;
+  result->zero_padding = 0;
+  result->v_int64 = fn(args[0].v_int64, args[1].v_int64);
+  return 0;
+}
+}
diff --git a/addons/tvm_ffi_orcjit/tests/sources/cc/test_hidden_helper.cc b/addons/tvm_ffi_orcjit/tests/sources/cc/test_hidden_helper.cc
new file mode 100644
index 0000000..8de2de1
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/sources/cc/test_hidden_helper.cc
@@ -0,0 +1,44 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Helper library for ADRP overflow test.
+// Defines a hidden-visibility function whose ADDRESS is taken
+// by test_hidden_caller.cc.  On AArch64, the caller uses
+// ADRP+ADD (no GOT) to compute the address — this overflows
+// when the two objects are in different allocations >4GB apart.
+//
+// Reference: LLVM issue #173269
+
+#include <tvm/ffi/c_api.h>
+
+#include <cstdint>
+
+// Hidden visibility: caller uses ADRP+ADD instead of GOT
+__attribute__((visibility("hidden"))) int64_t hidden_helper_add(int64_t a, int64_t b) {
+  return a + b;
+}
+
+// Export a TVM FFI function that calls hidden_helper_add directly
+extern "C" {
+TVM_FFI_DLL_EXPORT int __tvm_ffi_hidden_add(void* self, const TVMFFIAny* args, int32_t num_args,
+                                            TVMFFIAny* result) {
+  result->type_index = kTVMFFIInt;
+  result->zero_padding = 0;
+  result->v_int64 = hidden_helper_add(args[0].v_int64, args[1].v_int64);
+  return 0;
+}
+}
diff --git a/addons/tvm_ffi_orcjit/tests/test_arena.py b/addons/tvm_ffi_orcjit/tests/test_arena.py
new file mode 100644
index 0000000..a5ad22a
--- /dev/null
+++ b/addons/tvm_ffi_orcjit/tests/test_arena.py
@@ -0,0 +1,674 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Tests for JIT memory arena — verifies co-location and relocation safety.
+
+Background
+----------
+LLVM ORC JIT v2 uses ``InProcessMemoryMapper`` (backed by
+``MapperJITLinkMemoryManager``) to allocate JIT memory.  Each allocation
+is a separate ``mmap(MAP_ANONYMOUS)`` call whose address the kernel picks.
+Under virtual-address (VA) pressure — leaked slabs from failed
+materializations, long-running pytest sessions holding tracebacks, or
+simply a fragmented address space — the kernel can place successive
+allocations far apart.
+
+This matters for **PC-relative relocations with limited range**:
+
+- **x86_64 R_X86_64_PC32 / Delta32**: ±2 GB range.  GCC-compiled C++
+  objects reference ``__dso_handle`` (used by ``__cxa_atexit`` for DSO
+  identification) via PC32 when the symbol has hidden visibility.
+  LLVM's ``ELFNixPlatform`` defines ``__dso_handle`` per JITDylib in a
+  separate ``DSOHandleMaterializationUnit`` — a tiny ``LinkGraph``
+  allocated independently of the code that references it.  If those two
+  allocations land >2 GB apart, the Delta32 fixup overflows.
+
+- **AArch64 ADRP+ADD**: ±4 GB range.  Hidden-visibility cross-object
+  calls use ADRP (page-relative) which has the same scatter problem
+  at a wider threshold.
+
+The **arena memory manager** solves this by pre-reserving a contiguous
+VA region (default 4 GB x86_64 / 8 GB AArch64, with fallback to smaller
+sizes) via ``mmap(PROT_NONE)`` and bump-allocating within it, guaranteeing
+all JIT allocations stay within relocation range regardless of external
+VA pressure.
+
+Note on ``-fPIC`` vs ``-fpie``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+With ``-fPIC`` (the default for shared-library code), GCC may use
+``R_X86_64_GOTPCRELX`` (GOT-relative, load through the GOT) for
+hidden-visibility externals like ``__dso_handle``.  GOT entries are
+co-located with code, so there is no ±2 GB range issue.  With ``-fpie``
+(position-independent executable), GCC prefers the shorter direct
+``R_X86_64_PC32``, which *does* have the ±2 GB limit.  The Delta32
+overflow tests (test 7) therefore build with ``-fpie`` to force the
+problematic relocation type.
+
+Test structure
+--------------
+1. **Co-location** (test 1): arena keeps objects within 16 MB.
+2. **Scatter baseline** (test 2): without arena, VA blocker pushes
+   objects >128 MB apart — proves arena is responsible for co-location.
+3. **Hidden-symbol calls** (test 3): ADRP/PC32 cross-object calls
+   succeed under VA pressure with arena.
+4. **Large data section** (test 4): 4 MB ``.nv_fatbin`` section loads
+   correctly within the arena.
+5. **Overflow section** (test 5): ``.nv_fatbin`` data is allocated
+   outside the arena via separate mmap.
+6. **Leaked materialization** (test 6): ``__dso_handle`` resolves after
+   prior sessions leaked mmap slabs from failed materializations.
+7. **Delta32 overflow** (test 7): ``-fpie`` GCC objects + 3 GB VA
+   blocker.  With arena → PASSES; without arena → Delta32 overflow.
+
+All tests use a small arena (16 MB) and 256 MB-3 GB VA blockers -- safe
+for CI containers.
+"""
+
+from __future__ import annotations
+
+import ctypes
+import ctypes.util
+import functools
+import platform
+import sys
+from pathlib import Path
+
+import pytest
+from tvm_ffi_orcjit import ExecutionSession
+from utils import build_test_objects
+
+# ---------------------------------------------------------------------------
+# Setup
+# ---------------------------------------------------------------------------
+
+OBJ_DIR = build_test_objects()
+
+_KNOWN_SUBDIRS = [
+    "c",
+    "c-gcc",
+    "cc",
+    "cc-gcc",
+    "cc-gcc-pie",
+    "c-appleclang",
+    "cc-appleclang",
+    "c-msvc",
+    "c-clang-cl",
+]
+
+_PIE_VARIANT_MARKER = "-pie"
+
+
+def obj(name: str) -> str:
+    """Return path to a pre-built test object file, or skip if missing."""
+    path = OBJ_DIR / f"{name}.o"
+    if not path.exists():
+        pytest.skip(f"{path.name} not found (not built)")
+    return str(path)
+
+
+def _discover_c_variants() -> list[str]:
+    """Discover available C-only compiler variants."""
+    return [
+        s
+        for s in _KNOWN_SUBDIRS
+        if s.startswith("c")
+        and not s.startswith("cc")
+        and _PIE_VARIANT_MARKER not in s
+        and (OBJ_DIR / s / "test_funcs.o").exists()
+    ]
+
+
+def _discover_cpp_variants() -> list[str]:
+    """Discover available C++ compiler variants (for __dso_handle tests)."""
+    return [
+        s
+        for s in _KNOWN_SUBDIRS
+        if s.startswith("cc")
+        and _PIE_VARIANT_MARKER not in s
+        and (OBJ_DIR / s / "test_funcs.o").exists()
+    ]
+
+
+def _discover_gcc_cpp_variants() -> list[str]:
+    """Discover GCC C++ variants (emit R_X86_64_PC32 for __dso_handle)."""
+    return [v for v in _discover_cpp_variants() if "gcc" in v]
+
+
+def _discover_pie_cpp_variants() -> list[str]:
+    """Discover PIE C++ variants built with -fpie.
+
+    PIE objects force R_X86_64_PC32 (direct, ±2GB) for __dso_handle
+    instead of R_X86_64_GOTPCRELX (GOT-relative, unlimited range).
+    Used exclusively by the Delta32 overflow tests (test 7).
+    """
+    return [
+        s
+        for s in _KNOWN_SUBDIRS
+        if _PIE_VARIANT_MARKER in s and (OBJ_DIR / s / "test_funcs.o").exists()
+    ]
+
+
+_c_variants = _discover_c_variants()
+_cpp_variants = _discover_cpp_variants()
+_gcc_cpp_variants = _discover_gcc_cpp_variants()
+_pie_cpp_variants = _discover_pie_cpp_variants()
+_all_variants = _c_variants + _cpp_variants
+
+_is_linux = sys.platform == "linux"
+_is_x86_64 = platform.machine() in ("x86_64", "AMD64")
+
+# Arena test parameters
+_ARENA_SIZE = 16 * 1024 * 1024  # 16MB — small arena for testing
+_BLOCK_RADIUS = 256 * 1024 * 1024  # 256MB — safe for CI containers
+_DSO_BLOCK_RADIUS = 3 * 1024 * 1024 * 1024  # 3GB -- needed to overflow PC32 (±2GB)
+
+_PROT_NONE = 0
+_MAP_PRIVATE_ANON = 0x22  # MAP_PRIVATE | MAP_ANONYMOUS
+_MAP_FIXED_NOREPLACE = 0x100000
+
+
+# ---------------------------------------------------------------------------
+# VA blocker — fills nearby free VA gaps to force distant mmap placement
+# ---------------------------------------------------------------------------
+
+
+@functools.lru_cache(maxsize=1)
+def _get_libc() -> ctypes.CDLL:
+    """Get a ctypes handle to libc with correct mmap/munmap signatures."""
+    libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)
+    libc.mmap.restype = ctypes.c_void_p
+    libc.mmap.argtypes = [
+        ctypes.c_void_p,
+        ctypes.c_size_t,
+        ctypes.c_int,
+        ctypes.c_int,
+        ctypes.c_int,
+        ctypes.c_long,
+    ]
+    libc.munmap.restype = ctypes.c_int
+    libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
+    return libc
+
+
+def _parse_maps() -> list[tuple[int, int]]:
+    """Parse /proc/self/maps into sorted list of (start, end) tuples."""
+    regions = []
+    with Path("/proc/self/maps").open() as f:
+        for line in f:
+            addrs = line.split()[0].split("-")
+            regions.append((int(addrs[0], 16), int(addrs[1], 16)))
+    return sorted(regions)
+
+
+def _find_new_mappings(
+    before: set[tuple[int, int]], after: list[tuple[int, int]]
+) -> list[tuple[int, int]]:
+    """Find mappings present in *after* but not in *before*."""
+    return [(s, e) for s, e in after if (s, e) not in before]
+
+
+def block_nearby_va(center: int, radius: int = _BLOCK_RADIUS) -> list[tuple[int, int]]:
+    """Block all free VA gaps within *radius* of *center*.
+
+    Uses MAP_FIXED_NOREPLACE to place PROT_NONE mappings in every free gap
+    within [center - radius, center + radius].  This forces subsequent
+    mmap(NULL, ...) calls to land outside the blocked region.
+
+    Returns list of (addr, size) blockers to be freed later.
+    """
+    libc = _get_libc()
+    maps = _parse_maps()
+    blockers = []
+    low = max(center - radius, 0)
+    high = center + radius
+
+    for i in range(len(maps) - 1):
+        gap_start = maps[i][1]
+        gap_end = maps[i + 1][0]
+        if gap_end <= low or gap_start >= high or gap_end <= gap_start:
+            continue
+        block_start = max(gap_start, low)
+        block_end = min(gap_end, high)
+        block_size = block_end - block_start
+        if block_size <= 0:
+            continue
+        addr = libc.mmap(
+            block_start, block_size, _PROT_NONE, _MAP_PRIVATE_ANON | _MAP_FIXED_NOREPLACE, -1, 0
+        )
+        if addr != ctypes.c_void_p(-1).value and addr is not None:
+            blockers.append((addr, block_size))
+
+    return blockers
+
+
+def free_blockers(blockers: list[tuple[int, int]]) -> None:
+    """Free all VA blockers."""
+    libc = _get_libc()
+    for addr, size in blockers:
+        libc.munmap(addr, size)
+
+
+# ---------------------------------------------------------------------------
+# Test 1: Arena co-location — objects stay within arena range
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.parametrize("variant", _all_variants)
+def test_arena_colocation(variant: str) -> None:
+    """With arena, objects in separate libraries have close code addresses.
+
+    Uses a 16MB arena and inserts a 256MB VA blocker between object loads.
+    Without the arena, the blocker would push the second object far away.
+    With the arena, both objects land within the 16MB region.
+    """
+    maps_before = set(_parse_maps())
+
+    session = ExecutionSession(arena_size=_ARENA_SIZE)
+    lib1 = session.create_library("lib1")
+    lib1.add(obj(f"{variant}/test_addr"))
+    addr1 = lib1.get_function("code_address")()
+
+    # Find where LLVM placed the first allocation and block nearby VA
+    maps_after = _parse_maps()
+    new_maps = _find_new_mappings(maps_before, maps_after)
+    jit_center = max(s for s, e in new_maps) if new_maps else addr1
+
+    blockers = block_nearby_va(jit_center)
+    try:
+        lib2 = session.create_library("lib2")
+        lib2.add(obj(f"{variant}/test_addr"))
+        addr2 = lib2.get_function("code_address")()
+    finally:
+        free_blockers(blockers)
+
+    distance = abs(addr1 - addr2)
+    assert distance < _ARENA_SIZE, (
+        f"Objects should be within {_ARENA_SIZE} bytes, "
+        f"but distance is {distance} ({distance / (1024**2):.1f} MB)"
+    )
+
+
+# ---------------------------------------------------------------------------
+# Test 2: Arena effect — compare with-arena vs without-arena under VA pressure
+# ---------------------------------------------------------------------------
+
+
+def _measure_distance_under_pressure(
+    variant: str, arena_size: int, radius: int = _BLOCK_RADIUS
+) -> tuple[int | None, bool]:
+    """Load two objects under VA pressure and return (distance, overflowed).
+
+    Returns ``(distance_bytes, False)`` when both objects load successfully,
+    or ``(None, True)`` when the second load fails with a relocation overflow
+    (Page21 on AArch64, Delta32 on x86_64).
+    """
+    maps_before = set(_parse_maps())
+
+    session = ExecutionSession(arena_size=arena_size)
+    lib1 = session.create_library("lib1")
+    lib1.add(obj(f"{variant}/test_addr"))
+    addr1 = lib1.get_function("code_address")()
+
+    maps_after = _parse_maps()
+    new_maps = _find_new_mappings(maps_before, maps_after)
+    jit_center = max(s for s, e in new_maps) if new_maps else addr1
+
+    blockers = block_nearby_va(jit_center, radius=radius)
+    try:
+        lib2 = session.create_library("lib2")
+        try:
+            lib2.add(obj(f"{variant}/test_addr"))
+            addr2 = lib2.get_function("code_address")()
+        except Exception:
+            return None, True
+    finally:
+        free_blockers(blockers)
+
+    return abs(addr1 - addr2), False
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.parametrize("variant", _all_variants)
+def test_arena_keeps_objects_close(variant: str) -> None:
+    """Arena co-locates objects that would otherwise scatter or overflow.
+
+    Runs the same workload twice under identical VA pressure — once with
+    the arena and once without — and compares the outcomes:
+
+    - **With arena**: both objects must land within the arena size (16 MB).
+    - **Without arena**: the blocker should either cause a relocation
+      overflow (proving scatter beyond relocation range) or produce a
+      measurably larger distance.
+
+    The test proves the arena is responsible for co-location by showing a
+    strictly better outcome with it enabled.  If the VA blocker happens to
+    be ineffective (e.g., LLVM slab reuse on 64k-page kernels), the arena
+    assertion must still hold, and the test is then skipped because the
+    arena effect cannot be demonstrated on that kernel.
+    """
+    # Phase 1: with arena — must always succeed and be within arena range
+    arena_dist, arena_overflow = _measure_distance_under_pressure(variant, arena_size=_ARENA_SIZE)
+    assert not arena_overflow, "Arena session should not overflow"
+    assert arena_dist is not None
+    assert arena_dist < _ARENA_SIZE, (
+        f"With arena, objects should be within {_ARENA_SIZE} bytes, "
+        f"but distance is {arena_dist} ({arena_dist / (1024**2):.1f} MB)"
+    )
+
+    # Phase 2: without arena — expect scatter or overflow
+    no_arena_dist, no_arena_overflow = _measure_distance_under_pressure(variant, arena_size=-1)
+
+    if no_arena_overflow:
+        # Relocation overflow without arena proves the blocker forced
+        # scatter beyond relocation range — arena prevented this.
+        return
+
+    assert no_arena_dist is not None
+    if no_arena_dist > arena_dist:
+        # Without arena produced a larger distance — arena effect shown.
+        return
+
+    # Blocker was ineffective (both distances are small).  The arena
+    # assertion above already passed, which is the key property.  We
+    # cannot distinguish arena effect from lucky placement here.
+    pytest.skip(
+        f"VA blocker ineffective: arena={arena_dist / 1024:.0f} KB, "
+        f"no-arena={no_arena_dist / 1024:.0f} KB — "
+        f"cannot demonstrate arena effect on this kernel"
+    )
+
+
+# ---------------------------------------------------------------------------
+# Test 3: Hidden-symbol ADRP/PC32 relocation with arena + blocker
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.parametrize("variant", _all_variants)
+def test_arena_hidden_symbol_with_blocker(variant: str) -> None:
+    """Arena prevents hidden-visibility relocation overflow under VA pressure.
+
+    Loads two objects with hidden-visibility cross-references (ADRP+ADD
+    on AArch64, PC32 on x86_64) with a VA blocker between them.
+    Without arena, the blocker would push objects apart causing overflow.
+    With the arena, both objects are co-located and the call succeeds.
+    """
+    maps_before = set(_parse_maps())
+
+    session = ExecutionSession(arena_size=_ARENA_SIZE)
+    lib = session.create_library("hidden_test")
+
+    # Load helper and force materialization
+    lib.add(obj(f"{variant}/test_hidden_helper"))
+    assert lib.get_function("hidden_add")(1, 2) == 3
+
+    # Block nearby VA to force scatter
+    maps_after = _parse_maps()
+    new_maps = _find_new_mappings(maps_before, maps_after)
+    jit_center = max(s for s, e in new_maps) if new_maps else 0xFFFF00000000
+
+    blockers = block_nearby_va(jit_center)
+    try:
+        lib.add(obj(f"{variant}/test_hidden_caller"))
+        fn = lib.get_function("call_hidden_add")
+        assert fn(10, 20) == 30
+    finally:
+        free_blockers(blockers)
+
+
+# ---------------------------------------------------------------------------
+# Test 4: Large data section (simulated .nv_fatbin)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.parametrize("variant", _all_variants)
+def test_large_data_section(variant: str) -> None:
+    """Load object with a 4MB .nv_fatbin section — basic correctness.
+
+    The .nv_fatbin section is referenced only by absolute relocations,
+    so it can live anywhere.  This test verifies the object loads and
+    the function works.  The 4MB section fits in the arena.
+    """
+    session = ExecutionSession()
+    lib = session.create_library("fatbin")
+    lib.add(obj(f"{variant}/fake_fatbin"))
+    fn = lib.get_function("get_fatbin_size")
+    assert fn() == 4 * 1024 * 1024
+
+
+# ---------------------------------------------------------------------------
+# Test 5: Overflow section — .nv_fatbin lands outside the arena
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.parametrize("variant", _all_variants)
+def test_overflow_section_outside_arena(variant: str) -> None:
+    """Overflow sections (.nv_fatbin) are allocated outside the arena.
+
+    The arena memory manager detects sections named .nv_fatbin and
+    allocates them via a separate mmap() outside the arena.  This keeps
+    the arena compact for code + small rodata, reducing 2MB THP region
+    count and iTLB pressure.
+
+    Verification: get the fatbin data address and the arena VA range
+    from /proc/self/maps, then assert the fatbin address is NOT within
+    the arena region.
+    """
+    session = ExecutionSession(arena_size=_ARENA_SIZE)
+    lib = session.create_library("fatbin_overflow")
+    lib.add(obj(f"{variant}/fake_fatbin"))
+
+    # Verify the function still works correctly.
+    assert lib.get_function("get_fatbin_size")() == 4 * 1024 * 1024
+
+    # Get the actual address of the fatbin data in memory.
+    fatbin_addr = lib.get_function("get_fatbin_addr")()
+
+    # Find the arena mapping: a single large region matching the arena size.
+    # The arena is reserved as PROT_NONE and then committed in slabs, so
+    # look for the contiguous region that spans _ARENA_SIZE.
+    maps = _parse_maps()
+    arena_regions = [(s, e) for s, e in maps if (e - s) >= _ARENA_SIZE]
+
+    # The fatbin address must not fall within any arena-sized region.
+    for start, end in arena_regions:
+        assert not (start <= fatbin_addr < end), (
+            f"Fatbin data at {fatbin_addr:#x} should be OUTSIDE the arena "
+            f"[{start:#x}, {end:#x}) but landed inside"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Test 6: __dso_handle Delta32 overflow after leaked materialization
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _is_linux, reason="ELF/GCC-specific __dso_handle test")
+@pytest.mark.parametrize("variant", _cpp_variants)
+def test_dso_handle_relocation_after_failed_materialization(variant: str) -> None:
+    """__dso_handle resolves correctly after leaked JIT memory.
+
+    Mechanism
+    ---------
+    GCC C++ objects call ``__cxa_atexit(&destructor, &obj, __dso_handle)``
+    for static-storage-duration objects.  The ``__dso_handle`` symbol is
+    emitted as ``GLOBAL HIDDEN UND`` in each object file.  LLVM's
+    ``ELFNixPlatform`` defines it per JITDylib via a separate
+    ``DSOHandleMaterializationUnit`` — a self-referential pointer block
+    (``void *__dso_handle = &__dso_handle;``) allocated in its own
+    ``LinkGraph`` through ``ObjectLinkingLayer``.
+
+    When a prior ``lib.add()`` fails (e.g., duplicate symbol), LLVM's
+    ``InProcessMemoryMapper`` leaks the mmap'd slab for that failed
+    materialization.  If the process holds references to the old session
+    (e.g., pytest keeping ``sys.exc_info()`` tracebacks alive), the
+    leaked slabs accumulate and push subsequent ``mmap`` allocations to
+    higher addresses.
+
+    The arena prevents overflow because all allocations — both
+    ``__dso_handle``'s ``LinkGraph`` and the code ``LinkGraph`` — land
+    within the same contiguous pre-reserved VA region.
+
+    Without arena: may FAIL on x86_64 with GCC PIE objects after
+                   repeated leaked materializations push slabs >2 GB
+                   apart.
+    With arena:    PASSES (all allocations in same arena).
+    """
+    # Step 1: Trigger leaked materializations to consume low VA space.
+    leaked_sessions = []
+    for _ in range(3):
+        s0 = ExecutionSession()
+        lib0 = s0.create_library("warmup")
+        lib0.add(obj(f"{variant}/test_funcs"))
+        lib0.get_function("test_add")(10, 20)
+        try:
+            lib0.add(obj(f"{variant}/test_funcs_conflict"))
+        except Exception:
+            pass
+        leaked_sessions.append((s0, lib0))
+
+    # Step 2: Fresh session — cross-library resolution must still work.
+    session = ExecutionSession()
+    lib1 = session.create_library("lib1")
+    lib1.add(obj(f"{variant}/test_funcs"))
+    assert lib1.get_function("test_add")(10, 20) == 30
+
+    lib2 = session.create_library("lib2")
+    lib2.add(obj(f"{variant}/test_funcs_conflict"))
+    assert lib2.get_function("test_add")(10, 20) == 1030
+
+
+# ---------------------------------------------------------------------------
+# Test 7: __dso_handle Delta32 overflow; arena prevents it (x86_64 PIE)
+#
+# GCC -fpie objects use R_X86_64_PC32 (±2GB) for __dso_handle.
+# ELFNixPlatform's DSOHandleMaterializationUnit allocates __dso_handle
+# in a separate LinkGraph from the code.  Under VA pressure, these two
+# allocations can land >2GB apart, overflowing the Delta32 fixup.
+# The arena keeps them co-located within relocation range.
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.skipif(not _is_x86_64, reason="Delta32 overflow requires x86_64")
+@pytest.mark.skipif(not _pie_cpp_variants, reason="No GCC PIE C++ variants built")
+@pytest.mark.parametrize("variant", _pie_cpp_variants or ["skip"])
+def test_dso_handle_delta32_with_arena(variant: str) -> None:
+    """Arena prevents __dso_handle Delta32 overflow under VA pressure.
+
+    Root cause
+    ----------
+    GCC C++ objects built with ``-fpie`` emit ``R_X86_64_PC32`` (Delta32,
+    ±2 GB) relocations for ``__dso_handle`` because the symbol has hidden
+    visibility and ``-fpie`` prefers direct PC-relative over GOT-relative.
+    (With ``-fPIC``, GCC uses ``R_X86_64_GOTPCRELX`` which goes through
+    the GOT — always co-located with code, so no range issue.)
+
+    ``ELFNixPlatform`` defines ``__dso_handle`` per JITDylib in a separate
+    ``DSOHandleMaterializationUnit``.  This creates a tiny ``LinkGraph``
+    (a self-referential pointer: ``void *__dso_handle = &__dso_handle;``)
+    that is allocated through ``ObjectLinkingLayer`` independently of the
+    code ``LinkGraph`` from ``lib.add()``.  Both allocations go through
+    ``InProcessMemoryMapper`` → ``mmap(MAP_ANONYMOUS)``, whose placement
+    the kernel decides.
+
+    Test strategy
+    -------------
+    1. Create a session with arena enabled (16 MB).
+    2. Load PIE GCC objects into lib1 — this triggers materialization of
+       both ``__dso_handle`` (via ``DSOHandleMaterializationUnit``) and
+       the code (via ``lib.add``), all within the arena.
+    3. Block 3 GB of VA around the first allocation — without arena this
+       would force the next ``mmap`` to land >2 GB away.
+    4. Load a second PIE GCC object into lib2 — with arena, this still
+       lands within the 16 MB region.
+    5. Assert the function call succeeds — proves Delta32 is in range.
+
+    See ``test_dso_handle_delta32_overflow_without_arena`` for the
+    counterpart proving the overflow occurs without arena.
+    """
+    maps_before = set(_parse_maps())
+
+    session = ExecutionSession(arena_size=_ARENA_SIZE)
+    lib1 = session.create_library("lib1")
+    lib1.add(obj(f"{variant}/test_funcs"))
+    assert lib1.get_function("test_add")(10, 20) == 30
+
+    # Block 3GB of VA around the first allocation to force scatter
+    maps_after = _parse_maps()
+    new_maps = _find_new_mappings(maps_before, maps_after)
+    jit_center = max(s for s, e in new_maps) if new_maps else 0xFFFF00000000
+
+    blockers = block_nearby_va(jit_center, radius=_DSO_BLOCK_RADIUS)
+    try:
+        lib2 = session.create_library("lib2")
+        lib2.add(obj(f"{variant}/test_funcs_conflict"))
+        assert lib2.get_function("test_add")(10, 20) == 1030
+    finally:
+        free_blockers(blockers)
+
+
+@pytest.mark.skipif(not _is_linux, reason="Arena is Linux-only")
+@pytest.mark.skipif(not _is_x86_64, reason="Delta32 overflow requires x86_64")
+@pytest.mark.skipif(not _pie_cpp_variants, reason="No GCC PIE C++ variants built")
+@pytest.mark.parametrize("variant", _pie_cpp_variants or ["skip"])
+def test_dso_handle_delta32_overflow_without_arena(variant: str) -> None:
+    """Without arena, PIE __dso_handle PC32 overflows under VA pressure.
+
+    Same setup as ``test_dso_handle_delta32_with_arena`` but with arena
+    disabled (``arena_size=-1``).
+
+    The 3 GB VA blocker fills all free gaps within ±3 GB of the first
+    session's JIT allocations.  When lib2 is loaded, ``InProcessMemoryMapper``
+    calls ``mmap(MAP_ANONYMOUS)`` for a new slab, but the only free VA is
+    >3 GB away.  The code ``LinkGraph`` from ``lib2.add()`` lands in that
+    distant slab, while ``__dso_handle`` was already materialized with
+    lib1's ``DSOHandleMaterializationUnit`` in the original region.  The
+    ``R_X86_64_PC32`` fixup from code to ``__dso_handle`` now exceeds
+    ±2 GB → JITLink reports ``Delta32 fixup ... is out of range``.
+
+    The test accepts both outcomes:
+    - **Exception** (PC32 overflow): proves the arena is needed.
+    - **Success** (GOTPCRELX used): GCC chose GOT-relative despite
+      ``-fpie`` — no overflow possible, but the arena is still
+      beneficial for other relocation types.
+    """
+    maps_before = set(_parse_maps())
+
+    session = ExecutionSession(arena_size=-1)  # arena disabled
+    lib1 = session.create_library("lib1")
+    lib1.add(obj(f"{variant}/test_addr"))
+    lib1.get_function("code_address")()
+
+    maps_after = _parse_maps()
+    new_maps = _find_new_mappings(maps_before, maps_after)
+    jit_center = max(s for s, e in new_maps) if new_maps else 0xFFFF00000000
+
+    blockers = block_nearby_va(jit_center, radius=_DSO_BLOCK_RADIUS)
+    try:
+        lib2 = session.create_library("lib2")
+        try:
+            lib2.add(obj(f"{variant}/test_funcs_conflict"))
+            result = lib2.get_function("test_add")(10, 20)
+            # If we get here, GCC used GOTPCRELX — no overflow.
+            assert result == 1030
+        except Exception:
+            # R_X86_64_PC32 overflow as expected — proves arena is needed.
+            pass
+    finally:
+        free_blockers(blockers)
diff --git a/addons/tvm_ffi_orcjit/tests/utils.py b/addons/tvm_ffi_orcjit/tests/utils.py
index e1fbc96..2a90ff2 100644
--- a/addons/tvm_ffi_orcjit/tests/utils.py
+++ b/addons/tvm_ffi_orcjit/tests/utils.py
@@ -58,12 +58,20 @@ def _extra_cflags() -> list[str]:
     return []
 
 
+def _extra_cuda_cflags() -> list[str]:
+    machine = platform.machine()
+    if machine in ("aarch64", "arm64"):
+        return ["-Xcompiler", "-mno-outline-atomics"]
+    return []
+
+
 def _build_objects(
     src_dir: Path,
     out_dir: Path,
     *,
     ext_glob: str,
     extra_cflags: list[str],
+    extra_cuda_cflags: list[str] | None = None,
 ) -> None:
     """Compile all sources in *src_dir* to object files in *out_dir*."""
     out_dir.mkdir(parents=True, exist_ok=True)
@@ -78,6 +86,7 @@ def _build_objects(
             sources=[str(src)],
             output=f"{src.stem}.o",
             extra_cflags=extra_cflags,
+            extra_cuda_cflags=extra_cuda_cflags or [],
             build_directory=str(build_dir),
         )
         shutil.copy2(obj_path, dest)
@@ -183,6 +192,17 @@ def build_test_objects(out_dir: Path | None = None) -> Path:
                 c_outdir=out_dir / "c-gcc",
                 cc_outdir=out_dir / "cc-gcc",
             )
+            # PIE variant: -fpie forces R_X86_64_PC32 for hidden-visibility
+            # externals like __dso_handle (instead of GOTPCRELX with -fPIC).
+            # Used to reproduce __dso_handle Delta32 overflow on x86_64.
+            _build_variant(
+                "GCC (PIE)",
+                cc=None,
+                cxx="g++",
+                extra_cflags=[*extra, "-fpie"],
+                c_outdir=out_dir / "c-gcc-pie",
+                cc_outdir=out_dir / "cc-gcc-pie",
+            )
         if system == "Darwin" and Path("/usr/bin/clang").exists():
             _build_variant(
                 "Apple Clang",
@@ -227,6 +247,12 @@ def build_test_objects(out_dir: Path | None = None) -> Path:
 
     # CUDA (platform-independent, uses nvcc)
     if shutil.which("nvcc"):
-        _build_objects(SOURCES_CUDA, out_dir / "cuda", ext_glob="*.cu", extra_cflags=[])
+        _build_objects(
+            SOURCES_CUDA,
+            out_dir / "cuda",
+            ext_glob="*.cu",
+            extra_cflags=[],
+            extra_cuda_cflags=_extra_cuda_cflags(),
+        )
 
     return out_dir
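
The Delta32 tests in the diff above depend on one arithmetic fact: an `R_X86_64_PC32` fixup stores `S + A - P` in a signed 32-bit field, so the target must lie within roughly ±2 GB of the fixup site. The sketch below illustrates that range check only. The helper names, the example address, and the `-4` default addend (typical for RIP-relative operands) are assumptions made for illustration; they are not part of tvm-ffi or LLVM.

```python
# Hypothetical helpers for illustration; not part of tvm-ffi or LLVM.
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def pc32_delta(fixup_addr: int, target_addr: int, addend: int = -4) -> int:
    """Value an R_X86_64_PC32 fixup would store: S + A - P."""
    return target_addr + addend - fixup_addr


def fits_delta32(fixup_addr: int, target_addr: int, addend: int = -4) -> bool:
    """True if the PC-relative displacement fits in a signed 32-bit field."""
    return INT32_MIN <= pc32_delta(fixup_addr, target_addr, addend) <= INT32_MAX


MB = 1 << 20
GB = 1 << 30
code = 0x7F00_0000_0000  # assumed address of the fixup site

# Both allocations inside one 16 MB arena: the displacement is in range.
print(fits_delta32(code, code + 16 * MB - 8))
# Target slab pushed 3 GB away by a VA blocker: the displacement overflows.
print(fits_delta32(code, code + 3 * GB))
```

Under these assumptions the first call reports that the fixup fits and the second reports an overflow, which matches the two outcomes the with-arena and without-arena tests are written to exercise.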
