jerryshao opened a new issue, #10474:
URL: https://github.com/apache/gravitino/issues/10474
### What would you like to be improved?
#### Background
Gravitino uses `TreeLock` (implemented in
`core/src/main/java/org/apache/gravitino/lock/`) to ensure consistency and
atomicity of metadata operations. The lock follows a hierarchical read-write
locking strategy — it acquires read locks on all ancestor nodes and a
write/read lock on the target node itself. For example, renaming a table
`metalake.catalog.db.table1` acquires:
```
/ → readLock
/metalake → readLock
/metalake/catalog → readLock
/metalake/catalog/db → writeLock (parent of the renamed resource)
```
This design is managed by `LockManager`, which maintains an in-memory tree
of `TreeLockNode`s backed by Java's `ReadWriteLock`.
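The hierarchy above can be sketched in a few lines. This is a simplified stand-in, not the actual `LockManager`/`TreeLockNode` API; the class and method names are illustrative, and the real implementation also locks the root node and reference-counts tree nodes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified stand-in for the hierarchical TreeLock: read-lock every
// ancestor, write-lock the target. Names are illustrative; the real code
// lives in org.apache.gravitino.lock.LockManager.
public class TreeLockSketch {
  private final Map<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

  private ReentrantReadWriteLock lockFor(String path) {
    return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
  }

  private String pathOf(String[] parts, int upTo) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i <= upTo; i++) sb.append('/').append(parts[i]);
    return sb.toString();
  }

  /** Read-lock all ancestors, write-lock the target; back out on contention. */
  public boolean tryLockForWrite(String... parts) {
    for (int i = 0; i < parts.length; i++) {
      ReentrantReadWriteLock l = lockFor(pathOf(parts, i));
      boolean ok = (i == parts.length - 1) ? l.writeLock().tryLock()
                                           : l.readLock().tryLock();
      if (!ok) {                 // contention: release everything taken so far
        unlock(parts, i - 1, false);
        return false;
      }
    }
    return true;
  }

  public void unlockForWrite(String... parts) {
    unlock(parts, parts.length - 1, true);
  }

  private void unlock(String[] parts, int deepest, boolean targetIsWrite) {
    for (int i = deepest; i >= 0; i--) {   // release in reverse order
      ReentrantReadWriteLock l = lockFor(pathOf(parts, i));
      if (targetIsWrite && i == parts.length - 1) l.writeLock().unlock();
      else l.readLock().unlock();
    }
  }
}
```

In this model, sibling resources (`.../db/table1` vs `.../db/table2`) proceed concurrently because ancestors are only read-locked, while a write on `db` itself blocks all descendants.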
#### Problem
`TreeLock` is **entirely in-process**. Each Gravitino server instance has
its own `LockManager` with its own isolated lock tree in JVM memory. This
fundamentally breaks down in **High Availability (HA) mode**, where multiple
Gravitino server instances operate behind a load balancer:
1. **No cross-process coordination**: A write lock acquired on server A has
zero visibility to server B. Two servers can concurrently modify the same
metadata resource (e.g., both renaming the same table), causing data races and
inconsistency.
2. **Race conditions on shared storage**: The backend metadata store
(relational database) is shared across all nodes, but the locking layer
protecting it is per-node, so conflicting writes can corrupt metadata state.
3. **Deadlock detection is per-JVM**: The built-in deadlock checker in
`LockManager` only inspects threads within the local JVM; cross-node lock
contention is entirely invisible.
#### Scope of Impact
`TreeLockUtils.doWithTreeLock(...)` is called by the majority of the metadata
managers:
- `CatalogManager`, `MetalakeManager`
- `TableOperationDispatcher`, `SchemaOperationDispatcher`,
`FilesetOperationDispatcher`, `TopicOperationDispatcher`,
`ViewOperationDispatcher`, `ModelOperationDispatcher`,
`PartitionOperationDispatcher`
- `TagManager`, `PolicyManager`, `StatisticManager`, `AccessControlManager`,
`PermissionManager`, `OwnerManager`
Any metadata operation in these managers is potentially unsafe under
concurrent HA traffic.
---
### How should we improve?
Two broad approaches are worth evaluating. Each has significant trade-offs:
**Option A — Distributed Lock**
Replace or wrap `TreeLock` with a distributed locking mechanism (e.g.,
ZooKeeper, etcd, or Redis-based). This preserves the existing hierarchical lock
semantics and requires minimal changes to call sites.
- Pros: Semantically equivalent to current model; relatively bounded change
surface.
- Cons: Introduces an external dependency; adds network round-trips on every
metadata operation; requires handling distributed lock lease expiry, fencing
tokens, and failure scenarios; potential new SPOF.
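A lease-based variant of Option A might look roughly like the sketch below. A `ConcurrentHashMap` stands in for the external coordination service (ZooKeeper, etcd, or Redis); in a real deployment each acquire/release is a network round-trip, and the fencing token would have to be validated by the storage layer on every write. All class and method names here are illustrative, not an existing Gravitino or client-library API:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a lease-based distributed lock with fencing tokens.
// The map stands in for the shared external lock service.
public class DistributedLockSketch {
  public static final class Lease {
    public final String path;
    public final long fencingToken;    // monotonically increasing
    public final long expiresAtMillis; // lease expiry
    Lease(String path, long token, long expires) {
      this.path = path;
      this.fencingToken = token;
      this.expiresAtMillis = expires;
    }
  }

  private final ConcurrentHashMap<String, Lease> held = new ConcurrentHashMap<>();
  private final AtomicLong tokens = new AtomicLong();

  /** Grant a lease on {@code path} if it is free or the old lease expired. */
  public Lease tryAcquire(String path, long ttlMillis) {
    long now = System.currentTimeMillis();
    Lease candidate = new Lease(path, tokens.incrementAndGet(), now + ttlMillis);
    // Atomic per-key decision: keep the current unexpired lease, else install ours.
    Lease winner = held.compute(path, (p, cur) ->
        (cur == null || cur.expiresAtMillis <= now) ? candidate : cur);
    return winner == candidate ? candidate : null;
  }

  /** Release only if we still hold this exact lease (an expired holder
   *  must not release a lock someone else has since re-acquired). */
  public void release(Lease lease) {
    held.remove(lease.path, lease);
  }
}
```

The fencing token is the important part: if a holder's lease expires mid-operation, the storage layer can reject its late writes by comparing tokens, which a plain "who holds the lock" check cannot do.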
**Option B — Remove TreeLock; rely on storage-level consistency**
Eliminate `TreeLock` entirely and push concurrency control into the storage
layer using optimistic locking (version-based CAS), database-level row locks,
or serializable transactions.
- Pros: No external dependency; naturally correct in HA because all nodes
share the same storage.
- Cons: Significant refactoring across all managers; requires retry logic
for conflicts; may increase database load; harder to express coarse-grained
parent-level write locks (e.g., "lock the whole schema while renaming a table").
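For Option B, the critical path is a version-based compare-and-set plus a retry loop. The sketch below uses an in-memory map as a stand-in for the shared relational store; a real implementation would instead issue something like `UPDATE metadata SET payload = ?, version = version + 1 WHERE id = ? AND version = ?` and treat an update count of 0 as a conflict. Class and method names are hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Sketch of Option B: optimistic, version-based CAS at the storage layer.
// The map stands in for the shared relational backend.
public class OptimisticStoreSketch {
  public static final class Versioned {
    public final String payload;
    public final long version;
    Versioned(String payload, long version) {
      this.payload = payload;
      this.version = version;
    }
  }

  private final ConcurrentHashMap<String, Versioned> store = new ConcurrentHashMap<>();

  public Versioned read(String id) {
    return store.get(id);
  }

  /** CAS write: succeeds only if the row still carries expectedVersion
   *  (version 0 means "expect the row not to exist yet"). */
  public boolean casWrite(String id, long expectedVersion, String newPayload) {
    if (expectedVersion == 0) {
      return store.putIfAbsent(id, new Versioned(newPayload, 1)) == null;
    }
    Versioned cur = store.get(id);
    if (cur == null || cur.version != expectedVersion) return false;
    return store.replace(id, cur, new Versioned(cur == null ? null : newPayload, cur.version + 1));
  }

  /** Retry loop a dispatcher would run around each metadata mutation:
   *  read current state, compute the new state, CAS; on conflict, reload. */
  public void updateWithRetry(String id, UnaryOperator<String> mutation) {
    while (true) {
      Versioned cur = store.get(id);
      long expected = (cur == null) ? 0 : cur.version;
      String next = mutation.apply(cur == null ? null : cur.payload);
      if (casWrite(id, expected, next)) return;
      // conflict: another node won the race; retry with fresh state
    }
  }
}
```

Note how the coarse-grained "lock the whole schema" semantics mentioned above have no direct equivalent here; multi-row invariants would need serializable transactions or explicit parent-version bumps instead.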
We welcome any other solution proposals beyond the two options above, as
long as they address the HA consistency problem correctly and efficiently.
---
### Expected outcome
A concrete solution proposal covering:
1. Detailed analysis of the current `TreeLock` implementation and its
limitations in HA mode.
2. Proposed solution with:
- Design overview and key components/interfaces to change.
- Analysis of correctness (does it fully eliminate the race conditions?).
- Analysis of performance impact.
- Migration strategy (how to transition from the current design without
breaking existing behavior).
3. Optionally, a proof-of-concept implementation or pseudocode for the
critical path.