jerryshao opened a new issue, #10474:
URL: https://github.com/apache/gravitino/issues/10474
### What would you like to be improved?
#### Background
Gravitino uses `TreeLock` (implemented in
`core/src/main/java/org/apache/gravitino/lock/`) to ensure consistency and
atomicity of metadata operations. The lock follows a hierarchical read-write
locking strategy — it acquires read locks on all ancestor nodes and a
write/read lock on the target node itself. For example, renaming a table
`metalake.catalog.db.table1` acquires:
```
/ → readLock
/metalake → readLock
/metalake/catalog → readLock
/metalake/catalog/db → writeLock (parent of the renamed resource)
```
This design is managed by `LockManager`, which maintains an in-memory tree
of `TreeLockNode`s backed by Java's `ReadWriteLock`.
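The hierarchy above can be sketched in a few lines. This is a simplified stand-in, not the actual `LockManager`/`TreeLockNode` API; the class and method names are illustrative, and the real implementation also locks the root node and reference-counts tree nodes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified stand-in for the hierarchical TreeLock: read-lock every
// ancestor, write-lock the target. Names are illustrative; the real code
// lives in org.apache.gravitino.lock.LockManager.
public class TreeLockSketch {
  private final Map<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

  private ReentrantReadWriteLock lockFor(String path) {
    return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
  }

  private String pathOf(String[] parts, int upTo) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i <= upTo; i++) sb.append('/').append(parts[i]);
    return sb.toString();
  }

  /** Read-lock all ancestors, write-lock the target; back out on contention. */
  public boolean tryLockForWrite(String... parts) {
    for (int i = 0; i < parts.length; i++) {
      ReentrantReadWriteLock l = lockFor(pathOf(parts, i));
      boolean ok = (i == parts.length - 1) ? l.writeLock().tryLock()
                                           : l.readLock().tryLock();
      if (!ok) {                 // contention: release everything taken so far
        unlock(parts, i - 1, false);
        return false;
      }
    }
    return true;
  }

  public void unlockForWrite(String... parts) {
    unlock(parts, parts.length - 1, true);
  }

  private void unlock(String[] parts, int deepest, boolean targetIsWrite) {
    for (int i = deepest; i >= 0; i--) {   // release in reverse order
      ReentrantReadWriteLock l = lockFor(pathOf(parts, i));
      if (targetIsWrite && i == parts.length - 1) l.writeLock().unlock();
      else l.readLock().unlock();
    }
  }
}
```

In this model, sibling resources (`.../db/table1` vs `.../db/table2`) proceed concurrently because ancestors are only read-locked, while a write on `db` itself blocks all descendants.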
#### Problem
`TreeLock` is **entirely in-process**. Each Gravitino server instance has
its own `LockManager` with its own isolated lock tree in JVM memory. This
fundamentally breaks down in **High Availability (HA) mode**, where multiple
Gravitino server instances operate behind a load balancer:
1. **No cross-process coordination**: A write lock acquired on server A has
zero visibility to server B. Two servers can concurrently modify the same
metadata resource (e.g., both renaming the same table), causing data races and
inconsistency.
2. **Race conditions on shared storage**: The backend metadata store
(relational database) is shared across all nodes, but the locking layer
protecting it is per-node, so conflicting writes can corrupt metadata state.
3. **Deadlock detection is per-JVM**: The built-in deadlock checker in
`LockManager` only inspects threads within the local JVM; cross-node lock
contention is entirely invisible.
#### Scope of Impact
`TreeLockUtils.doWithTreeLock(...)` is called by the majority of the metadata
managers:
- `CatalogManager`, `MetalakeManager`
- `TableOperationDispatcher`, `SchemaOperationDispatcher`,
`FilesetOperationDispatcher`, `TopicOperationDispatcher`,
`ViewOperationDispatcher`, `ModelOperationDispatcher`,
`PartitionOperationDispatcher`
- `TagManager`, `PolicyManager`, `StatisticManager`, `AccessControlManager`,
`PermissionManager`, `OwnerManager`
Any metadata operation in these managers is potentially unsafe under
concurrent HA traffic.
---
### How should we improve?
Two broad approaches are worth evaluating. Each has significant trade-offs:
**Option A — Distributed Lock**
Replace or wrap `TreeLock` with a distributed locking mechanism (e.g.,
ZooKeeper, etcd, or Redis-based). This preserves the existing hierarchical lock
semantics and requires minimal changes to call sites.
- Pros: Semantically equivalent to current model; relatively bounded change
surface.
- Cons: Introduces an external dependency; adds network round-trips on every
metadata operation; requires handling distributed lock lease expiry, fencing
tokens, and failure scenarios; potential new SPOF.
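A lease-based variant of Option A might look roughly like the sketch below. A `ConcurrentHashMap` stands in for the external coordination service (ZooKeeper, etcd, or Redis); in a real deployment each acquire/release is a network round-trip, and the fencing token would have to be validated by the storage layer on every write. All class and method names here are illustrative, not an existing Gravitino or client-library API:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a lease-based distributed lock with fencing tokens.
// The map stands in for the shared external lock service.
public class DistributedLockSketch {
  public static final class Lease {
    public final String path;
    public final long fencingToken;    // monotonically increasing
    public final long expiresAtMillis; // lease expiry
    Lease(String path, long token, long expires) {
      this.path = path;
      this.fencingToken = token;
      this.expiresAtMillis = expires;
    }
  }

  private final ConcurrentHashMap<String, Lease> held = new ConcurrentHashMap<>();
  private final AtomicLong tokens = new AtomicLong();

  /** Grant a lease on {@code path} if it is free or the old lease expired. */
  public Lease tryAcquire(String path, long ttlMillis) {
    long now = System.currentTimeMillis();
    Lease candidate = new Lease(path, tokens.incrementAndGet(), now + ttlMillis);
    // Atomic per-key decision: keep the current unexpired lease, else install ours.
    Lease winner = held.compute(path, (p, cur) ->
        (cur == null || cur.expiresAtMillis <= now) ? candidate : cur);
    return winner == candidate ? candidate : null;
  }

  /** Release only if we still hold this exact lease (an expired holder
   *  must not release a lock someone else has since re-acquired). */
  public void release(Lease lease) {
    held.remove(lease.path, lease);
  }
}
```

The fencing token is the important part: if a holder's lease expires mid-operation, the storage layer can reject its late writes by comparing tokens, which a plain "who holds the lock" check cannot do.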
**Option B — Remove TreeLock; rely on storage-level consistency**
Eliminate `TreeLock` entirely and push concurrency control into the storage
layer using optimistic locking (version-based CAS), database-level row locks,
or serializable transactions.
- Pros: No external dependency; naturally correct in HA because all nodes
share the same storage.
- Cons: Significant refactoring across all managers; requires retry logic
for conflicts; may increase database load; harder to express coarse-grained
parent-level write locks (e.g., "lock the whole schema while renaming a table").
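For Option B, the critical path is a version-based compare-and-set plus a retry loop. The sketch below uses an in-memory map as a stand-in for the shared relational store; a real implementation would instead issue something like `UPDATE metadata SET payload = ?, version = version + 1 WHERE id = ? AND version = ?` and treat an update count of 0 as a conflict. Class and method names are hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Sketch of Option B: optimistic, version-based CAS at the storage layer.
// The map stands in for the shared relational backend.
public class OptimisticStoreSketch {
  public static final class Versioned {
    public final String payload;
    public final long version;
    Versioned(String payload, long version) {
      this.payload = payload;
      this.version = version;
    }
  }

  private final ConcurrentHashMap<String, Versioned> store = new ConcurrentHashMap<>();

  public Versioned read(String id) {
    return store.get(id);
  }

  /** CAS write: succeeds only if the row still carries expectedVersion
   *  (version 0 means "expect the row not to exist yet"). */
  public boolean casWrite(String id, long expectedVersion, String newPayload) {
    if (expectedVersion == 0) {
      return store.putIfAbsent(id, new Versioned(newPayload, 1)) == null;
    }
    Versioned cur = store.get(id);
    if (cur == null || cur.version != expectedVersion) return false;
    return store.replace(id, cur, new Versioned(cur == null ? null : newPayload, cur.version + 1));
  }

  /** Retry loop a dispatcher would run around each metadata mutation:
   *  read current state, compute the new state, CAS; on conflict, reload. */
  public void updateWithRetry(String id, UnaryOperator<String> mutation) {
    while (true) {
      Versioned cur = store.get(id);
      long expected = (cur == null) ? 0 : cur.version;
      String next = mutation.apply(cur == null ? null : cur.payload);
      if (casWrite(id, expected, next)) return;
      // conflict: another node won the race; retry with fresh state
    }
  }
}
```

Note how the coarse-grained "lock the whole schema" semantics mentioned above have no direct equivalent here; multi-row invariants would need serializable transactions or explicit parent-version bumps instead.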
We welcome any other solution proposals beyond the two options above, as
long as they address the HA consistency problem correctly and efficiently.
---
### Expected outcome
A concrete solution proposal covering:
1. Detailed analysis of the current `TreeLock` implementation and its
limitations in HA mode.
2. Proposed solution with:
- Design overview and key components/interfaces to change.
- Analysis of correctness (does it fully eliminate the race conditions?).
- Analysis of performance impact.
- Migration strategy (how to transition from the current design without
breaking existing behavior).
3. Optionally, a proof-of-concept implementation or pseudocode for the
critical path.