[
https://issues.apache.org/jira/browse/HBASE-29975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on HBASE-29975 started by Andrew Kyle Purtell.
---------------------------------------------------
> TLA+ specification of the AssignmentManager
> -------------------------------------------
>
> Key: HBASE-29975
> URL: https://issues.apache.org/jira/browse/HBASE-29975
> Project: HBase
> Issue Type: Task
> Components: master, proc-v2, Region Assignment
> Reporter: Andrew Kyle Purtell
> Assignee: Andrew Kyle Purtell
> Priority: Major
>
> TLA+ is a formal specification language for designing and verifying
> concurrent and distributed systems. The name stands for the "Temporal Logic
> of Actions", a mathematical framework that combines first-order logic with
> temporal operators to reason about how system state evolves over time. When
> the model is a high-fidelity representation of the real system, proposed
> design and architectural changes can be checked against the full space of
> possible executions, surfacing logic bugs at design time, before any code is
> written. This can save weeks or months of development effort that would
> otherwise be spent discovering and debugging subtle concurrency issues in a
> running system.
> A TLA+ specification describes a system as a state machine, an initial state
> predicate (Init), a next-state relation (Next) that defines every legal
> transition, and a collection of invariants, properties that must hold in
> every reachable state. The TLC model checker then systematically explores
> every possible execution of this state machine, checking each property at
> every state. If a property is violated, TLC produces a minimal counterexample
> trace showing the exact sequence of steps that led to the failure.
> The HBase AssignmentManager is a core component of the HBase master that
> manages the lifecycle of regions across a cluster of RegionServers. It
> coordinates region assignment, unassignment, moves, and reopens, handles
> RegionServer crashes through the ServerCrashProcedure (SCP), and recovers its
> own state after a master crash through a durable procedure store. The
> correctness of these interactions is critical.
> This TLA+ specification models the AssignmentManager as a state machine
> capturing:
> * Region lifecycle
> * Asynchronous RPC channels: Master-to-RS commands and RS-to-master
> transition reports
> * Procedure state: State records (type, step, target server, retry count),
> with a durable procedure store that survives master crashes
> * Server liveness: Per-server online/crashed state, ZooKeeper ephemeral
> nodes, and WAL fencing state
> * Crash recovery: Multi-step ServerCrashProcedure
> * PEWorker thread pool
> The specification defines safety invariants verified at every reachable
> state, including NoDoubleAssignment (no region writable on two servers),
> MetaConsistency (persistent and in-memory state agree), FencingOrder (WALs
> fenced before reassignment), NoLostRegions (no region stuck without a
> procedure after crash recovery), and NoPEWorkerDeadlock (thread pool
> exhaustion detection). The liveness property (MetaEventuallyAssigned)
> verifies that hbase:meta is eventually reassigned after a crash. Action
> constraints enforce transition validity and SCP monotonicity.
> The model is under active development and can found at:
> https://github.com/apurtell/hbase/tree/WORK-architecture/src/main/spec
--
This message was sent by Atlassian Jira
(v8.20.10#820010)