+1 (binding)


On Sun, 8 Feb 2026 at 17:17, Tao Jiuming <[email protected]> wrote:

> +1 nonbinding
>
> Matteo Merli <[email protected]> wrote on Fri, 6 Feb 2026 at 01:43:
>
> > PIP PR: https://github.com/apache/pulsar/pull/25196
> >
> > PR with implementation: https://github.com/apache/pulsar/pull/25219
> >
> > ----
> >
> > # PIP-454: Metadata Store Migration Framework
> >
> > ## Motivation
> >
> > Apache Pulsar currently uses Apache ZooKeeper as its metadata store
> > for broker coordination, topic metadata, namespace policies, and
> > BookKeeper ledger management. While ZooKeeper has served well, there
> > are several motivations for enabling migration to alternative metadata
> > stores:
> >
> > 1. **Operational Simplicity**: Alternative metadata stores like Oxia
> > may offer simpler operations, better observability, or reduced
> > operational overhead compared to ZooKeeper ensembles.
> >
> > 2. **Performance Characteristics**: Different metadata stores have
> > different performance profiles. Some workloads may benefit from stores
> > optimized for high throughput or low latency.
> >
> > 3. **Deployment Flexibility**: Organizations may prefer metadata
> > stores that align better with their existing infrastructure and
> > expertise.
> >
> > 4. **Zero-Downtime Migration**: Operators need a safe, automated way
> > to migrate metadata between stores without service interruption.
> >
> > Currently, there is no supported path to migrate from one metadata
> > store to another without cluster downtime. This PIP proposes a **safe,
> > simple migration framework** that ensures metadata consistency by
> > avoiding complex dual-write/dual-read patterns. The framework enables:
> >
> > - **Zero-downtime migration** from any metadata store to any other
> > supported store
> > - **Automatic ephemeral node recreation** in the target store
> > - **Version preservation** to ensure conditional writes continue working
> > - **Automatic failure recovery** if issues are detected
> > - **Minimal configuration changes** - no config updates needed until
> > after migration completes
> >
> > ## Goal
> >
> > Provide a safe, automated framework for migrating Apache Pulsar's
> > metadata from one store implementation (e.g., ZooKeeper) to another
> > (e.g., Oxia) with zero service interruption.
> >
> > ### In Scope
> >
> > - Migration framework supporting any source → any target metadata store
> > - Automatic ephemeral node recreation by brokers and bookies
> > - Persistent data copy with version preservation
> > - CLI commands for migration control and monitoring
> > - Automatic failure recovery during migration
> > - Support for broker and bookie participation
> > - Read-only mode during migration for consistency
> >
> > ### Out of Scope
> >
> > - Developing new metadata store implementations (Oxia, Etcd support
> > already exists)
> > - Cross-cluster metadata synchronization (different use case)
> > - Automated rollback after COMPLETED phase (requires manual intervention)
> > - Migration of configuration metadata store and geo-replicated
> > clusters (can be done separately)
> >
> > ## High Level Design
> >
> > The migration framework introduces a **DualMetadataStore** wrapper
> > that transparently handles migration without modifying existing
> > metadata store implementations.
> >
> > ### Key Principles
> >
> > 1. **Transparent Wrapping**: The `DualMetadataStore` wraps the
> > existing source store (e.g., `ZKMetadataStore`) without modifying its
> > implementation.
> >
> > 2. **Lazy Target Initialization**: The target store is only
> > initialized when migration begins, triggered by a flag in the source
> > store.
> >
> > 3. **Ephemeral-First Approach**: Before copying persistent data, all
> > brokers and bookies recreate their ephemeral nodes in the target
> > store. This ensures the cluster is "live" in both stores during
> > migration.
> >
> > 4. **Read-Only Mode During Migration**: To ensure consistency, all
> > metadata writes are blocked during PREPARATION and COPYING phases.
> > Components receive `SessionLost` events to defer non-critical
> > operations (e.g., ledger rollovers).
> >
> > 5. **Phase-Based Migration**: Migration proceeds through well-defined
> > phases (PREPARATION → COPYING → COMPLETED).
> >
> > 6. **Generic Framework**: The framework is agnostic to specific store
> > implementations - it works with any source and target that implement
> > the `MetadataStore` interface.
> >
> > 7. **Guaranteed Consistency**: By blocking writes during migration and
> > using atomic copy, metadata is **always in a consistent state**. No
> > dual-write complexity, no data divergence, no consistency issues.
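> >
> > As an illustration of principles 1, 2, 4, and 7, the wrapper's routing
> > can be sketched as follows. This is a minimal Python sketch, not the
> > actual Java implementation in the PR; class and method names here are
> > hypothetical.
> >
> > ```python
> > READ_ONLY_PHASES = {"PREPARATION", "COPYING"}
> >
> > class DualMetadataStore:
> >     """Illustrative wrapper: routes operations to source or target
> >     depending on the current migration phase."""
> >
> >     def __init__(self, source, target_factory):
> >         self.source = source
> >         self.target_factory = target_factory  # lazy: built when migration starts
> >         self.target = None
> >         self.phase = "NOT_STARTED"
> >
> >     def on_phase_change(self, phase, target_url=None):
> >         self.phase = phase
> >         if phase == "PREPARATION" and self.target is None:
> >             self.target = self.target_factory(target_url)  # lazy target init
> >         elif phase == "FAILED":
> >             self.target = None  # discard target connection, fall back to source
> >
> >     def _active(self):
> >         # Single source of truth: target becomes active only after COMPLETED.
> >         return self.target if self.phase == "COMPLETED" else self.source
> >
> >     def get(self, path):
> >         return self._active().get(path)
> >
> >     def put(self, path, value):
> >         if self.phase in READ_ONLY_PHASES:
> >             raise RuntimeError("metadata writes are blocked during migration")
> >         self._active().put(path, value)
> > ```
> >
> > During PREPARATION and COPYING the wrapper keeps serving reads from the
> > source and rejects writes; only after COMPLETED does it route both to
> > the target.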
> >
> > ## Detailed Design
> >
> > ### Migration Phases
> >
> > ```
> > NOT_STARTED
> >      ↓
> > PREPARATION ← All brokers/bookies recreate ephemeral nodes in target
> >              ← Metadata writes are BLOCKED (read-only mode)
> >      ↓
> > COPYING ← Coordinator copies persistent data source → target
> >          ← Metadata writes still BLOCKED
> >      ↓
> > COMPLETED ← Migration complete, all services using target store
> >           ← Metadata writes ENABLED on target
> >      ↓
> > After validation period:
> >  * Update config and restart brokers & bookies
> >  * Decommission source store
> >
> > (If errors occur):
> > FAILED ← Rollback to source store, writes ENABLED
> > ```
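> >
> > The diagram above can be captured as a small state machine. The sketch
> > below is illustrative Python, not part of the implementation; the
> > transition out of FAILED back to PREPARATION models the retry described
> > in the failure-handling section.
> >
> > ```python
> > from enum import Enum
> >
> > class Phase(Enum):
> >     NOT_STARTED = 0
> >     PREPARATION = 1
> >     COPYING = 2
> >     COMPLETED = 3
> >     FAILED = 4
> >
> > # Legal transitions per the diagram; FAILED is reachable from any
> > # active phase, and a retry after FAILED starts a fresh PREPARATION.
> > ALLOWED = {
> >     Phase.NOT_STARTED: {Phase.PREPARATION},
> >     Phase.PREPARATION: {Phase.COPYING, Phase.FAILED},
> >     Phase.COPYING: {Phase.COMPLETED, Phase.FAILED},
> >     Phase.COMPLETED: set(),  # terminal: post-COMPLETED rollback is manual
> >     Phase.FAILED: {Phase.PREPARATION},
> > }
> >
> > def advance(current: Phase, target: Phase) -> Phase:
> >     """Reject transitions the state machine does not permit."""
> >     if target not in ALLOWED[current]:
> >         raise ValueError(f"illegal transition: {current.name} -> {target.name}")
> >     return target
> > ```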
> >
> > ### Phase 1: NOT_STARTED → PREPARATION
> >
> > **Participant Registration (at startup):**
> > Each broker and bookie registers itself as a migration participant by
> > creating a sequential ephemeral node:
> > - Path: `/pulsar/migration-coordinator/participants/id-NNNN` (sequential)
> > - This allows the coordinator to know how many participants exist
> > before migration starts
> >
> > **Administrator triggers migration:**
> > ```bash
> > pulsar-admin metadata-migration start --target oxia://oxia1:6648
> > ```
> >
> > **Coordinator actions:**
> > 1. Creates migration flag in source store:
> > `/pulsar/migration-coordinator/migration`
> >    ```json
> >    {
> >      "phase": "PREPARATION",
> >      "targetUrl": "oxia://oxia1:6648"
> >    }
> >    ```
> >
> > **Broker/Bookie actions (automatic, triggered by watching the flag):**
> > 1. Detect migration flag via watch on
> > `/pulsar/migration-coordinator/migration`
> > 2. Defer non-critical metadata writes (e.g., ledger rollovers, bundle
> > ownership changes)
> > 3. Initialize connection to target store
> > 4. Recreate ALL ephemeral nodes in target store
> > 5. **Delete** participant registration node to signal "ready"
> >
> > **Coordinator waits for all participant nodes to be deleted
> > (indicating all participants are ready)**
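> >
> > The register/signal-ready handshake can be sketched with an in-memory
> > stand-in. In the real design the registry is the set of sequential
> > ephemeral nodes under `/pulsar/migration-coordinator/participants/` and
> > the coordinator waits via store watches; the condition variable below is
> > only a local analogue.
> >
> > ```python
> > import threading
> >
> > class ParticipantRegistry:
> >     """In-memory stand-in for the participants/ subtree (illustrative)."""
> >
> >     def __init__(self):
> >         self._nodes = set()
> >         self._seq = 0
> >         self._cond = threading.Condition()
> >
> >     def register(self) -> str:
> >         # Broker/bookie startup: create a sequential "ephemeral" node.
> >         with self._cond:
> >             self._seq += 1
> >             node = f"id-{self._seq:04d}"
> >             self._nodes.add(node)
> >             return node
> >
> >     def signal_ready(self, node: str) -> None:
> >         # Participant deletes its node once all of its ephemeral nodes
> >         # have been recreated in the target store.
> >         with self._cond:
> >             self._nodes.discard(node)
> >             self._cond.notify_all()
> >
> >     def await_all_ready(self, timeout=None) -> bool:
> >         # Coordinator: block until every participant has signalled ready.
> >         with self._cond:
> >             return self._cond.wait_for(lambda: not self._nodes, timeout)
> > ```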
> >
> > ### Phase 2: PREPARATION → COPYING
> >
> > **Coordinator actions:**
> > 1. Updates phase to `COPYING`
> > 2. Performs recursive copy of persistent data from source → target:
> >    - Skips ephemeral nodes (already recreated)
> >    - Concurrent operations limited by semaphore (default: 1000 pending ops)
> >    - Breadth-first traversal to process all paths
> >    - Progress logged periodically
> >
> > **During this phase:**
> > - Metadata writes are BLOCKED (return an error to clients)
> > - Metadata reads continue normally from the source store
> > - Ephemeral nodes remain alive in both stores
> > - **Data plane operations unaffected**: publish/consume and ledger
> > writes continue normally
> > - Version-id and modification count are preserved using the direct
> > Oxia client
> > - Breadth-first traversal with at most 1000 concurrent operations
> >
> > **Estimated duration:**
> > - **< 30 seconds** for typical deployments with up to **500 MB of
> > metadata** in ZooKeeper
> >
> > **Impact on operations:**
> > - ✅ Existing topics: Publish and consume continue without interruption
> > - ✅ BookKeeper: Ledger writes and reads continue normally
> > - ✅ Clients: Connected producers and consumers unaffected
> > - ❌ Admin operations: Topic/namespace creation blocked temporarily
> > - ❌ Bundle operations: Load balancing deferred until completion
> >
> > ### Phase 3: COPYING → COMPLETED
> >
> > **Coordinator actions:**
> > 1. Updates phase to `COMPLETED`
> > 2. Logs success message with total copied node count
> >
> > **Broker/Bookie actions (automatic, triggered by phase update):**
> > 1. Detect `COMPLETED` phase
> > 2. Deferred operations can now proceed
> > 3. Switch routing:
> >    - **Writes**: Go to target store only
> >    - **Reads**: Go to target store only
> >
> > **At this point:**
> > - Cluster is running on target store
> > - Source store remains available for safety
> > - Metadata writes are enabled again
> >
> > **Operator follow-up (after validation period):**
> > 1. Update configuration files:
> >    ```properties
> >    # Before (ZooKeeper):
> >    metadataStoreUrl=zk://zk1:2181,zk2:2181/pulsar
> >
> >    # After (Oxia):
> >    metadataStoreUrl=oxia://oxia1:6648
> >    ```
> > 2. Perform rolling restart with new config
> > 3. After all services restarted, decommission source store
> >
> > ### Failure Handling: ANY_PHASE → FAILED
> >
> > **If migration fails at any point:**
> > 1. Coordinator updates phase to `FAILED`
> > 2. Broker/Bookie actions:
> >    - Detect `FAILED` phase
> >    - Discard target store connection
> >    - Continue using source store
> >    - Metadata writes enabled again
> >
> > **Operator actions:**
> > 1. Review logs to understand failure cause
> > 2. Fix underlying issue
> > 3. Retry migration with `pulsar-admin metadata-migration start --target <url>`
> >
> > ## Implementation Details
> >
> > ### Key Implementation Details
> >
> > 1. **Direct Oxia Client Usage**: The coordinator uses
> > `AsyncOxiaClient` directly instead of going through `MetadataStore`
> > interface. This allows setting version-id and modification count to
> > match the source values, ensuring conditional writes (compare-and-set
> > operations) continue to work correctly after migration.
> >
> > 2. **Breadth-First Traversal**: Processes paths level by level using a
> > work queue, enabling high concurrency while preventing deep recursion.
> >
> > 3. **Concurrent Operations**: Uses a semaphore to limit pending
> > operations (default: 1000), balancing throughput with memory usage.
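> >
> > The breadth-first copy with semaphore-bounded concurrency can be
> > sketched as follows. This is an illustrative Python model with a toy
> > in-memory store, not the coordinator's actual Java code; the
> > `InMemoryStore` client and its methods are hypothetical stand-ins.
> >
> > ```python
> > import asyncio
> > from collections import deque
> >
> > class InMemoryStore:
> >     """Hypothetical stand-in for a metadata store client."""
> >
> >     def __init__(self, nodes=None):
> >         # path -> {"value": ..., "version": ..., "ephemeral": ...}
> >         self.nodes = dict(nodes or {})
> >
> >     async def get(self, path):
> >         return self.nodes.get(path)
> >
> >     async def children(self, path):
> >         prefix = path.rstrip("/") + "/"
> >         return [p for p in self.nodes
> >                 if p.startswith(prefix) and "/" not in p[len(prefix):]]
> >
> >     async def put(self, path, value, version):
> >         # The version is written as-is, preserving compare-and-set semantics.
> >         self.nodes[path] = {"value": value, "version": version,
> >                             "ephemeral": False}
> >
> > async def copy_tree(source, target, root="/pulsar", max_pending=1000):
> >     """Breadth-first copy of persistent nodes; ephemerals are skipped
> >     because their owners recreate them during PREPARATION."""
> >     sem = asyncio.Semaphore(max_pending)  # cap in-flight writes
> >     queue, pending, copied = deque([root]), [], 0
> >
> >     async def put_one(path, node):
> >         async with sem:
> >             await target.put(path, node["value"], node["version"])
> >
> >     while queue:
> >         path = queue.popleft()
> >         node = await source.get(path)
> >         if node is not None and not node["ephemeral"]:
> >             pending.append(asyncio.create_task(put_one(path, node)))
> >             copied += 1
> >         queue.extend(await source.children(path))
> >     await asyncio.gather(*pending)
> >     return copied
> > ```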
> >
> > ### Data Structures
> >
> > **Migration State** (`/pulsar/migration-coordinator/migration`):
> > ```json
> > {
> >   "phase": "PREPARATION",
> >   "targetUrl": "oxia://oxia1:6648/default"
> > }
> > ```
> >
> > Fields:
> > - `phase`: Current migration phase (NOT_STARTED, PREPARATION, COPYING,
> > COMPLETED, FAILED)
> > - `targetUrl`: Target metadata store URL (e.g.,
> > `oxia://oxia1:6648/default`)
> >
> > **Participant Registration**
> > (`/pulsar/migration-coordinator/participants/id-NNNN`):
> > - Sequential ephemeral node created by each broker/bookie at startup
> > - Empty data (presence indicates participation)
> > - Deleted by participant when preparation complete (signals "ready")
> > - Coordinator waits for all to be deleted before proceeding to COPYING
> > phase
> >
> > **No additional state tracking**: The simplified design removes
> > complex state tracking and checksums. Migration state is kept minimal.
> >
> > ### CLI Commands
> >
> > ```bash
> > # Start migration
> > pulsar-admin metadata-migration start --target <target-url>
> >
> > # Check status
> > pulsar-admin metadata-migration status
> > ```
> >
> > The simplified design requires only two commands. Rollback happens
> > automatically if migration fails (the phase transitions to FAILED).
> >
> > ### REST API
> >
> > ```
> > POST   /admin/v2/metadata/migration/start
> >        Body: { "targetUrl": "oxia://..." }
> >
> > GET    /admin/v2/metadata/migration/status
> >        Returns: { "phase": "COPYING", "targetUrl": "oxia://..." }
> > ```
> >
> > ## Safety Guarantees
> >
> > ### Why This Approach is Safe
> >
> > **The migration design guarantees metadata consistency by avoiding
> > dual-write and dual-read patterns entirely:**
> >
> > 1. **Single Source of Truth**: At any given time, there is exactly ONE
> > active metadata store:
> >    - Before migration: Source store (ZooKeeper)
> >    - During PREPARATION and COPYING: Source store (read-only)
> >    - After COMPLETED: Target store (Oxia)
> >
> > 2. **No Dual-Write Complexity**: Unlike approaches that write to both
> > stores simultaneously, this design eliminates:
> >    - Write synchronization issues
> >    - Conflict resolution between stores
> >    - Data divergence problems
> >    - Partial failure handling complexity
> >
> > 3. **No Dual-Read Complexity**: Unlike approaches that read from both
> > stores, this design eliminates:
> >    - Read consistency issues
> >    - Cache invalidation across stores
> >    - Stale data problems
> >    - Complex fallback logic
> >
> > 4. **Atomic Cutover**: All participants switch stores simultaneously
> > when COMPLETED phase is detected. There is no ambiguous state where
> > some participants use one store and others use another.
> >
> > 5. **Fast Migration Window**: With **< 30 seconds** for typical
> > metadata sizes (even up to 500 MB), the read-only window is minimal
> > and acceptable for most production environments.
> >
> > **Bottom line**: Metadata is **always in a consistent state** - either
> > fully in the source store or fully in the target store, never split or
> > diverged between them.
> >
> > ### Data Integrity
> >
> > 1. **Version Preservation**: All persistent data is copied with
> > original version-id and modification count preserved. This ensures
> > conditional writes (compare-and-set operations) continue working after
> > migration.
> >
> > 2. **Ephemeral Node Recreation**: All ephemeral nodes are recreated by
> > their owning brokers/bookies before persistent data copy begins.
> >
> > 3. **Read-Only Mode**: All metadata writes are blocked during
> > PREPARATION and COPYING phases, ensuring no data inconsistencies
> > during migration.
> >
> >    **Important**: Read-only mode only affects metadata operations.
> > Data plane operations continue normally:
> >    - ✅ **Publishing and consuming messages** works without interruption
> >    - ✅ **Reading from existing topics and subscriptions** works normally
> >    - ✅ **Ledger writes to BookKeeper** continue unaffected
> >    - ❌ **Creating new topics or subscriptions** will be blocked temporarily
> >    - ❌ **Namespace/policy updates** will be blocked temporarily
> >    - ❌ **Bundle ownership changes** will be deferred until migration
> > completes
> >
> > ### Operational Safety
> >
> > 1. **No Downtime**: Brokers and bookies remain online throughout the
> > migration. **Data plane operations (publish/consume) continue without
> > interruption.** Only metadata operations are temporarily blocked
> > during the migration phases.
> >
> > 2. **Graceful Failure**: If migration fails at any point, phase
> > transitions to FAILED and cluster returns to source store
> > automatically.
> >
> > 3. **Session Events**: Components receive `SessionLost` event during
> > migration to defer non-critical writes (e.g., ledger rollovers), and
> > `SessionReestablished` when migration completes or fails.
> >
> > 4. **Participant Coordination**: Migration waits for all participants
> > to complete preparation before copying data.
> >
> > ### Consistency
> >
> > 1. **Atomic Cutover**: All participants switch to target store
> > simultaneously when COMPLETED phase is detected.
> >
> > 2. **Ephemeral Session Consistency**: Each participant manages its own
> > ephemeral nodes in target store with proper session management.
> >
> > 3. **No Dual-Write Complexity**: By blocking writes during migration,
> > we avoid complex dual-write error handling and data divergence issues.
> >
> > ## Configuration
> >
> > ### No Configuration Changes for Migration
> >
> > A key property of this design is that **no configuration changes are
> > needed to start migration**:
> >
> > - Brokers and bookies continue using their existing `metadataStoreUrl`
> > config
> > - The `DualMetadataStore` wrapper is automatically applied when using
> > ZooKeeper
> > - Target URL is provided only when triggering migration via CLI
> >
> > ### Post-Migration Configuration
> >
> > After migration completes and validation period ends, update config files:
> >
> > ```properties
> > # Before migration
> > metadataStoreUrl=zk://zk1:2181,zk2:2181,zk3:2181/pulsar
> >
> > # After migration (update and rolling restart)
> > metadataStoreUrl=oxia://oxia1:6648
> > ```
> >
> > ## Comparison with Kafka's ZooKeeper → KRaft Migration
> >
> > Apache Kafka faced a similar challenge migrating from ZooKeeper to
> > KRaft (Kafka Raft). Their approach provides useful comparison points:
> >
> > ### Kafka's Approach (KIP-866)
> >
> > **Migration Strategy:**
> > - **Dual-mode operation**: Kafka brokers run in a hybrid mode where
> > the KRaft controller reads from ZooKeeper
> > - **Metadata synchronization**: KRaft controller actively mirrors
> > metadata from ZooKeeper to KRaft
> > - **Phased cutover**: Operators manually transition from ZK_MIGRATION
> > mode to KRAFT mode
> > - **Write forwarding**: During migration, metadata writes go to
> > ZooKeeper and are replicated to KRaft
> >
> > **Timeline:**
> > - Migration can take hours or days as metadata is continuously synchronized
> > - Requires careful monitoring of lag between ZooKeeper and KRaft
> > - Rollback possible until final KRAFT mode is committed
> >
> > ### Pulsar's Approach (This PIP)
> >
> > **Migration Strategy:**
> > - **Transparent wrapper**: DualMetadataStore wraps existing store
> > without broker code changes
> > - **Read-only migration**: Metadata writes blocked during migration (<
> > 30 seconds for most clusters)
> > - **Atomic copy**: All persistent data copied in one operation with
> > version preservation
> > - **Single source of truth**: No dual-write or dual-read - metadata
> > always consistent
> > - **Automatic cutover**: All participants switch simultaneously when
> > COMPLETED phase detected
> >
> > **Timeline:**
> > - Migration completes in **< 30 seconds** for typical deployments
> > (even up to 500 MB metadata)
> > - No lag monitoring needed
> > - Automatic rollback on failure (FAILED phase)
> >
> > ### Key Differences
> >
> > | Aspect | Kafka (ZK → KRaft) | Pulsar (ZK → Oxia) |
> > |--------|-------------------|-------------------|
> > | **Migration Duration** | Hours to days | **< 30 seconds** (up to 500 MB) |
> > | **Metadata Writes** | Continue during migration | Blocked during
> > migration |
> > | **Data Plane** | Unaffected | Unaffected (publish/consume continues) |
> > | **Approach** | Continuous sync + dual-mode | Atomic copy + read-only
> > mode |
> > | **Consistency** | Dual-write (eventual consistency) | **Single
> > source of truth (always consistent)** |
> > | **Complexity** | High (dual-mode broker logic) | Low (transparent
> > wrapper) |
> > | **Version Preservation** | Not applicable (different metadata
> > models) | Yes (conditional writes preserved) |
> > | **Rollback** | Manual, complex | Automatic on failure |
> > | **Monitoring** | Requires lag tracking | Simple phase monitoring |
> >
> > ### Why Pulsar's Approach Differs
> >
> > 1. **Data Plane Independence**: **The key insight is that Pulsar's
> > data plane (publish/consume, ledger writes) does not require metadata
> > writes to function.** This architectural property allows pausing
> > metadata writes for a brief period (< 30 seconds) without affecting
> > data operations. This is what makes the migration **provably safe and
> > consistent**, not the metadata size.
> >
> > 2. **Write-Pause Safety**: Pausing writes during copy ensures:
> >    - No dual-write complexity
> >    - No data divergence between stores
> >    - No conflict resolution needed
> >    - Guaranteed consistency
> >
> >    This works regardless of metadata size - whether 50K nodes or
> > millions of topics. The migration handles large metadata volumes
> > through high concurrency (1000 parallel operations), completing in <
> > 30 seconds even for 500 MB.
> >
> > 3. **Ephemeral Node Handling**: Pulsar has significant ephemeral
> > metadata (broker registrations, bundle ownership), making dual-write
> > complex. Read-only mode simplifies this.
> >
> > 4. **Conditional Writes**: Pulsar relies heavily on compare-and-set
> > operations. Version preservation ensures these continue working
> > post-migration, which Kafka doesn't need to address.
> >
> > 5. **Architectural Enabler**: Pulsar's separation of data plane and
> > metadata plane allows brief metadata write pauses without data plane
> > impact, enabling a simpler, safer migration approach.
> >
> > ### Lessons from Kafka's Experience
> >
> > Pulsar's design incorporates lessons from Kafka's migration:
> >
> > - ✅ **Avoid dual-write complexity**: Kafka found dual-mode operation
> > added significant code complexity. Pulsar's read-only approach is
> > simpler **and guarantees consistency**.
> > - ✅ **Clear phase boundaries**: Kafka's migration has unclear
> > "completion" point. Pulsar has explicit COMPLETED phase.
> > - ✅ **Automatic participant coordination**: Kafka requires manual
> > broker restarts. Pulsar participants coordinate automatically.
> > - ✅ **Fast migration**: **< 30 seconds** read-only window is
> > acceptable for most production environments
> > - ❌ **Brief write unavailability**: Pulsar accepts brief metadata
> > write unavailability (< 30 sec) vs Kafka's continuous operation, but
> > gains guaranteed consistency and simplicity.
> >
> >
> > ## References
> >
> > - [PIP-45: Pluggable metadata interface](https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface)
> > - [Oxia: A Scalable Metadata Store](https://github.com/streamnative/oxia)
> > - [MetadataStore Interface](https://github.com/apache/pulsar/blob/master/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataStore.java)
> > - [KIP-866: ZooKeeper to KRaft Migration](https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration)
> >
> >
> > --
> > Matteo Merli
> > <[email protected]>
> >
>
