+1 (binding)
On Sun, 8 Feb 2026 at 17:17, Tao Jiuming <[email protected]> wrote:
> +1 nonbinding

On Fri, 6 Feb 2026 at 01:43, Matteo Merli <[email protected]> wrote:

PIP PR: https://github.com/apache/pulsar/pull/25196

PR with implementation: https://github.com/apache/pulsar/pull/25219

----

# PIP-454: Metadata Store Migration Framework

## Motivation

Apache Pulsar currently uses Apache ZooKeeper as its metadata store for broker coordination, topic metadata, namespace policies, and BookKeeper ledger management. While ZooKeeper has served Pulsar well, there are several motivations for enabling migration to alternative metadata stores:

1. **Operational Simplicity**: Alternative metadata stores such as Oxia may offer simpler operations, better observability, or reduced operational overhead compared to ZooKeeper ensembles.

2. **Performance Characteristics**: Different metadata stores have different performance profiles. Some workloads may benefit from stores optimized for high throughput or low latency.

3. **Deployment Flexibility**: Organizations may prefer metadata stores that align better with their existing infrastructure and expertise.

4. **Zero-Downtime Migration**: Operators need a safe, automated way to migrate metadata between stores without service interruption.

Currently, there is no supported path for migrating from one metadata store to another without cluster downtime. This PIP proposes a **safe, simple migration framework** that ensures metadata consistency by avoiding complex dual-write/dual-read patterns.
The framework enables:

- **Zero-downtime migration** from any metadata store to any other supported store
- **Automatic ephemeral node recreation** in the target store
- **Version preservation** to ensure conditional writes continue working
- **Automatic failure recovery** if issues are detected
- **Minimal configuration changes**: no config updates are needed until after migration completes

## Goal

Provide a safe, automated framework for migrating Apache Pulsar's metadata from one store implementation (e.g., ZooKeeper) to another (e.g., Oxia) with zero service interruption.

### In Scope

- Migration framework supporting any source → any target metadata store
- Automatic ephemeral node recreation by brokers and bookies
- Persistent data copy with version preservation
- CLI commands for migration control and monitoring
- Automatic failure recovery during migration
- Support for broker and bookie participation
- Read-only mode during migration for consistency

### Out of Scope

- Developing new metadata store implementations (Oxia and Etcd support already exists)
- Cross-cluster metadata synchronization (a different use case)
- Automated rollback after the COMPLETED phase (requires manual intervention)
- Migration of the configuration metadata store and geo-replicated clusters (can be done separately)

## High Level Design

The migration framework introduces a **DualMetadataStore** wrapper that transparently handles migration without modifying existing metadata store implementations.

### Key Principles

1. **Transparent Wrapping**: The `DualMetadataStore` wraps the existing source store (e.g., `ZKMetadataStore`) without modifying its implementation.

2. **Lazy Target Initialization**: The target store is only initialized when migration begins, triggered by a flag in the source store.

3. **Ephemeral-First Approach**: Before copying persistent data, all brokers and bookies recreate their ephemeral nodes in the target store. This ensures the cluster is "live" in both stores during migration.

4. **Read-Only Mode During Migration**: To ensure consistency, all metadata writes are blocked during the PREPARATION and COPYING phases. Components receive `SessionLost` events to defer non-critical operations (e.g., ledger rollovers).

5. **Phase-Based Migration**: Migration proceeds through well-defined phases (PREPARATION → COPYING → COMPLETED).

6. **Generic Framework**: The framework is agnostic to specific store implementations: it works with any source and target that implement the `MetadataStore` interface.

7. **Guaranteed Consistency**: By blocking writes during migration and using an atomic copy, metadata is **always in a consistent state**. No dual-write complexity, no data divergence, no consistency issues.
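The routing behavior implied by these principles can be sketched in Java. This is a minimal illustration only, not the actual `DualMetadataStore` code: the real Pulsar `MetadataStore` interface is asynchronous and much richer, and `SimpleStore`, `InMemoryStore`, and the `onMigrationFlag` hook below are names invented for this sketch.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Minimal stand-in for the store interface; the real Pulsar API is async. */
interface SimpleStore {
    Optional<byte[]> get(String path);
    void put(String path, byte[] value);
}

/** Trivial in-memory store used only to make the sketch runnable. */
class InMemoryStore implements SimpleStore {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();
    public Optional<byte[]> get(String path) { return Optional.ofNullable(data.get(path)); }
    public void put(String path, byte[] value) { data.put(path, value); }
}

/** Wrapper that routes operations based on the current migration phase. */
class DualStore implements SimpleStore {
    enum Phase { NOT_STARTED, PREPARATION, COPYING, COMPLETED, FAILED }

    private final SimpleStore source;
    private volatile SimpleStore target;   // lazily initialized when migration begins
    private volatile Phase phase = Phase.NOT_STARTED;

    DualStore(SimpleStore source) { this.source = source; }

    /** Invoked when the watched migration flag changes in the source store. */
    void onMigrationFlag(Phase newPhase, SimpleStore targetStore) {
        if (target == null) {
            target = targetStore;          // lazy target initialization
        }
        phase = newPhase;
    }

    public Optional<byte[]> get(String path) {
        // Reads stay on the source store until the cutover to COMPLETED.
        return (phase == Phase.COMPLETED ? target : source).get(path);
    }

    public void put(String path, byte[] value) {
        switch (phase) {
            case PREPARATION:
            case COPYING:
                // Read-only window: writes fail fast instead of diverging.
                throw new IllegalStateException("metadata store is read-only during migration");
            case COMPLETED:
                target.put(path, value);   // atomic cutover: target only
                break;
            default:
                source.put(path, value);   // NOT_STARTED / FAILED: source only
        }
    }
}
```

Note how FAILED needs no special handling: routing simply falls back to the source store, which matches the automatic-rollback behavior described later.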
## Detailed Design

### Migration Phases

```
NOT_STARTED
    ↓
PREPARATION  ← All brokers/bookies recreate ephemeral nodes in target
             ← Metadata writes are BLOCKED (read-only mode)
    ↓
COPYING      ← Coordinator copies persistent data source → target
             ← Metadata writes still BLOCKED
    ↓
COMPLETED    ← Migration complete, all services using target store
             ← Metadata writes ENABLED on target
    ↓
After validation period:
  * Update config and restart brokers & bookies
  * Decommission source store

(If errors occur):
FAILED       ← Rollback to source store, writes ENABLED
```

### Phase 1: NOT_STARTED → PREPARATION

**Participant Registration (at startup):**
Each broker and bookie registers itself as a migration participant by creating a sequential ephemeral node:

- Path: `/pulsar/migration-coordinator/participants/id-NNNN` (sequential)
- This allows the coordinator to know how many participants exist before migration starts

**Administrator triggers migration:**

```bash
pulsar-admin metadata-migration start --target oxia://oxia1:6648
```

**Coordinator actions:**

1. Creates the migration flag in the source store at `/pulsar/migration-coordinator/migration`:

```json
{
  "phase": "PREPARATION",
  "targetUrl": "oxia://oxia1:6648"
}
```

**Broker/Bookie actions (automatic, triggered by watching the flag):**

1. Detect the migration flag via a watch on `/pulsar/migration-coordinator/migration`
2. Defer non-critical metadata writes (e.g., ledger rollovers, bundle ownership changes)
3. Initialize a connection to the target store
4. Recreate ALL ephemeral nodes in the target store
5. **Delete** the participant registration node to signal "ready"

**The coordinator waits for all participant nodes to be deleted (indicating all participants are ready).**

### Phase 2: PREPARATION → COPYING

**Coordinator actions:**

1. Updates the phase to `COPYING`
2. Performs a recursive copy of persistent data from source → target:
   - Skips ephemeral nodes (already recreated)
   - Concurrent operations limited by a semaphore (default: 1000 pending ops)
   - Breadth-first traversal to process all paths
   - Progress logged periodically

**During this phase:**

- Metadata writes are BLOCKED (return an error to clients)
- Metadata reads continue normally from the source store
- Ephemeral nodes remain alive in both stores
- **Data plane operations unaffected**: publish/consume and ledger writes continue normally
- Version-id and modification count are preserved using the direct Oxia client

**Estimated duration:**

- **< 30 seconds** for typical deployments with up to **500 MB of metadata** in ZooKeeper

**Impact on operations:**

- ✅ Existing topics: Publish and consume continue without interruption
- ✅ BookKeeper: Ledger writes and reads continue normally
- ✅ Clients: Connected producers and consumers unaffected
- ❌ Admin operations: Topic/namespace creation blocked temporarily
- ❌ Bundle operations: Load balancing deferred until completion

### Phase 3: COPYING → COMPLETED

**Coordinator actions:**

1. Updates the phase to `COMPLETED`
2. Logs a success message with the total copied node count

**Broker/Bookie actions (automatic, triggered by the phase update):**

1. Detect the `COMPLETED` phase
2. Deferred operations can now proceed
3. Switch routing:
   - **Writes**: Go to the target store only
   - **Reads**: Go to the target store only

**At this point:**

- The cluster is running on the target store
- The source store remains available for safety
- Metadata writes are enabled again

**Operator follow-up (after validation period):**

1. Update configuration files:

```properties
# Before (ZooKeeper):
metadataStoreUrl=zk://zk1:2181,zk2:2181/pulsar

# After (Oxia):
metadataStoreUrl=oxia://oxia1:6648
```

2. Perform a rolling restart with the new config
3. After all services have restarted, decommission the source store

### Failure Handling: ANY_PHASE → FAILED

**If migration fails at any point:**

1. The coordinator updates the phase to `FAILED`
2. Broker/Bookie actions:
   - Detect the `FAILED` phase
   - Discard the target store connection
   - Continue using the source store
   - Metadata writes are enabled again

**Operator actions:**

1. Review logs to understand the failure cause
2. Fix the underlying issue
3. Retry the migration with `pulsar-admin metadata-migration start --target <url>`

## Implementation Details

### Key Implementation Details

1. **Direct Oxia Client Usage**: The coordinator uses `AsyncOxiaClient` directly instead of going through the `MetadataStore` interface. This allows setting the version-id and modification count to match the source values, ensuring conditional writes (compare-and-set operations) continue to work correctly after migration.

2. **Breadth-First Traversal**: Processes paths level by level using a work queue, enabling high concurrency while preventing deep recursion.

3. **Concurrent Operations**: Uses a semaphore to limit pending operations (default: 1000), balancing throughput with memory usage.
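The copy loop in items 2 and 3 above can be sketched as a breadth-first traversal with a `Semaphore` bounding in-flight operations. The `TreeStore` interface and `MapTreeStore` helper are stand-ins invented for this sketch; the real coordinator works against the `MetadataStore` API and the direct Oxia client.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class PersistentDataCopier {
    /** Simplified stand-in for the real store API used by the coordinator. */
    interface TreeStore {
        byte[] get(String path);
        List<String> children(String path);
        boolean isEphemeral(String path);
        void put(String path, byte[] value);
    }

    /** In-memory implementation, only to make the sketch runnable. */
    static class MapTreeStore implements TreeStore {
        final Map<String, byte[]> data = new ConcurrentHashMap<>();
        final Map<String, List<String>> tree = new ConcurrentHashMap<>();
        final Set<String> ephemeral = ConcurrentHashMap.newKeySet();
        public byte[] get(String p) { return data.get(p); }
        public List<String> children(String p) { return tree.getOrDefault(p, List.of()); }
        public boolean isEphemeral(String p) { return ephemeral.contains(p); }
        public void put(String p, byte[] v) { data.put(p, v); }
    }

    /** Breadth-first copy of the persistent tree, skipping ephemeral nodes. */
    static long copy(TreeStore source, TreeStore target, String root, int maxPendingOps) {
        Semaphore pending = new Semaphore(maxPendingOps);   // bound in-flight ops (PIP default: 1000)
        ExecutorService pool = Executors.newFixedThreadPool(8);
        Queue<String> frontier = new ArrayDeque<>();        // work queue for BFS
        frontier.add(root);
        long copied = 0;

        while (!frontier.isEmpty()) {
            String path = frontier.poll();
            frontier.addAll(source.children(path));         // breadth-first expansion
            if (source.isEphemeral(path)) {
                continue;   // ephemeral nodes were already recreated by their owners
            }
            pending.acquireUninterruptibly();               // back-pressure on pending copies
            pool.submit(() -> {
                try {
                    target.put(path, source.get(path));
                } finally {
                    pending.release();
                }
            });
            copied++;
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return copied;
    }
}
```

The queue-based traversal keeps the recursion depth constant regardless of tree depth, while the semaphore caps memory held by pending writes.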
### Data Structures

**Migration State** (`/pulsar/migration-coordinator/migration`):

```json
{
  "phase": "PREPARATION",
  "targetUrl": "oxia://oxia1:6648/default"
}
```

Fields:

- `phase`: Current migration phase (NOT_STARTED, PREPARATION, COPYING, COMPLETED, FAILED)
- `targetUrl`: Target metadata store URL (e.g., `oxia://oxia1:6648/default`)

**Participant Registration** (`/pulsar/migration-coordinator/participants/id-NNNN`):

- Sequential ephemeral node created by each broker/bookie at startup
- Empty data (presence indicates participation)
- Deleted by the participant when preparation is complete (signals "ready")
- The coordinator waits for all of them to be deleted before proceeding to the COPYING phase

**No additional state tracking**: The simplified design removes complex state tracking and checksums. Migration state is kept minimal.

### CLI Commands

```bash
# Start migration
pulsar-admin metadata-migration start --target <target-url>

# Check status
pulsar-admin metadata-migration status
```

The simplified design only requires two commands. Rollback happens automatically if migration fails (the phase transitions to FAILED).

### REST API

```
POST /admin/v2/metadata/migration/start
Body: { "targetUrl": "oxia://..." }

GET /admin/v2/metadata/migration/status
Returns: { "phase": "COPYING", "targetUrl": "oxia://..." }
```

## Safety Guarantees

### Why This Approach is Safe

**The migration design guarantees metadata consistency by avoiding dual-write and dual-read patterns entirely:**

1. **Single Source of Truth**: At any given time, there is exactly ONE active metadata store:
   - Before migration: Source store (ZooKeeper)
   - During PREPARATION and COPYING: Source store (read-only)
   - After COMPLETED: Target store (Oxia)

2. **No Dual-Write Complexity**: Unlike approaches that write to both stores simultaneously, this design eliminates:
   - Write synchronization issues
   - Conflict resolution between stores
   - Data divergence problems
   - Partial failure handling complexity

3. **No Dual-Read Complexity**: Unlike approaches that read from both stores, this design eliminates:
   - Read consistency issues
   - Cache invalidation across stores
   - Stale data problems
   - Complex fallback logic

4. **Atomic Cutover**: All participants switch stores simultaneously when the COMPLETED phase is detected. There is no ambiguous state where some participants use one store and others use another.

5. **Fast Migration Window**: At **< 30 seconds** for typical metadata sizes (even up to 500 MB), the read-only window is minimal and acceptable for most production environments.

**Bottom line**: Metadata is **always in a consistent state** - either fully in the source store or fully in the target store, never split or diverged between them.

### Data Integrity

1. **Version Preservation**: All persistent data is copied with the original version-id and modification count preserved. This ensures conditional writes (compare-and-set operations) continue working after migration.

2. **Ephemeral Node Recreation**: All ephemeral nodes are recreated by their owning brokers/bookies before the persistent data copy begins.

3. **Read-Only Mode**: All metadata writes are blocked during the PREPARATION and COPYING phases, ensuring no data inconsistencies during migration.

**Important**: Read-only mode only affects metadata operations. Data plane operations continue normally:

- ✅ **Publishing and consuming messages** works without interruption
- ✅ **Reading from existing topics and subscriptions** works normally
- ✅ **Ledger writes to BookKeeper** continue unaffected
- ❌ **Creating new topics or subscriptions** will be blocked temporarily
- ❌ **Namespace/policy updates** will be blocked temporarily
- ❌ **Bundle ownership changes** will be deferred until migration completes

### Operational Safety

1. **No Downtime**: Brokers and bookies remain online throughout the migration. **Data plane operations (publish/consume) continue without interruption.** Only metadata operations are temporarily blocked during the migration phases.

2. **Graceful Failure**: If migration fails at any point, the phase transitions to FAILED and the cluster returns to the source store automatically.

3. **Session Events**: Components receive a `SessionLost` event during migration to defer non-critical writes (e.g., ledger rollovers), and `SessionReestablished` when migration completes or fails.

4. **Participant Coordination**: Migration waits for all participants to complete preparation before copying data.

### Consistency

1. **Atomic Cutover**: All participants switch to the target store simultaneously when the COMPLETED phase is detected.

2. **Ephemeral Session Consistency**: Each participant manages its own ephemeral nodes in the target store with proper session management.

3. **No Dual-Write Complexity**: By blocking writes during migration, we avoid complex dual-write error handling and data divergence issues.
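As an illustration of the session-event handling mentioned under Operational Safety, a component might defer non-critical writes on `SessionLost` and replay them on `SessionReestablished`. The event names mirror Pulsar's `SessionEvent` values, but the listener and queue plumbing below are a hypothetical sketch, not actual broker code.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class DeferredWriteHandler {
    // Mirrors two of Pulsar's SessionEvent values; defined locally for the sketch.
    enum SessionEvent { SessionLost, SessionReestablished }

    private volatile boolean writable = true;
    private final Queue<Runnable> deferred = new ConcurrentLinkedQueue<>();

    void onSessionEvent(SessionEvent event) {
        if (event == SessionEvent.SessionLost) {
            writable = false;               // migration in progress: hold writes
        } else {
            writable = true;                // migration completed or failed: resume
            Runnable op;
            while ((op = deferred.poll()) != null) {
                op.run();                   // replay deferred operations
            }
        }
    }

    /** Non-critical write, e.g. a ledger rollover: run now or defer. */
    void submitNonCritical(Runnable op) {
        if (writable) {
            op.run();
        } else {
            deferred.add(op);
        }
    }

    int pendingCount() { return deferred.size(); }
}
```

Because `SessionReestablished` fires on both COMPLETED and FAILED, deferred operations are replayed against whichever store is active after the transition.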
## Configuration

### No Configuration Changes for Migration

The beauty of this design is that **no configuration changes are needed to start migration**:

- Brokers and bookies continue using their existing `metadataStoreUrl` config
- The `DualMetadataStore` wrapper is automatically applied when using ZooKeeper
- The target URL is provided only when triggering migration via the CLI

### Post-Migration Configuration

After migration completes and the validation period ends, update the config files:

```properties
# Before migration
metadataStoreUrl=zk://zk1:2181,zk2:2181,zk3:2181/pulsar

# After migration (update and rolling restart)
metadataStoreUrl=oxia://oxia1:6648
```

## Comparison with Kafka's ZooKeeper → KRaft Migration

Apache Kafka faced a similar challenge migrating from ZooKeeper to KRaft (Kafka Raft). Their approach provides useful comparison points:

### Kafka's Approach (KIP-866)

**Migration Strategy:**

- **Dual-mode operation**: Kafka brokers run in a hybrid mode where the KRaft controller reads from ZooKeeper
- **Metadata synchronization**: The KRaft controller actively mirrors metadata from ZooKeeper to KRaft
- **Phased cutover**: Operators manually transition from ZK_MIGRATION mode to KRAFT mode
- **Write forwarding**: During migration, metadata writes go to ZooKeeper and are replicated to KRaft

**Timeline:**

- Migration can take hours or days as metadata is continuously synchronized
- Requires careful monitoring of the lag between ZooKeeper and KRaft
- Rollback is possible until the final KRAFT mode is committed

### Pulsar's Approach (This PIP)

**Migration Strategy:**

- **Transparent wrapper**: DualMetadataStore wraps the existing store without broker code changes
- **Read-only migration**: Metadata writes are blocked during migration (< 30 seconds for most clusters)
- **Atomic copy**: All persistent data is copied in one operation with version preservation
- **Single source of truth**: No dual-write or dual-read; metadata is always consistent
- **Automatic cutover**: All participants switch simultaneously when the COMPLETED phase is detected

**Timeline:**

- Migration completes in **< 30 seconds** for typical deployments (even up to 500 MB of metadata)
- No lag monitoring needed
- Automatic rollback on failure (FAILED phase)

### Key Differences

| Aspect | Kafka (ZK → KRaft) | Pulsar (ZK → Oxia) |
|--------|--------------------|--------------------|
| **Migration Duration** | Hours to days | **< 30 seconds** (up to 500 MB) |
| **Metadata Writes** | Continue during migration | Blocked during migration |
| **Data Plane** | Unaffected | Unaffected (publish/consume continues) |
| **Approach** | Continuous sync + dual-mode | Atomic copy + read-only mode |
| **Consistency** | Dual-write (eventual consistency) | **Single source of truth (always consistent)** |
| **Complexity** | High (dual-mode broker logic) | Low (transparent wrapper) |
| **Version Preservation** | Not applicable (different metadata models) | Yes (conditional writes preserved) |
| **Rollback** | Manual, complex | Automatic on failure |
| **Monitoring** | Requires lag tracking | Simple phase monitoring |

### Why Pulsar's Approach Differs

1. **Data Plane Independence**: **The key insight is that Pulsar's data plane (publish/consume, ledger writes) does not require metadata writes to function.** This architectural property allows pausing metadata writes for a brief period (< 30 seconds) without affecting data operations. This is what makes the migration **provably safe and consistent**, not the metadata size.

2. **Write-Pause Safety**: Pausing writes during the copy ensures:
   - No dual-write complexity
   - No data divergence between stores
   - No conflict resolution needed
   - Guaranteed consistency

   This works regardless of metadata size, whether 50K nodes or millions of topics. The migration handles large metadata volumes through high concurrency (1000 parallel operations), completing in < 30 seconds even for 500 MB.

3. **Ephemeral Node Handling**: Pulsar has significant ephemeral metadata (broker registrations, bundle ownership), which would make dual-write complex. Read-only mode simplifies this.

4. **Conditional Writes**: Pulsar relies heavily on compare-and-set operations. Version preservation ensures these continue working post-migration, which Kafka did not need to address.

5. **Architectural Enabler**: Pulsar's separation of the data plane and metadata plane allows brief metadata write pauses without data plane impact, enabling a simpler, safer migration approach.

### Lessons from Kafka's Experience

Pulsar's design incorporates lessons from Kafka's migration:

- ✅ **Avoid dual-write complexity**: Kafka found that dual-mode operation added significant code complexity. Pulsar's read-only approach is simpler **and guarantees consistency**.
- ✅ **Clear phase boundaries**: Kafka's migration has no clearly defined "completion" point. Pulsar has an explicit COMPLETED phase.
- ✅ **Automatic participant coordination**: Kafka requires manual broker restarts. Pulsar participants coordinate automatically.
- ✅ **Fast migration**: A **< 30 seconds** read-only window is acceptable for most production environments.
- ❌ **Brief write unavailability**: Pulsar accepts brief metadata write unavailability (< 30 sec) vs. Kafka's continuous operation, but gains guaranteed consistency and simplicity.
## References

- [PIP-45: Pluggable metadata interface](https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface)
- [Oxia: A Scalable Metadata Store](https://github.com/streamnative/oxia)
- [MetadataStore interface](https://github.com/apache/pulsar/blob/master/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataStore.java)
- [KIP-866: ZooKeeper to KRaft Migration](https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration) - Kafka's approach to metadata store migration

--
Matteo Merli
<[email protected]>
