[
https://issues.apache.org/jira/browse/HDDS-12307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated HDDS-12307:
-------------------------------
Description:
Currently, there are a few cross-region (geo-replicated) DR solutions for
Ozone buckets:
* Run a periodic distcp from the source bucket to the target bucket
* Take a snapshot of the bucket and ship it to the remote DR site
([https://ozone.apache.org/docs/edge/feature/snapshot.html])
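To make the cost of the periodic-copy approach concrete, here is a minimal
sketch of a single copy cycle. This is not distcp's actual API;
`list_source_keys` and `copy_key` are hypothetical callbacks standing in for
the bucket listing and the per-key copy.

```python
def full_copy_cycle(list_source_keys, copy_key):
    """One cycle of the periodic-copy approach: re-list every key in the
    source bucket and copy each one to the target. Re-listing the entire
    bucket on every run is what makes this pattern expensive at scale."""
    copied = 0
    for key in list_source_keys():  # O(total keys) listing on every cycle
        copy_key(key)
        copied += 1
    return copied
```

A scheduler (e.g. cron) would invoke this cycle repeatedly; data freshness is
bounded by the cycle interval plus the copy duration.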
There are pros and cons to these approaches:
* Pros
** It is simpler: setting up periodic jobs can be done quite easily (e.g.
using cron jobs)
** The distcp implementation sets up a MapReduce job that parallelizes the
copy from the source cluster to the target cluster
** No additional Ozone components needed
* Cons
** It is not “realtime”: the freshness of the data depends on how frequently
and how quickly the distcp runs
** It incurs significant overhead on the source cluster: it requires scanning
all the files in the source cluster (and possibly in the target cluster)
Cloudera Replication Manager
([https://docs.cloudera.com/replication-manager/1.5.4/replication-policies/topics/rm-pvce-understand-ozone-replication-policy.html])
adds incremental replication support after the initial bootstrap step by
taking the snapdiff between two snapshots. This is better since there is no
need to list all the keys under the bucket again, but it is still not
realtime.
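The snapdiff-driven loop can be sketched as follows. Snapshots are modeled
here as plain dicts of key -> content, and `snapshot_diff` is an illustrative
stand-in for Ozone's snapdiff output, not the real API.

```python
def snapshot_diff(prev, curr):
    """Key-level delta between two bucket snapshots (dicts of key -> content):
    keys that were added or modified, and keys that were deleted."""
    puts = {k: v for k, v in curr.items() if prev.get(k) != v}
    deletes = [k for k in prev if k not in curr]
    return puts, deletes

def incremental_replicate(prev_snap, curr_snap, target):
    """Apply only the delta to the target bucket, avoiding a full re-listing
    of the source bucket. Returns (#puts, #deletes) applied."""
    puts, deletes = snapshot_diff(prev_snap, curr_snap)
    target.update(puts)
    for k in deletes:
        target.pop(k, None)
    return len(puts), len(deletes)
```

Each replication round diffs the newest snapshot against the previous one, so
the work is proportional to the change volume rather than the bucket size.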
This ticket tracks possible solutions for realtime asynchronous bucket
replication between two clusters in different regions (with 100+ms latency).
The current idea is to have a CDC (Change Data Capture) framework on the OM
whose changes are sent to a Replication subsystem: a Replication Queue
receives the delta from the CDC, and a Syncer (Replicator) processes the
delta in the queue and replicates it to the target cluster.
* The CDC component can subscribe to and receive any updates for the buckets
through gRPC bidirectional streaming APIs
** The "change data" can be the Raft log / RocksDB WAL / OM audit log
* The Replication Queue can be something like a Kafka topic / Ratis
LogService (https://issues.apache.org/jira/browse/RATIS-271) / a local
persistent queue (e.g. Chronicle Queue)
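Whatever backend is chosen, the Replication Queue needs offset-based,
ack-until-consumed semantics so a restarted Syncer can re-read deltas it has
not yet applied. A minimal in-memory sketch of that contract (the class and
method names are hypothetical; a real implementation would persist the log):

```python
class ReplicationQueue:
    """Stand-in for the Replication Queue: entries are retained until the
    consumer acknowledges their offset, so unacknowledged deltas survive a
    Syncer restart (modulo process durability, which a real backend adds)."""

    def __init__(self):
        self._entries = []       # list of (offset, delta)
        self._next_offset = 0
        self._acked = -1         # highest acknowledged offset

    def append(self, delta):
        off = self._next_offset
        self._entries.append((off, delta))
        self._next_offset += 1
        return off

    def ack(self, offset):
        """Acknowledge everything up to `offset` and compact the log."""
        self._acked = max(self._acked, offset)
        self._entries = [(o, d) for o, d in self._entries if o > self._acked]

    def unacked(self):
        """Entries the Syncer still has to replicate."""
        return [(o, d) for o, d in self._entries if o > self._acked]
```

Kafka gives the same semantics via consumer-group offsets; Chronicle Queue
via tailer indexes.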
* The Replication subsystem can use a master-worker pattern
** The master receives bucket replication tasks from the user and assigns a
worker to run each cross-region replication task
*** If a worker fails, the master reassigns its replication tasks to another
worker
** The worker runs the replication tasks assigned by the master
** We can use the OM service as the master service (e.g. the source cluster's
OM service)
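The assignment and failover logic of the master can be sketched as below. All
names are hypothetical, not existing Ozone classes; real worker failure would
be detected via heartbeats rather than an explicit call.

```python
class ReplicationMaster:
    """Master side of the master-worker pattern: tracks which worker owns
    each bucket replication task and reassigns tasks on worker failure."""

    def __init__(self, workers):
        self.workers = set(workers)
        self.assignments = {}    # task -> worker

    def submit(self, task):
        """Assign the task to the least-loaded live worker."""
        load = lambda w: sum(1 for v in self.assignments.values() if v == w)
        worker = min(self.workers, key=load)
        self.assignments[task] = worker
        return worker

    def on_worker_failure(self, failed):
        """Remove the failed worker and resubmit the tasks it owned."""
        self.workers.discard(failed)
        moved = [t for t, w in self.assignments.items() if w == failed]
        for t in moved:
            self.submit(t)
        return moved
```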
* The Replication subsystem can read from an OM follower / listener so as not
to affect the OM leader
** An OM listener would be better since it is never promoted to leader,
whereas an OM follower might be, which would require additional logic to
switch to a different follower.
There are two major steps when setting up bucket replication:
# Initial batch replication of the source bucket (bootstrap)
** Backfill the newly created target bucket with the existing objects from
the source bucket
** The replication subsystem takes a bucket snapshot, divides the keys into
partitions, and sends them to the workers for replication
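The bootstrap partitioning step can be sketched as splitting the snapshot's
sorted key list into contiguous ranges, one batch per worker. This is an
illustrative sketch; a real implementation would partition by key ranges
rather than materializing the full list.

```python
def partition_keys(keys, num_workers):
    """Split keys into up to num_workers contiguous, near-equal partitions,
    preserving sorted order so each worker copies one key range."""
    keys = sorted(keys)
    n = max(1, num_workers)
    size, rem = divmod(len(keys), n)
    parts, start = [], 0
    for i in range(n):
        # the first `rem` partitions take one extra key
        end = start + size + (1 if i < rem else 0)
        if start < end:
            parts.append(keys[start:end])
        start = end
    return parts
```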
# Asynchronous live bucket replication
** Tail the subsequent changes (CDC) of the source bucket and replicate them
to the destination bucket
** There are two phases
*** Incremental snapshot: similar to Cloudera Replication Manager, the
replication subsystem takes periodic snapshots, computes the snapdiff between
the previous and current bucket snapshots, and replicates the changes to the
target bucket.
**** This would be the baseline implementation.
*** Near-realtime bucket replication: the main idea is that each Syncer
(Replicator) creates a persistent gRPC bidirectional streaming channel (over
HTTP/2) with one of the OM nodes, which sends the log entries related to keys
in the specific bucket via a Log Tailer. The Syncer then persists the log
entries to its internal persistent queue, which is consumed by a worker pool
that replicates the data to the destination bucket.
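The near-realtime path above can be sketched as a tailer feeding entries into
the Syncer's internal queue, drained by a pool of worker threads. The gRPC
stream is modeled here as a plain iterable of entries, and the entry shape is
hypothetical; this shows only the queue/worker-pool plumbing.

```python
import queue
import threading

def run_syncer(log_entries, replicate, num_workers=2):
    """Drain a stream of bucket change-log entries through an internal queue
    consumed by a pool of workers, each applying entries to the destination
    bucket via the `replicate` callback."""
    q = queue.Queue()

    def worker():
        while True:
            entry = q.get()
            if entry is None:        # poison pill: shut this worker down
                break
            replicate(entry)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for entry in log_entries:        # the "tailer": enqueue each change
        q.put(entry)
    q.join()                         # wait until every entry is replicated
    for _ in threads:                # stop the workers
        q.put(None)
    for t in threads:
        t.join()
```

In the real design the tailer runs continuously on the gRPC stream and the
queue is persistent, so replication survives Syncer restarts.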
> Realtime Cross-Region Bucket Replication
> ----------------------------------------
>
> Key: HDDS-12307
> URL: https://issues.apache.org/jira/browse/HDDS-12307
> Project: Apache Ozone
> Issue Type: New Feature
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)