Re: [PR] [improvement] split ccr.md to overview.md and feature.md [doris-website]

via GitHub Thu, 07 Nov 2024 19:22:02 -0800


yagagagaga commented on code in PR #1314:
URL: https://github.com/apache/doris-website/pull/1314#discussion_r1833645080



##########
docs/admin-manual/data-admin/ccr/overview.md:
##########
@@ -0,0 +1,606 @@
+---
+{
+    "title": "CCR (Cross Cluster Replication)",
+    "language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Cross Cluster Replication (CCR)
+## Overview
+
+Cross Cluster Replication (CCR) enables the synchronization of data changes 
from a source cluster to a target cluster at the database/table level. This 
feature can be used to ensure data availability for online services, isolate 
offline and online workloads, and build multiple data centers across various 
sites.
+
+CCR is applicable to the following scenarios:
+
+- Disaster recovery: This involves backing up enterprise data to another 
cluster and data center. In the event of a sudden incident causing business 
interruption or data loss, companies can recover data from the backup or 
quickly switch to the backup cluster. Disaster recovery is typically a 
must-have feature in use cases with high SLA requirements, such as those in 
finance, healthcare, and e-commerce.
+- Read/write separation: This is to isolate querying and writing operations to 
reduce their mutual impact and improve resource utilization. For example, in 
cases of high writing pressure or high concurrency, read/write separation can 
distribute read and write operations to read-only and write-only database 
instances in various regions. This helps ensure high database performance and 
stability.
+- Data transfer between headquarters and branch offices: In order to have 
unified data control and analysis within a corporation, the headquarters 
usually requires timely data synchronization from branch offices located in 
different regions. This avoids management confusion and wrong decision-making 
based on inconsistent data.
+- Isolated upgrades: During system cluster upgrades, there might be a need to 
roll back to a previous version. Many traditional upgrade methods do not allow 
rolling back due to incompatible metadata. CCR in Doris can address this issue 
by building a standby cluster for upgrade and conducting dual-running 
verification. Users can upgrade the clusters one by one. CCR is not dependent 
on specific versions, making version rollback feasible.
+
+### Task Categories
+
+CCR supports two categories of tasks: database-level and table-level. 
Database-level tasks synchronize data for an entire database, while table-level 
tasks synchronize data for a single table.
+
+## Design
+
+### Concepts
+
+- Source cluster: the cluster where business data is written and originates 
from, requiring Doris version 2.0
+
+- Target cluster: the destination cluster for cross cluster replication, 
requiring version 2.0
+
+- Binlog: the change log of the source cluster, including schema and data 
changes
+
+- Syncer: a lightweight process that retrieves binlogs from the source cluster and applies them to the target cluster
+
+### Architecture description
+
+![ccr-architecture-description](/images/ccr-architecture-description.png)
+
+CCR relies on a lightweight process called syncer. Syncers retrieve binlogs 
from the source cluster, directly apply the metadata to the target cluster, and 
notify the target cluster to pull data from the source cluster. CCR allows both 
full and incremental data migration.
+
+### Sync Methods
+
+CCR supports four synchronization methods:
+
+| Sync Method   | Principle | Trigger Timing |
+| ------------- | --------- | -------------- |
+| Full Sync     | Full backup from upstream, restore downstream. | Triggered by the first synchronization or operation, see the feature list for details. |
+| Partial Sync  | Backup at the upstream table or partition level, restore at the downstream table or partition level. | Triggered by operations, see the feature list for details. |
+| TXN           | Incremental data synchronization, downstream starts syncing after upstream commits. | Triggered by operations, see the feature list for details. |
+| SQL           | Replay upstream operations' SQL at the downstream. | Triggered by operations, see the feature list for details. |
+
+### Usage
+
+The usage of CCR is straightforward. Simply start the syncer service and send 
a command, and the syncers will take care of the rest.
+
+1. Deploy the source Doris cluster.
+2. Deploy the target Doris cluster.
+3. Both the source and target clusters need to enable binlog. Configure the 
following information in the fe.conf and be.conf files of the source and target 
clusters:
+
+```text
+enable_feature_binlog=true
+```
+
+4. Deploy syncers
+
+Build CCR syncer
+
+```shell
+git clone https://github.com/selectdb/ccr-syncer
+cd ccr-syncer
+bash build.sh <-j NUM_OF_THREAD> <--output SYNCER_OUTPUT_DIR>
+cd SYNCER_OUTPUT_DIR
+# Contact the Doris community for a free CCR binary package
+```
+
+
+Start and stop syncer
+
+
+```shell
+# Start
+cd bin && sh start_syncer.sh --daemon
+   
+# Stop
+sh stop_syncer.sh
+```
+
+5. Enable binlog in the source cluster.
+
+```shell
+# To synchronize an entire database, enable binlog for all of its tables with the bundled script.
+# Either edit host, port, user, password, and db (of the source cluster) in the script:
+vim shell/enable_db_binlog.sh
+# or pass them on the command line:
+./enable_db_binlog.sh --host $host --port $port --user $user --password $password --db $db
+
+# To synchronize a single table, enable binlog on that table only.
+# Run the following SQL statement on the source cluster:
+ALTER TABLE enable_binlog SET ("binlog.enable" = "true");
+```
+
+6. Submit a synchronization task to the syncer
+
+```shell
+curl -X POST -H "Content-Type: application/json" -d '{
+    "name": "ccr_test",
+    "src": {
+      "host": "localhost",
+      "port": "9030",
+      "thrift_port": "9020",
+      "user": "root",
+      "password": "",
+      "database": "your_db_name",
+      "table": "your_table_name"
+    },
+    "dest": {
+      "host": "localhost",
+      "port": "9030",
+      "thrift_port": "9020",
+      "user": "root",
+      "password": "",
+      "database": "your_db_name",
+      "table": "your_table_name"
+    }
+}' http://127.0.0.1:9190/create_ccr
+```
+
+Parameter description:
+
+```shell
+name: name of the CCR synchronization task; must be unique, and each name can be used only once
+host, port: host and MySQL (JDBC) port of the master FE of the corresponding cluster
+user, password: the credentials the syncer uses to start transactions, fetch data, etc.
+database, table:
+  for database-level synchronization, specify your_db_name and leave your_table_name empty
+  for table-level synchronization, specify both your_db_name and your_table_name
+```
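
The request body from step 6 can also be assembled programmatically. The following is a minimal Python sketch, not part of the syncer itself: `build_ccr_payload` and `create_ccr` are hypothetical helper names, and the sketch assumes (as in the curl example) that both clusters share the same ports, credentials, database, and table names. The field names and the `http://127.0.0.1:9190/create_ccr` endpoint are taken from the example above.

```python
import json
import urllib.request


def build_ccr_payload(name, src_host, dest_host, database, table="",
                      port="9030", thrift_port="9020", user="root", password=""):
    """Build the JSON body for a create_ccr request (hypothetical helper).

    An empty `table` asks for database-level sync; a non-empty one
    asks for table-level sync, following the parameter description.
    """
    def endpoint(host):
        # Same layout for "src" and "dest"; only the host differs in this sketch.
        return {
            "host": host,
            "port": port,
            "thrift_port": thrift_port,
            "user": user,
            "password": password,
            "database": database,
            "table": table,
        }

    return {"name": name, "src": endpoint(src_host), "dest": endpoint(dest_host)}


def create_ccr(payload, syncer_url="http://127.0.0.1:9190/create_ccr"):
    """POST the payload to the syncer and return the raw response body."""
    request = urllib.request.Request(
        syncer_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

For a database-level task, `build_ccr_payload("ccr_test", "localhost", "localhost", "your_db_name")` leaves `table` empty on both sides; passing the result to `create_ccr` requires a running syncer.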
+
+## Operation manual for syncer
+
+### Start syncer
+
+Syncer starts according to the given options and saves a pid file in the 
default or specified path. The pid file is named 
`host_port.pid`.
+
+**Output file structure**
+
+After compilation, the output path has the following structure:
+
+```text
+output_dir
+    bin
+        ccr_syncer
+        enable_db_binlog.sh
+        start_syncer.sh
+        stop_syncer.sh
+    db
+        [ccr.db] # Generated after running with the default configurations.
+    log
+        [ccr_syncer.log] # Generated after running with the default configurations.
+```
+
+**The start_syncer.sh in the following text refers to the start_syncer.sh 
under its corresponding path.**
+
+**Start options**
+
+**--daemon** 
+
+Runs syncer in the background. Defaults to false.
+
+```shell
+bash bin/start_syncer.sh --daemon
+```
+
+**--db_type** 
+
+Syncer can currently store its metadata in one of two databases: `sqlite3` (for 
local storage) or `mysql` (for local or remote storage).
+
+```shell
+bash bin/start_syncer.sh --db_type mysql
+```
+
+The default value is sqlite3.
+
+When using MySQL to store metadata, syncer uses `CREATE IF NOT EXISTS` to 
create a database called `ccr`, where the metadata tables related to CCR are 
saved.
+
+**--db_dir** 
+
+**This option only works when db uses** **`sqlite3`****.**
+
+It allows you to specify the name and path of the db file generated by sqlite3.
+
+```shell
+bash bin/start_syncer.sh --db_dir /path/to/ccr.db
+```
+
+The default path is `SYNCER_OUTPUT_DIR/db` and the default file name is 
`ccr.db`.
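
The start options are gated by `--db_type`: `--db_dir` applies only to `sqlite3`, and the `--db_host`, `db_port`, `db_user`, and `db_password` group applies only to `mysql`. A minimal Python sketch of that rule follows. `build_start_args` is a hypothetical helper; the double-dash spelling of `--db_port`, `--db_user`, and `--db_password` is an assumption, and rejecting a mismatched option is this sketch's choice (the script itself may simply ignore it).

```python
def build_start_args(daemon=False, db_type="sqlite3", db_dir=None,
                     db_host=None, db_port=None, db_user=None, db_password=None):
    """Assemble an argument list for bin/start_syncer.sh (hypothetical helper)."""
    if db_type not in ("sqlite3", "mysql"):
        raise ValueError(f"unsupported db_type: {db_type}")

    # Flag spellings other than --db_host are assumed, not confirmed by the docs.
    mysql_opts = {
        "--db_host": db_host,
        "--db_port": db_port,
        "--db_user": db_user,
        "--db_password": db_password,
    }
    if db_type == "sqlite3" and any(v is not None for v in mysql_opts.values()):
        raise ValueError("MySQL connection options require --db_type mysql")
    if db_type == "mysql" and db_dir is not None:
        raise ValueError("--db_dir requires --db_type sqlite3")

    args = ["bash", "bin/start_syncer.sh"]
    if daemon:
        args.append("--daemon")
    args += ["--db_type", db_type]
    if db_dir is not None:
        args += ["--db_dir", db_dir]
    for flag, value in mysql_opts.items():
        if value is not None:
            args += [flag, str(value)]
    return args
```

The returned list can be handed to `subprocess.run` from the syncer output directory; building a list rather than a single string avoids shell quoting issues with passwords.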
+
+**--db_host & db_port & db_user & db_password**
+
+**This option only works when db uses** **`mysql`****.**

Review Comment:
   ```suggestion
   **This option only works when db uses `mysql`.**
   ```
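
The suggestion above merges fragmented bold markers into a single bold run. The same cleanup could be applied mechanically across the docs. This is a minimal Python sketch, assuming only the two fragment shapes seen in this review (a `** **` gap between adjacent bold runs and an empty `****` pair); `merge_fragmented_bold` is a hypothetical helper, not part of the doris-website tooling.

```python
import re


def merge_fragmented_bold(line: str) -> str:
    """Collapse adjacent bold runs in one markdown line into a single run."""
    # "**a** **b**" -> "**a b**": drop a close/open pair separated only by spaces.
    line = re.sub(r"\*\*( +)\*\*", r"\1", line)
    # "**a****b**" -> "**ab**": drop a close/open pair with nothing in between.
    return line.replace("****", "")
```

Applied to the reviewed line, `merge_fragmented_bold` yields the text in the suggestion. It is deliberately narrow and would need review before use on lines that contain literal runs of asterisks.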



##########
docs/admin-manual/data-admin/ccr/overview.md:
##########
@@ -0,0 +1,606 @@
+
+**--db_dir** 
+
+**This option only works when db uses** **`sqlite3`****.**

Review Comment:
   ```suggestion
   **This option only works when db uses `sqlite3`.**
   ```



##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/data-admin/ccr/overview.md:
##########
@@ -0,0 +1,647 @@
+---
+{
+    "title": "跨集群数据同步",
+    "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
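+## Overview
+
+### What is CCR
+
+CCR (Cross Cluster Replication) synchronizes data changes from a source cluster to a target cluster at the database or table level. It can be used to improve data availability for online services, isolate online and offline workloads, and build a two-site, three-data-center deployment.
+
+### Applicable scenarios
+
+CCR is typically used for disaster recovery, read/write separation, data transfer between headquarters and branch offices, and isolated upgrades.
+
+- Disaster recovery: enterprise data is backed up to another cluster and data center. When an incident interrupts the business or causes data loss, data can be recovered from the backup, or traffic can be switched quickly to the standby cluster. Disaster recovery is generally required wherever SLA requirements are high, as is common in finance, healthcare, and e-commerce.
+
+- Read/write separation: query operations are separated from write operations to reduce their mutual impact and improve resource utilization. For example, under heavy write pressure or high concurrency, read/write separation can spread reads and writes across read-only and write-only database instances in multiple regions, which helps keep the database performant and stable.
+
+- Data transfer between headquarters and branch offices: to manage and analyze data centrally, headquarters usually needs branch offices in different regions to synchronize their data to it promptly. This avoids the management confusion and wrong decisions caused by inconsistent data, and improves management efficiency and decision quality.
+
+- Isolated upgrades: during a cluster upgrade, a rollback to the previous version may become necessary, and traditional upgrade methods often cannot roll back because of incompatible metadata. CCR addresses this by building a standby cluster, upgrading it, and running both clusters side by side for verification. Users can upgrade the clusters one by one, and because CCR does not depend on a specific version, rollback becomes feasible.
+
+### Task categories
+
+CCR supports two categories of tasks: database-level and table-level. A database-level task synchronizes the data of an entire database; a table-level task synchronizes the data of a single table.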
+
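+## Principles and architecture
+
+### Glossary
+
+Source cluster: the originating cluster, where business data is written; requires version 2.0
+
+Target cluster: the destination cluster of cross-cluster synchronization; requires version 2.0
+
+binlog: the change log of the source cluster, including schema and data changes
+
+syncer: a lightweight process
+
+Upstream: the upstream database for a database-level task, or the upstream table for a table-level task.
+
+Downstream: the downstream database for a database-level task, or the downstream table for a table-level task.
+
+### Architecture description
+
+![ccr architecture description](/images/ccr-architecture-description.png)
+
+CCR relies mainly on a lightweight process, the syncer. Syncers fetch binlogs from the source cluster, apply the metadata directly to the target cluster, and notify the target cluster to pull data from the source cluster, which enables both full and incremental migration.
+
+### Sync methods
+
+CCR supports four sync methods:
+
+| Sync method | Principle | Trigger timing |
+|------------|-----------|------------------|
+| Full Sync  | Full backup upstream, restore downstream. | Triggered by the first synchronization or by an operation; see the feature list for the operations. |
+| Partial Sync  | Backup at the upstream table or partition level, restore at the downstream table or partition level. | Triggered by operations; see the feature list. |
+| TXN  | Incremental data synchronization; the downstream starts syncing after the upstream commits. | Triggered by operations; see the feature list. |
+| SQL  | Replay the SQL of upstream operations on the downstream. | Triggered by operations; see the feature list. |
+
+## Usage
+
+Usage is simple: start the syncer service and send it a command, and the syncers take care of the rest.
+
+**1. Deploy the source Doris cluster**
+
+**2. Deploy the target Doris cluster**
+
+**3. Enable binlog on both the source and target clusters by adding the following to fe.conf and be.conf on both clusters:**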
+
+```sql
+enable_feature_binlog=true
+```
+
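+**4. Deploy syncers**
+
+1. Build CCR syncer
+
+    ```shell
+    git clone https://github.com/selectdb/ccr-syncer
+
+    cd ccr-syncer
+
+    bash build.sh <-j NUM_OF_THREAD> <--output SYNCER_OUTPUT_DIR>
+
+    cd SYNCER_OUTPUT_DIR
+    # Contact the relevant people to get a ccr binary package for free
+    ```
+
+2. Start and stop syncer
+
+    ```shell
+    # Start
+    cd bin && sh start_syncer.sh --daemon
+
+    # Stop
+    sh stop_syncer.sh
+    ```
+
+**5. Enable binlog for the database/table to be synchronized in the source cluster**
+
+```shell
+# For whole-database sync, run the following script so that every table in the database has binlog.enable turned on.
+# Either edit host, port, user, password, and db of the source cluster in the script:
+vim shell/enable_db_binlog.sh
+# or pass them on the command line:
+./enable_db_binlog.sh --host $host --port $port --user $user --password $password --db $db
+
+# For single-table sync, only that table's binlog.enable needs to be turned on.
+# Run the following SQL statement on the source cluster:
+ALTER TABLE enable_binlog SET ("binlog.enable" = "true");
+```
+
+**6. Submit a sync task to the syncer**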
+
+```shell
+curl -X POST -H "Content-Type: application/json" -d '{
+    "name": "ccr_test",
+    "src": {
+      "host": "localhost",
+      "port": "9030",
+      "thrift_port": "9020",
+      "user": "root",
+      "password": "",
+      "database": "your_db_name",
+      "table": "your_table_name"
+    },
+    "dest": {
+      "host": "localhost",
+      "port": "9030",
+      "thrift_port": "9020",
+      "user": "root",
+      "password": "",
+      "database": "your_db_name",
+      "table": "your_table_name"
+    }
+}' http://127.0.0.1:9190/create_ccr
+```
+
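+Parameters of the sync task:
+
+```shell
+name: name of the CCR sync task; must be unique
+host, port: host and MySQL (JDBC) port of the corresponding cluster's master FE
+user, password: the identity the syncer uses to open transactions, pull data, and so on
+database, table:
+For database-level sync, fill in your_db_name and leave your_table_name empty
+For table-level sync, fill in both your_db_name and your_table_name
+The name of a sync task submitted to the syncer can be used only once
+```
+
+## Syncer operation manual
+
+### Starting Syncer
+
+Syncer starts according to the configured options and saves a pid file in the default or specified path. The pid file is named `host_port.pid`.
+
+**File structure under the output path**
+
+After compilation, the file structure under the output path looks roughly like this:
+
+```text
+output_dir
+    bin
+        ccr_syncer
+        enable_db_binlog.sh
+        start_syncer.sh
+        stop_syncer.sh
+    db
+        [ccr.db] # Generated after running with the default configuration
+    log
+        [ccr_syncer.log] # Generated after running with the default configuration
+```
+
+:::caution
+**In the rest of this document, start_syncer.sh refers to the start_syncer.sh under this path.**
+:::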
+
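+**Start options**
+
+1. --daemon
+
+Runs Syncer in the background. Defaults to false.
+
+```shell
+bash bin/start_syncer.sh --daemon
+```
+
+2. --db_type
+
+Syncer can currently store its metadata in one of two databases: `sqlite3` (local storage) or `mysql` (local or remote storage).
+
+```shell
+bash bin/start_syncer.sh --db_type mysql
+```
+
+The default value is sqlite3.
+
+When mysql is used to store metadata, Syncer uses `CREATE IF NOT EXISTS` to create a database named `ccr`, in which all ccr-related metadata tables are stored.
+
+3. --db_dir
+
+**This option only takes effect when db uses ****`sqlite3`****.**
+
+This option specifies the name and path of the db file generated by sqlite3.
+
+```shell
+bash bin/start_syncer.sh --db_dir /path/to/ccr.db
+```
+
+The default path is `SYNCER_OUTPUT_DIR/db` and the default file name is `ccr.db`.
+
+4. --db_host & db_port & db_user & db_password
+
+**This option only takes effect when db uses ****`mysql`****.**

Review Comment:
   ```suggestion
   **This option only takes effect when db uses `mysql`.**
   ```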



##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/data-admin/ccr/overview.md:
##########
@@ -0,0 +1,647 @@
+
+3. --db_dir
+
+**This option only takes effect when db uses ****`sqlite3`****.**

Review Comment:
   ```suggestion
   **This option only takes effect when db uses `sqlite3`.**
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org
