[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

GitBox Tue, 19 May 2020 07:47:19 -0700


wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r427361798




##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
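+Spark load uses Spark to preprocess the data to be imported, which improves Doris import performance for large data volumes and saves compute resources on the Doris cluster. It is mainly intended for initial migration and for importing large volumes of data into Doris.
+
+Spark load is an asynchronous import method: the user creates a Spark-type import job over the MySQL protocol and checks the result with `SHOW LOAD`.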
+
+
+
+## Applicable scenarios
+
+* The source data is in a storage system that Spark can access, such as HDFS.
+* The data volume ranges from tens of GB to the TB level.
+
+
+
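+## Glossary
+
+1. Frontend (FE): the metadata and scheduling node of the Doris system. In the import flow it is mainly responsible for scheduling import jobs.
+2. Backend (BE): the compute and storage node of the Doris system. In the import flow it is mainly responsible for writing and storing data.
+3. Spark ETL: mainly responsible for the ETL work in the import flow, including global dictionary building (for the BITMAP type), partitioning, sorting, and aggregation.
+4. Broker: an independent, stateless process. It wraps the file system interface and gives Doris the ability to read files from remote storage systems.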
+
+
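+## Basic principles
+
+### Basic flow
+
+The user submits a Spark-type import job through a MySQL client. The FE records the metadata and tells the user that the submission succeeded.
+
+Execution of a Spark load job is divided into the following 5 stages.
+
+1. The FE schedules and submits the ETL job to the Spark cluster for execution.
+2. The Spark cluster runs the ETL job to preprocess the data to be imported. This includes global dictionary building (for the BITMAP type), partitioning, sorting, and aggregation.
+3. After the ETL job finishes, the FE obtains the data path of each preprocessed shard and schedules the relevant BEs to execute Push tasks.
+4. Each BE reads the data through a Broker and converts it into Doris's underlying storage format.
+5. The FE schedules the version to take effect, which completes the import job.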
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### Global dictionary
+
+To be added.
+
+
+
+### Data preprocessing (DPP)
+
+To be added.
+
+
+
+## Basic operations
+
+### Configuring the ETL cluster
+
+Before submitting a Spark import job, you need to configure the Spark cluster that will run the ETL job.
+
+Syntax:
+
+```sql
+-- Add an ETL cluster
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- Drop an ETL cluster
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- Show ETL clusters
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
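+`cluster_name` is the name of the Spark cluster as configured in Doris.
+
+PROPERTIES holds the ETL cluster parameters, as follows:
+
+- `type`: cluster type. Required. Currently only spark is supported.
+
+- Parameters for a Spark ETL cluster:
+  - `master`: required. Currently supports yarn and spark://host:port.
+  - `deploy_mode`: optional, defaults to cluster. Supports cluster and client.
+  - `hdfs_etl_path`: the HDFS directory used by ETL. Required. For example: hdfs://host:port/tmp/doris.
+  - `broker`: the broker name. Required. The broker must be configured in advance with the `ALTER SYSTEM ADD BROKER` command.
+  - `yarn_configs`: HDFS and YARN parameters. Required when master is yarn. yarn.resourcemanager.address and fs.defaultFS must be specified. Multiple configs are joined with `;`.
+  - `spark_args`: arguments passed when the Spark job is submitted. Optional. See the spark-submit command for details. Each arg must start with `--`, and multiple args are joined with `;`. For example: --files=/file1,/file2;--jars=/a.jar,/b.jar.
+  - `spark_configs`: Spark parameters. Optional. See http://spark.apache.org/docs/latest/configuration.html for the available parameters. Multiple configs are joined with `;`.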
+
+Example:
+
+```sql
+-- yarn cluster mode
+ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
+PROPERTIES
+(
+"type" = "spark", 
+"master" = "yarn",
+"hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
+"broker" = "broker0",

Review comment:
       The broker is used to read the ETL intermediate results in the working_dir, not the user's source data.
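
The property rules in the quoted doc (required keys, `;`-joined lists, `--`-prefixed spark args) can be sketched as a small client-side check. This is only an illustration under stated assumptions: `join_args` and `validate_cluster_properties` are hypothetical helper names that are not part of Doris, each `yarn_configs` entry is assumed to have the form `key=value` (the doc does not spell this out), and the FE may validate these properties differently.

```python
# Hypothetical client-side check of the ETL cluster PROPERTIES described above.
# None of these names exist in Doris; they only mirror the rules in the doc.
REQUIRED_KEYS = ("type", "master", "hdfs_etl_path", "broker")


def join_args(args):
    """Join spark-submit style args with ';', requiring each to start with '--'."""
    for arg in args:
        if not arg.startswith("--"):
            raise ValueError(f"spark arg must start with '--': {arg!r}")
    return ";".join(args)


def validate_cluster_properties(props):
    """Return a list of problems found in a PROPERTIES dict (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if not props.get(key):
            problems.append(f"missing required property: {key}")
    if props.get("type") not in (None, "", "spark"):
        problems.append("type must be 'spark'")
    master = props.get("master", "")
    if master and master != "yarn" and not master.startswith("spark://"):
        problems.append("master must be 'yarn' or 'spark://host:port'")
    if props.get("deploy_mode", "cluster") not in ("cluster", "client"):
        problems.append("deploy_mode must be 'cluster' or 'client'")
    if master == "yarn":
        # Assumption: yarn_configs is a ';'-joined list of key=value pairs.
        items = props.get("yarn_configs", "").split(";")
        keys = {item.split("=", 1)[0].strip() for item in items if item.strip()}
        for needed in ("yarn.resourcemanager.address", "fs.defaultFS"):
            if needed not in keys:
                problems.append(f"yarn_configs must set {needed}")
    for arg in filter(None, props.get("spark_args", "").split(";")):
        if not arg.startswith("--"):
            problems.append(f"spark arg must start with '--': {arg}")
    return problems
```

Calling `validate_cluster_properties` on the four properties shown in the quoted example (`type`, `master`, `hdfs_etl_path`, `broker`) would report the two missing `yarn_configs` entries, since `master` is yarn.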




