This is an automated email from the ASF dual-hosted git repository.

jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 0ffdbb2dcdf spark load example
0ffdbb2dcdf is described below

commit 0ffdbb2dcdfb9d5e5d0e41be0d7a48584740e6ed
Author: jiafeng.zhang <zhang...@gmail.com>
AuthorDate: Fri Aug 12 15:47:54 2022 +0800

    spark load example
---
 .../import/import-way/spark-load-manual.md         | 68 +++++++++++++++++++++
 .../import/import-way/spark-load-manual.md         | 70 ++++++++++++++++++++++
 2 files changed, 138 insertions(+)

diff --git a/docs/data-operate/import/import-way/spark-load-manual.md b/docs/data-operate/import/import-way/spark-load-manual.md
index a1c314ec377..a1ebce9837b 100644
--- a/docs/data-operate/import/import-way/spark-load-manual.md
+++ b/docs/data-operate/import/import-way/spark-load-manual.md
@@ -487,6 +487,73 @@ PROPERTIES
 
 ```
 
+Example 4: Import data from a Hive partitioned table
+
+```sql
+-- hive create table statement
+create table test_partition(
+id int,
+name string,
+age int
+)
+partitioned by (dt string)
+row format delimited fields terminated by ','
+stored as textfile;
+
+-- doris create table statement
+CREATE TABLE IF NOT EXISTS test_partition_04
+(
+dt date,
+id int,
+name string,
+age int
+)
+UNIQUE KEY(`dt`, `id`)
+DISTRIBUTED BY HASH(`id`) BUCKETS 1
+PROPERTIES (
+"replication_allocation" = "tag.location.default: 1"
+);
+-- spark load 
+CREATE EXTERNAL RESOURCE "spark_resource"
+PROPERTIES
+(
+"type" = "spark",
+"spark.master" = "yarn",
+"spark.submit.deployMode" = "cluster",
+"spark.executor.memory" = "1g",
+"spark.yarn.queue" = "default",
+"spark.hadoop.yarn.resourcemanager.address" = "localhost:50056",
+"spark.hadoop.fs.defaultFS" = "hdfs://localhost:9000",
+"working_dir" = "hdfs://localhost:9000/tmp/doris",
+"broker" = "broker_01"
+);
+LOAD LABEL demo.test_hive_partition_table_18
+(
+    DATA INFILE("hdfs://localhost:9000/user/hive/warehouse/demo.db/test/dt=2022-08-01/*")
+    INTO TABLE test_partition_04
+    COLUMNS TERMINATED BY ","
+    FORMAT AS "csv"
+    (id,name,age)
+    COLUMNS FROM PATH AS (`dt`)
+    SET
+    (
+        dt=dt,
+        id=id,
+        name=name,
+        age=age
+    )
+)
+WITH RESOURCE 'spark_resource'
+(
+    "spark.executor.memory" = "1g",
+    "spark.shuffle.compress" = "true"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+```
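
After submitting the load above, you can check its progress with `SHOW LOAD`; a minimal sketch using the label from this example (run it in the `demo` database, or add a `FROM` clause for your own database):

```sql
SHOW LOAD WHERE LABEL = "test_hive_partition_table_18";
```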
+
 You can view the detailed syntax for creating a load by running `HELP SPARK LOAD`. This section mainly introduces the meaning of the parameters and the precautions in the Spark Load creation syntax.
 
 **Label**
@@ -651,6 +718,7 @@ The most suitable scenario to use spark load is that the raw data is in the file
 
 ## FAQ
 
+* Spark Load does not yet support importing into Doris table columns of type String. If any of your table columns are of type String, change them to type varchar; otherwise the import will fail with `type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel` (see the sketch after this list).
 * When using Spark Load, the `HADOOP_CONF_DIR` environment variable is not set in `spark-env.sh`.
 
 If the `HADOOP_CONF_DIR` environment variable is not set, the error `When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment` will be reported.
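
For the String-type limitation in the first FAQ item, here is a minimal sketch of the workaround on the Doris side; the table and column names are hypothetical, and the point is simply to declare text columns as `varchar` rather than `string`:

```sql
-- Hypothetical target table: declare text columns as varchar so Spark Load can write them
CREATE TABLE IF NOT EXISTS example_tbl
(
    id int,
    name varchar(128)
)
UNIQUE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);
```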
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/spark-load-manual.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/spark-load-manual.md
index 6933709791e..443bac0f99f 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/spark-load-manual.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/spark-load-manual.md
@@ -449,6 +449,75 @@ PROPERTIES
 );
 ```
 
+Example 4: Import data from a Hive partitioned table
+
+```sql
+-- hive create table statement
+create table test_partition(
+       id int,
+       name string,
+       age int
+)
+partitioned by (dt string)
+row format delimited fields terminated by ','
+stored as textfile;
+
+-- doris create table statement
+CREATE TABLE IF NOT EXISTS test_partition_04
+(
+       dt date,
+       id int,
+       name string,
+       age int
+)
+UNIQUE KEY(`dt`, `id`)
+DISTRIBUTED BY HASH(`id`) BUCKETS 1
+PROPERTIES (
+       "replication_allocation" = "tag.location.default: 1"
+);
+-- spark load statement
+CREATE EXTERNAL RESOURCE "spark_resource"
+PROPERTIES
+(
+"type" = "spark",
+"spark.master" = "yarn",
+"spark.submit.deployMode" = "cluster",
+"spark.executor.memory" = "1g",
+"spark.yarn.queue" = "default",
+"spark.hadoop.yarn.resourcemanager.address" = "localhost:50056",
+"spark.hadoop.fs.defaultFS" = "hdfs://localhost:9000",
+"working_dir" = "hdfs://localhost:9000/tmp/doris",
+"broker" = "broker_01"
+);
+LOAD LABEL demo.test_hive_partition_table_18
+(
+    DATA INFILE("hdfs://localhost:9000/user/hive/warehouse/demo.db/test/dt=2022-08-01/*")
+    INTO TABLE test_partition_04
+    COLUMNS TERMINATED BY ","
+    FORMAT AS "csv"
+    (id,name,age)
+    COLUMNS FROM PATH AS (`dt`)
+    SET
+    (
+        dt=dt,
+        id=id,
+        name=name,
+        age=age
+    )
+)
+WITH RESOURCE 'spark_resource'
+(
+    "spark.executor.memory" = "1g",
+    "spark.shuffle.compress" = "true"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+```
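
After creating the resource above, you can verify that it is visible before submitting the load; a minimal sketch (viewing it requires the corresponding resource privileges):

```sql
SHOW RESOURCES WHERE NAME = "spark_resource";
```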
+
 Run `HELP SPARK LOAD` to view the detailed syntax for creating a load. This section mainly introduces the meaning of the parameters and the precautions in the Spark Load creation syntax.
 
 **Label**
@@ -603,6 +672,7 @@ LoadFinishTime: 2019-07-27 11:50:16
 
 ## FAQ
 
+- Spark Load does not yet support importing into Doris table columns of type String. If any of your table columns are of type String, change them to type varchar; otherwise the import will fail with `type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel`.
 - When using Spark Load, the `HADOOP_CONF_DIR` environment variable is not set in the Spark client's `spark-env.sh`.
 
 If the `HADOOP_CONF_DIR` environment variable is not set, the error `When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.` will be reported.


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org
