This is an automated email from the ASF dual-hosted git repository.

jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
     new 2827ced1f6 [improvement](doc)Import data example from hive partition table (#11732)
2827ced1f6 is described below

commit 2827ced1f6a3da76c4e47ff3a20eb71360732aba
Author: jiafeng.zhang <zhang...@gmail.com>
AuthorDate: Fri Aug 12 19:38:45 2022 +0800

    [improvement](doc)Import data example from hive partition table (#11732)

    Import data example from hive partition table
---
 .../import/import-way/spark-load-manual.md | 70 ++++++++++++++++++++++
 .../import/import-way/spark-load-manual.md | 70 ++++++++++++++++++++++
 2 files changed, 140 insertions(+)

diff --git a/docs/en/docs/data-operate/import/import-way/spark-load-manual.md b/docs/en/docs/data-operate/import/import-way/spark-load-manual.md
index 49dfc5628b..1d1d47e78f 100644
--- a/docs/en/docs/data-operate/import/import-way/spark-load-manual.md
+++ b/docs/en/docs/data-operate/import/import-way/spark-load-manual.md
@@ -483,6 +483,75 @@ PROPERTIES
 ```
+Example 4: Import data from a Hive partitioned table
+
+```sql
+-- Hive create table statement
+create table test_partition(
+    id int,
+    name string,
+    age int
+)
+partitioned by (dt string)
+row format delimited fields terminated by ','
+stored as textfile;
+
+-- Doris create table statement
+-- (`name` is varchar here because, per the FAQ below, Spark load does not
+-- support Doris columns of type String)
+CREATE TABLE IF NOT EXISTS test_partition_04
+(
+    dt date,
+    id int,
+    name varchar(100),
+    age int
+)
+UNIQUE KEY(`dt`, `id`)
+DISTRIBUTED BY HASH(`id`) BUCKETS 1
+PROPERTIES (
+    "replication_allocation" = "tag.location.default: 1"
+);
+-- Spark load statement
+CREATE EXTERNAL RESOURCE "spark_resource"
+PROPERTIES
+(
+    "type" = "spark",
+    "spark.master" = "yarn",
+    "spark.submit.deployMode" = "cluster",
+    "spark.executor.memory" = "1g",
+    "spark.yarn.queue" = "default",
+    "spark.hadoop.yarn.resourcemanager.address" = "localhost:50056",
+    "spark.hadoop.fs.defaultFS" = "hdfs://localhost:9000",
+    "working_dir" = "hdfs://localhost:9000/tmp/doris",
+    "broker" = "broker_01"
+);
+LOAD LABEL demo.test_hive_partition_table_18
+(
+    DATA INFILE("hdfs://localhost:9000/user/hive/warehouse/demo.db/test/dt=2022-08-01/*")
+    INTO TABLE test_partition_04
+    COLUMNS TERMINATED BY ","
+    FORMAT AS "csv"
+    (id,name,age)
+    COLUMNS FROM PATH AS (`dt`)
+    SET
+    (
+        dt=dt,
+        id=id,
+        name=name,
+        age=age
+    )
+)
+WITH RESOURCE 'spark_resource'
+(
+    "spark.executor.memory" = "1g",
+    "spark.shuffle.compress" = "true"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+```
+
+
 You can view the detailed syntax for creating a load job by running `HELP SPARK LOAD`. This section mainly introduces the meaning of the parameters, and the precautions, in the Spark load creation syntax.
 
 **Label**
@@ -647,6 +716,7 @@ The most suitable scenario to use spark load is that the raw data is in the file
 
 ## FAQ
 
+* Spark load does not yet support importing into Doris table fields of type String. If any of your table fields are of type String, change them to type varchar; otherwise the import will fail with `type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel`
 * When using Spark load, if the `HADOOP_CONF_DIR` environment variable is not set in the Spark client's `spark-env.sh`, the error `When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment` will be reported.
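For a table that was already created with a String column, the FAQ's workaround can be applied with a schema change, and the load job from Example 4 can then be tracked by its label. The following is a minimal SQL sketch of both points, assuming the `test_partition_04` table and `test_hive_partition_table_18` label from Example 4; the `varchar(100)` length is an arbitrary choice, not taken from the original example:

```sql
-- Workaround for the String-type limitation described in the FAQ:
-- switch the String column to varchar before submitting the Spark load job
-- (varchar(100) is an assumed length, pick one that fits your data).
ALTER TABLE test_partition_04 MODIFY COLUMN name varchar(100);

-- Track the load job from Example 4 by its label; the State column moves
-- through PENDING -> ETL -> LOADING -> FINISHED as the job progresses.
SHOW LOAD WHERE LABEL = "test_hive_partition_table_18";
```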
diff --git a/docs/zh-CN/docs/data-operate/import/import-way/spark-load-manual.md b/docs/zh-CN/docs/data-operate/import/import-way/spark-load-manual.md
index d8bc642296..c33fbf96fa 100644
--- a/docs/zh-CN/docs/data-operate/import/import-way/spark-load-manual.md
+++ b/docs/zh-CN/docs/data-operate/import/import-way/spark-load-manual.md
@@ -449,6 +449,75 @@ PROPERTIES
 );
 ```
+Example 4: Import data from a Hive partitioned table
+
+```sql
+-- Hive create table statement
+create table test_partition(
+    id int,
+    name string,
+    age int
+)
+partitioned by (dt string)
+row format delimited fields terminated by ','
+stored as textfile;
+
+-- Doris create table statement
+-- (`name` is varchar here because, per the FAQ below, Spark load does not
+-- support Doris columns of type String)
+CREATE TABLE IF NOT EXISTS test_partition_04
+(
+    dt date,
+    id int,
+    name varchar(100),
+    age int
+)
+UNIQUE KEY(`dt`, `id`)
+DISTRIBUTED BY HASH(`id`) BUCKETS 1
+PROPERTIES (
+    "replication_allocation" = "tag.location.default: 1"
+);
+-- Spark load statement
+CREATE EXTERNAL RESOURCE "spark_resource"
+PROPERTIES
+(
+    "type" = "spark",
+    "spark.master" = "yarn",
+    "spark.submit.deployMode" = "cluster",
+    "spark.executor.memory" = "1g",
+    "spark.yarn.queue" = "default",
+    "spark.hadoop.yarn.resourcemanager.address" = "localhost:50056",
+    "spark.hadoop.fs.defaultFS" = "hdfs://localhost:9000",
+    "working_dir" = "hdfs://localhost:9000/tmp/doris",
+    "broker" = "broker_01"
+);
+LOAD LABEL demo.test_hive_partition_table_18
+(
+    DATA INFILE("hdfs://localhost:9000/user/hive/warehouse/demo.db/test/dt=2022-08-01/*")
+    INTO TABLE test_partition_04
+    COLUMNS TERMINATED BY ","
+    FORMAT AS "csv"
+    (id,name,age)
+    COLUMNS FROM PATH AS (`dt`)
+    SET
+    (
+        dt=dt,
+        id=id,
+        name=name,
+        age=age
+    )
+)
+WITH RESOURCE 'spark_resource'
+(
+    "spark.executor.memory" = "1g",
+    "spark.shuffle.compress" = "true"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+```
+
+
 Run `HELP SPARK LOAD` to view the detailed syntax for creating a load job. This section mainly introduces the meaning of the parameters, and the precautions, in the Spark load creation syntax.
 
 **Label**
@@ -603,6 +672,7 @@ LoadFinishTime: 2019-07-27 11:50:16
 
 ## FAQ
 
+- Spark load does not yet support importing into Doris table fields of type String. If any of your table fields are of type String, change them to type varchar; otherwise the import will fail with `type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel`
 - When using Spark load, if the `HADOOP_CONF_DIR` environment variable is not set in the Spark client's `spark-env.sh`, the error `When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.` will be reported.

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org