This is an automated email from the ASF dual-hosted git repository.

jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
     new 2827ced1f6 [improvement](doc)Import data example from hive partition table (#11732)
2827ced1f6 is described below

commit 2827ced1f6a3da76c4e47ff3a20eb71360732aba
Author: jiafeng.zhang <zhang...@gmail.com>
AuthorDate: Fri Aug 12 19:38:45 2022 +0800

    [improvement](doc)Import data example from hive partition table (#11732)

    Import data example from hive partition table
---
 .../import/import-way/spark-load-manual.md | 70 ++++++++++++++++++++++
 .../import/import-way/spark-load-manual.md | 70 ++++++++++++++++++++++
 2 files changed, 140 insertions(+)

diff --git a/docs/en/docs/data-operate/import/import-way/spark-load-manual.md b/docs/en/docs/data-operate/import/import-way/spark-load-manual.md
index 49dfc5628b..1d1d47e78f 100644
--- a/docs/en/docs/data-operate/import/import-way/spark-load-manual.md
+++ b/docs/en/docs/data-operate/import/import-way/spark-load-manual.md
@@ -483,6 +483,75 @@ PROPERTIES
 ```
+Example 4: Import data from a Hive partitioned table
+
+```sql
+-- Hive create table statement
+create table test_partition(
+    id int,
+    name string,
+    age int
+)
+partitioned by (dt string)
+row format delimited fields terminated by ','
+stored as textfile;
+
+-- Doris create table statement
+-- (`name` is varchar here because, per the FAQ below, Spark load does not
+-- support Doris columns of type String)
+CREATE TABLE IF NOT EXISTS test_partition_04
+(
+    dt date,
+    id int,
+    name varchar(100),
+    age int
+)
+UNIQUE KEY(`dt`, `id`)
+DISTRIBUTED BY HASH(`id`) BUCKETS 1
+PROPERTIES (
+    "replication_allocation" = "tag.location.default: 1"
+);
+-- Spark load statement
+CREATE EXTERNAL RESOURCE "spark_resource"
+PROPERTIES
+(
+    "type" = "spark",
+    "spark.master" = "yarn",
+    "spark.submit.deployMode" = "cluster",
+    "spark.executor.memory" = "1g",
+    "spark.yarn.queue" = "default",
+    "spark.hadoop.yarn.resourcemanager.address" = "localhost:50056",
+    "spark.hadoop.fs.defaultFS" = "hdfs://localhost:9000",
+    "working_dir" = "hdfs://localhost:9000/tmp/doris",
+    "broker" = "broker_01"
+);
+LOAD LABEL demo.test_hive_partition_table_18
+(
+    DATA INFILE("hdfs://localhost:9000/user/hive/warehouse/demo.db/test/dt=2022-08-01/*")
+    INTO TABLE test_partition_04
+    COLUMNS TERMINATED BY ","
+    FORMAT AS "csv"
+    (id,name,age)
+    COLUMNS FROM PATH AS (`dt`)
+    SET
+    (
+        dt=dt,
+        id=id,
+        name=name,
+        age=age
+    )
+)
+WITH RESOURCE 'spark_resource'
+(
+    "spark.executor.memory" = "1g",
+    "spark.shuffle.compress" = "true"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+```
+
+
 You can view the detailed syntax for creating a load job by running `HELP SPARK LOAD`. This section mainly introduces the meaning of the parameters, and the precautions, in the Spark load creation syntax.
 
 **Label**
@@ -647,6 +716,7 @@ The most suitable scenario to use spark load is that the raw data is in the file
 
 ## FAQ
 
+* Spark load does not yet support importing into Doris table fields of type String. If any of your table fields are of type String, change them to type varchar; otherwise the import will fail with `type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel`
 * When using Spark load, if the `HADOOP_CONF_DIR` environment variable is not set in the Spark client's `spark-env.sh`, the error `When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment` will be reported.
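For a table that was already created with a String column, the FAQ's workaround can be applied with a schema change, and the load job from Example 4 can then be tracked by its label. The following is a minimal SQL sketch of both points, assuming the `test_partition_04` table and `test_hive_partition_table_18` label from Example 4; the `varchar(100)` length is an arbitrary choice, not taken from the original example:

```sql
-- Workaround for the String-type limitation described in the FAQ:
-- switch the String column to varchar before submitting the Spark load job
-- (varchar(100) is an assumed length, pick one that fits your data).
ALTER TABLE test_partition_04 MODIFY COLUMN name varchar(100);

-- Track the load job from Example 4 by its label; the State column moves
-- through PENDING -> ETL -> LOADING -> FINISHED as the job progresses.
SHOW LOAD WHERE LABEL = "test_hive_partition_table_18";
```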
diff --git a/docs/zh-CN/docs/data-operate/import/import-way/spark-load-manual.md b/docs/zh-CN/docs/data-operate/import/import-way/spark-load-manual.md
index d8bc642296..c33fbf96fa 100644
--- a/docs/zh-CN/docs/data-operate/import/import-way/spark-load-manual.md
+++ b/docs/zh-CN/docs/data-operate/import/import-way/spark-load-manual.md
@@ -449,6 +449,75 @@ PROPERTIES
 );
 ```
+Example 4: Import data from a Hive partitioned table
+
+```sql
+-- Hive create table statement
+create table test_partition(
+    id int,
+    name string,
+    age int
+)
+partitioned by (dt string)
+row format delimited fields terminated by ','
+stored as textfile;
+
+-- Doris create table statement
+-- (`name` is varchar here because, per the FAQ below, Spark load does not
+-- support Doris columns of type String)
+CREATE TABLE IF NOT EXISTS test_partition_04
+(
+    dt date,
+    id int,
+    name varchar(100),
+    age int
+)
+UNIQUE KEY(`dt`, `id`)
+DISTRIBUTED BY HASH(`id`) BUCKETS 1
+PROPERTIES (
+    "replication_allocation" = "tag.location.default: 1"
+);
+-- Spark load statement
+CREATE EXTERNAL RESOURCE "spark_resource"
+PROPERTIES
+(
+    "type" = "spark",
+    "spark.master" = "yarn",
+    "spark.submit.deployMode" = "cluster",
+    "spark.executor.memory" = "1g",
+    "spark.yarn.queue" = "default",
+    "spark.hadoop.yarn.resourcemanager.address" = "localhost:50056",
+    "spark.hadoop.fs.defaultFS" = "hdfs://localhost:9000",
+    "working_dir" = "hdfs://localhost:9000/tmp/doris",
+    "broker" = "broker_01"
+);
+LOAD LABEL demo.test_hive_partition_table_18
+(
+    DATA INFILE("hdfs://localhost:9000/user/hive/warehouse/demo.db/test/dt=2022-08-01/*")
+    INTO TABLE test_partition_04
+    COLUMNS TERMINATED BY ","
+    FORMAT AS "csv"
+    (id,name,age)
+    COLUMNS FROM PATH AS (`dt`)
+    SET
+    (
+        dt=dt,
+        id=id,
+        name=name,
+        age=age
+    )
+)
+WITH RESOURCE 'spark_resource'
+(
+    "spark.executor.memory" = "1g",
+    "spark.shuffle.compress" = "true"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+```
+
+
 Run `HELP SPARK LOAD` to view the detailed syntax for creating a load job. This section mainly introduces the meaning of the parameters, and the precautions, in the Spark load creation syntax.
 
 **Label**
@@ -603,6 +672,7 @@ LoadFinishTime: 2019-07-27 11:50:16
 
 ## FAQ
 
+- Spark load does not yet support importing into Doris table fields of type String. If any of your table fields are of type String, change them to type varchar; otherwise the import will fail with `type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel`
 - When using Spark load, if the `HADOOP_CONF_DIR` environment variable is not set in the Spark client's `spark-env.sh`, the error `When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.` will be reported.

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org