leesf commented on a change in pull request #3140:
URL: https://github.com/apache/hudi/pull/3140#discussion_r659099744
##########
File path: docs/_docs/1_1_spark_quick_start_guide.md
##########
@@ -300,6 +300,221 @@ spark.
show(100, false)
```
+# Spark SQL Example
+## Setup
+Hudi supports writing and reading data with Spark SQL through the **HoodieSparkSessionExtension** SQL extension.
+```shell
+# spark-sql for spark 3
+spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
+
+# spark-sql for spark 2 with scala 2.11
+spark-sql --packages org.apache.hudi:hudi-spark-bundle_2.11:0.8.0,org.apache.spark:spark-avro_2.11:2.4.4 \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
+
+# spark-sql for spark 2 with scala 2.12
+spark-sql \
+  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:2.4.4 \
+  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
+```
+
+## SQL Syntax
+### DDL
+Hudi supports creating tables with Spark SQL.
+**Create Non-Partitioned Table**
+```sql
+-- create a managed cow table
+create table if not exists h0(
+ id int,
+ name string,
+ price double
+) using hudi
+options (
+ type = 'cow',
+ primaryKey = 'id'
+);
+
+-- create an external mor table
+create table if not exists h1(
+ id int,
+ name string,
+ price double,
+ ts bigint
+) using hudi
+location '/tmp/hudi/h1'
+options (
+ type = 'mor',
+ primaryKey = 'id,name',
+ preCombineField = 'ts'
+);
+
+-- create a non-primary key table
+create table if not exists h2(
+ id int,
+ name string,
+ price double
+) using hudi
+options (
+ type = 'cow'
+);
+```
+**Create Partitioned Table**
+```sql
+create table if not exists h_p0 (
+  id bigint,
+  name string,
+  dt string,
+  hh string
+) using hudi
+location '/tmp/hudi/h_p0'
+options (
+  type = 'cow',
+  primaryKey = 'id'
+)
+partitioned by (dt, hh);
+```
+**Create Table Options**
+
+| Parameter Name | Introduction |
+|------------|--------|
+| primaryKey | The primary key field names of the table, multiple fields separated by commas. |
+| type | The table type to create: type = 'cow' creates a COPY-ON-WRITE table, while type = 'mor' creates a MERGE-ON-READ table. Defaults to 'cow' when this option is not specified. |
+| preCombineField | The pre-combine field of the table. |
+
+## DML
+### MergeInto
+Hudi supports MERGE INTO for both Spark 2 and Spark 3.
+**Syntax**
+```sql
+MERGE INTO tableIdentifier AS target_alias
+USING (sub_query | tableIdentifier) AS source_alias
+ON <merge_condition>
+[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
+[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
+[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
+
+<merge_condition> = A boolean equality condition
+<matched_action> =
+ DELETE |
+ UPDATE SET * |
+ UPDATE SET column1 = expression1 [, column2 = expression2 ...]
+<not_matched_action> =
+ INSERT * |
+ INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
+```
+**Case**
+```sql
+merge into h0 as target
+using (
+ select id, name, price, flag from s
+) source
+on target.id = source.id
+when matched then update set *
+when not matched then insert *
+;
+
+merge into h0
+using (
+ select id, name, price, flag from s
+) source
+on h0.id = source.id
+when matched and source.flag != 'delete' then update set id = source.id, name = source.name, price = source.price * 2
+when matched and source.flag = 'delete' then delete
+when not matched then insert (id,name,price) values(id, name, price)
+;
+```
+
+### Insert
+```sql
+insert into h0 select 1, 'a1', 20;
+
+-- insert static partition
+insert into h_p0 partition(dt = '2021-01-02') select 1, 'a1', '10';
+
+-- insert dynamic partition
+insert into h_p0 select 1, 'a1', '2021-01-02', '10';
+
+-- insert dynamic partition
+insert into h_p1 select 1 as id, 'a1', '2021-01-03' as dt, '19' as hh;
+
+-- insert overwrite table
+insert overwrite table h0 select 1, 'a1', 20;
+```
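+
+Insert overwrite can also target a single partition of a partitioned table. As a sketch (the h_p1 table is assumed to exist, and partition-level overwrite behavior may vary by Hudi version):
+```sql
+-- overwrite one static partition only (a sketch; verify against your Hudi version)
+insert overwrite table h_p1 partition(dt = '2021-01-03', hh = '19') select 2, 'a2';
+```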
+
+### Update
+**Syntax**
+```sql
+UPDATE tableIdentifier SET column = EXPRESSION [, column = EXPRESSION ...] [ WHERE boolExpression ]
+```
+**Case**
+```sql
+update h0 set price = price + 20 where id = 1;
+update h0 set price = price * 2, name = 'a2' where id = 2;
+```
+
+### Delete
+**Syntax**
+```sql
+DELETE FROM tableIdentifier [ WHERE boolExpression ]
+```
+**Case**
+```sql
+delete from h0 where id = 1;
+```
+### AlterTable
+**Syntax**
+```sql
+-- Alter table name
+ALTER TABLE oldTableName RENAME TO newTableName
+
+-- Alter table add columns
+ALTER TABLE tableIdentifier ADD COLUMNS(colAndType (,colAndType)*)
+
+-- Alter table column type
+ALTER TABLE tableIdentifier CHANGE COLUMN colName colName colType
+```
+**Case**
+```sql
+alter table h0 rename to h0_1;
+
+alter table h0_1 add columns(ext0 string);
+
+alter table h0_1 change column id id bigint;
+```
+
+## Set hudi config
+### Use set command
+You can use the **set** command to set Hudi configs, which take effect for the
+whole Spark session scope.
+```sql
+set hoodie.insert.shuffle.parallelism = 100;
+set hoodie.upsert.shuffle.parallelism = 100;
+set hoodie.delete.shuffle.parallelism = 100;
+```
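+
+Other Hudi write configs can be set the same way, e.g. the write operation. This is a sketch; verify the config keys and accepted values against your Hudi version:
+```sql
+-- session-scoped write config (a sketch)
+set hoodie.datasource.write.operation = 'upsert';
+```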
+
+### Set with table options
+You can also set configs in the table options when creating a table. These apply
+only to that table and override configs set by the **set** command.
+```sql
+create table if not exists h3(
+ id bigint,
+ name string,
+ price double
+) using hudi
+options (
+ primaryKey = 'id',
+ type = 'mor',
+ hoodie.index.type = 'GLOBAL_BLOOM'
+);
+```
Review comment:
here what I mean is that we should describe how to set hoodie configs
such as `hoodie.datasource.write.operation` and other configs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]