xushiyan commented on a change in pull request #4269: URL: https://github.com/apache/hudi/pull/4269#discussion_r766367063
########## File path: website/docs/quick-start-guide.md ########## @@ -175,18 +175,163 @@ values={[ </TabItem> <TabItem value="sparksql"> +Spark-sql needs an explicit create table command. + +- Table types: + Both types of hudi tables (CopyOnWrite (COW) and MergeOnRead (MOR)) can be created using spark-sql. Review comment: ```suggestion Both of Hudi's table types (Copy-On-Write (COW) and Merge-On-Read (MOR)) can be created using Spark SQL. ``` ########## File path: website/docs/quick-start-guide.md ########## @@ -175,18 +175,163 @@ values={[ </TabItem> <TabItem value="sparksql"> +Spark-sql needs an explicit create table command. Review comment: ```suggestion Spark SQL needs an explicit create table command. ``` ########## File path: website/docs/quick-start-guide.md ########## @@ -175,18 +175,163 @@ values={[ </TabItem> <TabItem value="sparksql"> +Spark-sql needs an explicit create table command. + +- Table types: + Both types of hudi tables (CopyOnWrite (COW) and MergeOnRead (MOR)) can be created using spark-sql. + + While creating the table, table type can be specified using **type** option. **type = 'cow'** represents COW table, while **type = 'mor'** represents MOR table. + +- Partitioned & Non-Partitioned table: + Users can create a partitioned table or non-partitioned table in spark-sql. Review comment: ```suggestion Users can create a partitioned table or a non-partitioned table in Spark SQL. ``` ########## File path: website/docs/quick-start-guide.md ########## @@ -175,18 +175,163 @@ values={[ </TabItem> <TabItem value="sparksql"> +Spark-sql needs an explicit create table command. + +- Table types: + Both types of hudi tables (CopyOnWrite (COW) and MergeOnRead (MOR)) can be created using spark-sql. + + While creating the table, table type can be specified using **type** option. **type = 'cow'** represents COW table, while **type = 'mor'** represents MOR table. 
+ +- Partitioned & Non-Partitioned table: + Users can create a partitioned table or non-partitioned table in spark-sql. + To create a partitioned table, one needs to use **partitioned by** statement to specify the partition columns to create a partitioned table. + When there is no **partitioned by** statement with create table command, table is considered to be a non-partitioned table. + +- Managed & External table: + In general, spark-sql supports two kinds of tables, namely managed and external. + If one specifies a location using **location** statement or use `create external table` to create table explicitly, it is an external table, else its considered a managed table. + You can read more about external vs managed tables [here](https://sparkbyexamples.com/apache-hive/difference-between-hive-internal-tables-and-external-tables/). + +- Table with primary key: + Users can choose to create a table with primary key as required. Else table is considered a non-primary keyed table. + One needs to set **primaryKey** column in options to create a primary key table. + If you are using any of the built-in key generators in Hudi, likely it is a primary key table. + +Let's go over some of the create table commands. + +**Create a Non-Partitioned Table** + ```sql --- -create table if not exists hudi_table2( - id int, - name string, +-- create a cow table, with default primaryKey 'uuid' and without preCombineField provided +create table hudi_cow_nonpcf_tbl ( + uuid int, + name string, price double +) using hudi; + + +-- create a mor non-partitioned table without preCombineField provided +create table hudi_mor_tbl ( + id int, + name string, + price double, + ts bigint ) using hudi -options ( - type = 'cow' +tblproperties ( + type = 'cow', + primaryKey = 'id', + preCombineField = 'ts' ); ``` +Here is an example of creating an external COW partitioned table. 
+ +**Create Partitioned Table** + +```sql +-- create a partitioned, preCombineField-provided cow table +create table hudi_cow_pt_tbl ( + id bigint, + name string, + ts bigint, + dt string, + hh string +) using hudi +tblproperties ( + type = 'cow', + primaryKey = 'id', + preCombineField = 'ts' + ) +partitioned by (dt, hh) +location '/tmp/hudi/hudi_cow_pt_tbl'; +``` + +**Create Table for an existing Hudi Table** + +We can create a table on an existing hudi table(created with spark-shell or deltastreamer). This is useful to +read/write to/from a pre-existing hudi table. + +```sql +-- create an external hudi table based on an existing path + +-- for non-partitioned table +create table hudi_existing_tbl0 using hudi +location 'file:///tmp/hudi/dataframe_hudi_nonpt_table'; + +-- for partitioned table +create table hudi_existing_tbl1 using hudi +partitioned by (dt, hh) +location 'file:///tmp/hudi/dataframe_hudi_pt_table'; +``` + +:::tip +You don't need to specify schema and any properties except the partitioned columns if existed. Hudi can automatically recognize the schema and configurations. +::: + +**CTAS** + +Hudi supports CTAS(Create Table As Select) on spark sql. <br/> +Note: For better performance to load data to hudi table, CTAS uses the **bulk insert** as the write operation. + +Example CTAS command to create a non-partitioned COW table without preCombineField. + +```sql +-- CTAS: create a non-partitioned cow table without preCombineField +create table hudi_ctas_cow_nonpcf_tbl +using hudi +tblproperties (primaryKey = 'id') +as +select 1 as id, 'a1' as name, 10 as price; +``` + +Example CTAS command to create a partitioned, primary key COW table. 
+ +```sql +-- CTAS: create a partitioned, preCombineField-provided cow table +create table hudi_ctas_cow_pt_tbl +using hudi +tblproperties (type = 'cow', primaryKey = 'id', preCombineField = 'ts') +partitioned by (dt) +as +select 1 as id, 'a1' as name, 10 as price, 1000 as ts, '2021-12-01' as dt; + +``` + +Example CTAS command to load data from another table. + +```sql +# create managed parquet table +create table parquet_mngd using parquet location 'file:///tmp/parquet_dataset/*.parquet'; + +# CTAS by loading data into hudi table +create table hudi_ctas_cow_pt_tbl2 using hudi location 'file:/tmp/hudi/hudi_tbl/' options ( + type = 'cow', + primaryKey = 'id', + preCombineField = 'ts' + ) +partitioned by (datestr) as select * from parquet_mngd; +``` + +**Create Table Properties** + +Users can set table properties while creating a hudi table. Critical options are listed here. + +| Parameter Name | Default | Introduction | +|------------------|--------|------------| +| primaryKey | uuid | The primary key names of the table, multiple fields separated by commas. Same as `hoodie.datasource.write.recordkey.field` | +| preCombineField | | The pre-combine field of the table. Same as `hoodie.datasource.write.precombine.field` | +| type | cow | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. Same as `hoodie.datasource.write.table.type` | + +To set any custom hudi config(like index type, max parquet size, etc), see the "Set hudi config section" . + +:::note +1. Since hudi 0.10.0, `primaryKey` is required to specify. It can align with Hudi datasource writer’s and resolve many behavioural discrepancies reported in previous versions. +2. `primaryKey`, `preCombineField`, `type` is case sensitive. +3. To specify `primaryKey`, `preCombineField`, `type` or other hudi configs, `tblproperties` is a preferred way than `options`. Spark SQL syntax is detailed here. +4. 
A new hudi table created by spark-sql will set `hoodie.table.keygenerator.class` as `org.apache.hudi.keygen.ComplexKeyGenerator`, +`hoodie.datasource.write.hive_style_partitioning` as `true` by default. +::: Review comment: Great notes section here. Do you want to move it to the beginning of the Spark SQL section, below "Table Types", to highlight these points before users scroll a long way down here? ########## File path: website/docs/quick-start-guide.md ########## @@ -175,18 +175,163 @@ values={[ </TabItem> <TabItem value="sparksql"> +Spark-sql needs an explicit create table command. + +- Table types: + Both types of hudi tables (CopyOnWrite (COW) and MergeOnRead (MOR)) can be created using spark-sql. + + While creating the table, table type can be specified using **type** option. **type = 'cow'** represents COW table, while **type = 'mor'** represents MOR table. + +- Partitioned & Non-Partitioned table: + Users can create a partitioned table or non-partitioned table in spark-sql. + To create a partitioned table, one needs to use **partitioned by** statement to specify the partition columns to create a partitioned table. + When there is no **partitioned by** statement with create table command, table is considered to be a non-partitioned table. + +- Managed & External table: + In general, spark-sql supports two kinds of tables, namely managed and external. + If one specifies a location using **location** statement or use `create external table` to create table explicitly, it is an external table, else its considered a managed table. + You can read more about external vs managed tables [here](https://sparkbyexamples.com/apache-hive/difference-between-hive-internal-tables-and-external-tables/). + +- Table with primary key: + Users can choose to create a table with primary key as required. Else table is considered a non-primary keyed table. + One needs to set **primaryKey** column in options to create a primary key table. 
+ If you are using any of the built-in key generators in Hudi, likely it is a primary key table. + +Let's go over some of the create table commands. + +**Create a Non-Partitioned Table** + ```sql --- -create table if not exists hudi_table2( - id int, - name string, +-- create a cow table, with default primaryKey 'uuid' and without preCombineField provided +create table hudi_cow_nonpcf_tbl ( + uuid int, + name string, price double +) using hudi; + + +-- create a mor non-partitioned table without preCombineField provided +create table hudi_mor_tbl ( + id int, + name string, + price double, + ts bigint ) using hudi -options ( - type = 'cow' +tblproperties ( + type = 'cow', + primaryKey = 'id', + preCombineField = 'ts' ); ``` +Here is an example of creating an external COW partitioned table. + +**Create Partitioned Table** + +```sql +-- create a partitioned, preCombineField-provided cow table +create table hudi_cow_pt_tbl ( + id bigint, + name string, + ts bigint, + dt string, + hh string +) using hudi +tblproperties ( + type = 'cow', + primaryKey = 'id', + preCombineField = 'ts' + ) +partitioned by (dt, hh) +location '/tmp/hudi/hudi_cow_pt_tbl'; +``` + +**Create Table for an existing Hudi Table** + +We can create a table on an existing hudi table(created with spark-shell or deltastreamer). This is useful to +read/write to/from a pre-existing hudi table. + +```sql +-- create an external hudi table based on an existing path + +-- for non-partitioned table +create table hudi_existing_tbl0 using hudi +location 'file:///tmp/hudi/dataframe_hudi_nonpt_table'; + +-- for partitioned table +create table hudi_existing_tbl1 using hudi +partitioned by (dt, hh) +location 'file:///tmp/hudi/dataframe_hudi_pt_table'; +``` + +:::tip +You don't need to specify schema and any properties except the partitioned columns if existed. Hudi can automatically recognize the schema and configurations. +::: + +**CTAS** + +Hudi supports CTAS(Create Table As Select) on spark sql. 
<br/> +Note: For better performance to load data to hudi table, CTAS uses the **bulk insert** as the write operation. + +Example CTAS command to create a non-partitioned COW table without preCombineField. + +```sql +-- CTAS: create a non-partitioned cow table without preCombineField +create table hudi_ctas_cow_nonpcf_tbl +using hudi +tblproperties (primaryKey = 'id') +as +select 1 as id, 'a1' as name, 10 as price; +``` + +Example CTAS command to create a partitioned, primary key COW table. + +```sql +-- CTAS: create a partitioned, preCombineField-provided cow table +create table hudi_ctas_cow_pt_tbl +using hudi +tblproperties (type = 'cow', primaryKey = 'id', preCombineField = 'ts') +partitioned by (dt) +as +select 1 as id, 'a1' as name, 10 as price, 1000 as ts, '2021-12-01' as dt; + +``` + +Example CTAS command to load data from another table. + +```sql +# create managed parquet table +create table parquet_mngd using parquet location 'file:///tmp/parquet_dataset/*.parquet'; + +# CTAS by loading data into hudi table +create table hudi_ctas_cow_pt_tbl2 using hudi location 'file:/tmp/hudi/hudi_tbl/' options ( + type = 'cow', + primaryKey = 'id', + preCombineField = 'ts' + ) +partitioned by (datestr) as select * from parquet_mngd; +``` + +**Create Table Properties** + +Users can set table properties while creating a hudi table. Critical options are listed here. + +| Parameter Name | Default | Introduction | +|------------------|--------|------------| +| primaryKey | uuid | The primary key names of the table, multiple fields separated by commas. Same as `hoodie.datasource.write.recordkey.field` | +| preCombineField | | The pre-combine field of the table. Same as `hoodie.datasource.write.precombine.field` | +| type | cow | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. 
Same as `hoodie.datasource.write.table.type` | + +To set any custom hudi config(like index type, max parquet size, etc), see the "Set hudi config section" . + +:::note +1. Since hudi 0.10.0, `primaryKey` is required to specify. It can align with Hudi datasource writer’s and resolve many behavioural discrepancies reported in previous versions. +2. `primaryKey`, `preCombineField`, `type` is case sensitive. +3. To specify `primaryKey`, `preCombineField`, `type` or other hudi configs, `tblproperties` is a preferred way than `options`. Spark SQL syntax is detailed here. +4. A new hudi table created by spark-sql will set `hoodie.table.keygenerator.class` as `org.apache.hudi.keygen.ComplexKeyGenerator`, Review comment: ```suggestion 4. A new hudi table created by Spark SQL will set `hoodie.table.keygenerator.class` as `org.apache.hudi.keygen.ComplexKeyGenerator`, and ``` ########## File path: website/docs/quick-start-guide.md ########## @@ -175,18 +175,163 @@ values={[ </TabItem> <TabItem value="sparksql"> +Spark-sql needs an explicit create table command. + +- Table types: + Both types of hudi tables (CopyOnWrite (COW) and MergeOnRead (MOR)) can be created using spark-sql. + + While creating the table, table type can be specified using **type** option. **type = 'cow'** represents COW table, while **type = 'mor'** represents MOR table. + +- Partitioned & Non-Partitioned table: + Users can create a partitioned table or non-partitioned table in spark-sql. + To create a partitioned table, one needs to use **partitioned by** statement to specify the partition columns to create a partitioned table. + When there is no **partitioned by** statement with create table command, table is considered to be a non-partitioned table. + +- Managed & External table: + In general, spark-sql supports two kinds of tables, namely managed and external. 
+ If one specifies a location using **location** statement or use `create external table` to create table explicitly, it is an external table, else its considered a managed table. + You can read more about external vs managed tables [here](https://sparkbyexamples.com/apache-hive/difference-between-hive-internal-tables-and-external-tables/). + +- Table with primary key: + Users can choose to create a table with primary key as required. Else table is considered a non-primary keyed table. + One needs to set **primaryKey** column in options to create a primary key table. + If you are using any of the built-in key generators in Hudi, likely it is a primary key table. + +Let's go over some of the create table commands. + +**Create a Non-Partitioned Table** + ```sql --- -create table if not exists hudi_table2( - id int, - name string, +-- create a cow table, with default primaryKey 'uuid' and without preCombineField provided +create table hudi_cow_nonpcf_tbl ( + uuid int, + name string, price double +) using hudi; + + +-- create a mor non-partitioned table without preCombineField provided +create table hudi_mor_tbl ( + id int, + name string, + price double, + ts bigint ) using hudi -options ( - type = 'cow' +tblproperties ( + type = 'cow', + primaryKey = 'id', + preCombineField = 'ts' ); ``` +Here is an example of creating an external COW partitioned table. + +**Create Partitioned Table** + +```sql +-- create a partitioned, preCombineField-provided cow table +create table hudi_cow_pt_tbl ( + id bigint, + name string, + ts bigint, + dt string, + hh string +) using hudi +tblproperties ( + type = 'cow', + primaryKey = 'id', + preCombineField = 'ts' + ) +partitioned by (dt, hh) +location '/tmp/hudi/hudi_cow_pt_tbl'; +``` + +**Create Table for an existing Hudi Table** + +We can create a table on an existing hudi table(created with spark-shell or deltastreamer). This is useful to +read/write to/from a pre-existing hudi table. 
+ +```sql +-- create an external hudi table based on an existing path + +-- for non-partitioned table +create table hudi_existing_tbl0 using hudi +location 'file:///tmp/hudi/dataframe_hudi_nonpt_table'; + +-- for partitioned table +create table hudi_existing_tbl1 using hudi +partitioned by (dt, hh) +location 'file:///tmp/hudi/dataframe_hudi_pt_table'; +``` + +:::tip +You don't need to specify schema and any properties except the partitioned columns if existed. Hudi can automatically recognize the schema and configurations. +::: + +**CTAS** + +Hudi supports CTAS(Create Table As Select) on spark sql. <br/> Review comment: ```suggestion Hudi supports CTAS (Create Table As Select) on spark sql. <br/> ``` ########## File path: website/docs/quick-start-guide.md ########## @@ -175,18 +175,163 @@ values={[ </TabItem> <TabItem value="sparksql"> +Spark-sql needs an explicit create table command. + +- Table types: + Both types of hudi tables (CopyOnWrite (COW) and MergeOnRead (MOR)) can be created using spark-sql. + + While creating the table, table type can be specified using **type** option. **type = 'cow'** represents COW table, while **type = 'mor'** represents MOR table. + +- Partitioned & Non-Partitioned table: + Users can create a partitioned table or non-partitioned table in spark-sql. + To create a partitioned table, one needs to use **partitioned by** statement to specify the partition columns to create a partitioned table. + When there is no **partitioned by** statement with create table command, table is considered to be a non-partitioned table. + +- Managed & External table: + In general, spark-sql supports two kinds of tables, namely managed and external. + If one specifies a location using **location** statement or use `create external table` to create table explicitly, it is an external table, else its considered a managed table. 
+ You can read more about external vs managed tables [here](https://sparkbyexamples.com/apache-hive/difference-between-hive-internal-tables-and-external-tables/). + +- Table with primary key: + Users can choose to create a table with primary key as required. Else table is considered a non-primary keyed table. + One needs to set **primaryKey** column in options to create a primary key table. + If you are using any of the built-in key generators in Hudi, likely it is a primary key table. + +Let's go over some of the create table commands. + +**Create a Non-Partitioned Table** + ```sql --- -create table if not exists hudi_table2( - id int, - name string, +-- create a cow table, with default primaryKey 'uuid' and without preCombineField provided +create table hudi_cow_nonpcf_tbl ( + uuid int, + name string, price double +) using hudi; + + +-- create a mor non-partitioned table without preCombineField provided +create table hudi_mor_tbl ( + id int, + name string, + price double, + ts bigint ) using hudi -options ( - type = 'cow' +tblproperties ( + type = 'cow', + primaryKey = 'id', + preCombineField = 'ts' ); ``` +Here is an example of creating an external COW partitioned table. + +**Create Partitioned Table** + +```sql +-- create a partitioned, preCombineField-provided cow table +create table hudi_cow_pt_tbl ( + id bigint, + name string, + ts bigint, + dt string, + hh string +) using hudi +tblproperties ( + type = 'cow', + primaryKey = 'id', + preCombineField = 'ts' + ) +partitioned by (dt, hh) +location '/tmp/hudi/hudi_cow_pt_tbl'; +``` + +**Create Table for an existing Hudi Table** + +We can create a table on an existing hudi table(created with spark-shell or deltastreamer). This is useful to +read/write to/from a pre-existing hudi table. 
+ +```sql +-- create an external hudi table based on an existing path + +-- for non-partitioned table +create table hudi_existing_tbl0 using hudi +location 'file:///tmp/hudi/dataframe_hudi_nonpt_table'; + +-- for partitioned table +create table hudi_existing_tbl1 using hudi +partitioned by (dt, hh) +location 'file:///tmp/hudi/dataframe_hudi_pt_table'; +``` + +:::tip +You don't need to specify schema and any properties except the partitioned columns if existed. Hudi can automatically recognize the schema and configurations. +::: + +**CTAS** + +Hudi supports CTAS(Create Table As Select) on spark sql. <br/> +Note: For better performance to load data to hudi table, CTAS uses the **bulk insert** as the write operation. + +Example CTAS command to create a non-partitioned COW table without preCombineField. + +```sql +-- CTAS: create a non-partitioned cow table without preCombineField +create table hudi_ctas_cow_nonpcf_tbl +using hudi +tblproperties (primaryKey = 'id') +as +select 1 as id, 'a1' as name, 10 as price; +``` + +Example CTAS command to create a partitioned, primary key COW table. + +```sql +-- CTAS: create a partitioned, preCombineField-provided cow table +create table hudi_ctas_cow_pt_tbl +using hudi +tblproperties (type = 'cow', primaryKey = 'id', preCombineField = 'ts') +partitioned by (dt) +as +select 1 as id, 'a1' as name, 10 as price, 1000 as ts, '2021-12-01' as dt; + +``` + +Example CTAS command to load data from another table. + +```sql +# create managed parquet table +create table parquet_mngd using parquet location 'file:///tmp/parquet_dataset/*.parquet'; + +# CTAS by loading data into hudi table +create table hudi_ctas_cow_pt_tbl2 using hudi location 'file:/tmp/hudi/hudi_tbl/' options ( + type = 'cow', + primaryKey = 'id', + preCombineField = 'ts' + ) +partitioned by (datestr) as select * from parquet_mngd; +``` + +**Create Table Properties** + +Users can set table properties while creating a hudi table. Critical options are listed here. 
+ +| Parameter Name | Default | Introduction | +|------------------|--------|------------| +| primaryKey | uuid | The primary key names of the table, multiple fields separated by commas. Same as `hoodie.datasource.write.recordkey.field` | +| preCombineField | | The pre-combine field of the table. Same as `hoodie.datasource.write.precombine.field` | +| type | cow | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. Same as `hoodie.datasource.write.table.type` | + +To set any custom hudi config(like index type, max parquet size, etc), see the "Set hudi config section" . + +:::note +1. Since hudi 0.10.0, `primaryKey` is required to specify. It can align with Hudi datasource writer’s and resolve many behavioural discrepancies reported in previous versions. +2. `primaryKey`, `preCombineField`, `type` is case sensitive. +3. To specify `primaryKey`, `preCombineField`, `type` or other hudi configs, `tblproperties` is a preferred way than `options`. Spark SQL syntax is detailed here. Review comment: ```suggestion 3. To specify `primaryKey`, `preCombineField`, `type` or other hudi configs, `tblproperties` is the preferred way than `options`. Spark SQL syntax is detailed here. ``` ########## File path: website/docs/quick-start-guide.md ########## @@ -175,18 +175,163 @@ values={[ </TabItem> <TabItem value="sparksql"> +Spark-sql needs an explicit create table command. + +- Table types: + Both types of hudi tables (CopyOnWrite (COW) and MergeOnRead (MOR)) can be created using spark-sql. + + While creating the table, table type can be specified using **type** option. **type = 'cow'** represents COW table, while **type = 'mor'** represents MOR table. + +- Partitioned & Non-Partitioned table: + Users can create a partitioned table or non-partitioned table in spark-sql. + To create a partitioned table, one needs to use **partitioned by** statement to specify the partition columns to create a partitioned table. 
+ When there is no **partitioned by** statement with create table command, table is considered to be a non-partitioned table. + +- Managed & External table: + In general, spark-sql supports two kinds of tables, namely managed and external. Review comment: ```suggestion In general, Spark SQL supports two kinds of tables, namely managed and external. ``` ########## File path: website/docs/quick-start-guide.md ########## @@ -175,18 +175,163 @@ values={[ </TabItem> <TabItem value="sparksql"> +Spark-sql needs an explicit create table command. + +- Table types: + Both types of hudi tables (CopyOnWrite (COW) and MergeOnRead (MOR)) can be created using spark-sql. + + While creating the table, table type can be specified using **type** option. **type = 'cow'** represents COW table, while **type = 'mor'** represents MOR table. + +- Partitioned & Non-Partitioned table: + Users can create a partitioned table or non-partitioned table in spark-sql. + To create a partitioned table, one needs to use **partitioned by** statement to specify the partition columns to create a partitioned table. + When there is no **partitioned by** statement with create table command, table is considered to be a non-partitioned table. + +- Managed & External table: + In general, spark-sql supports two kinds of tables, namely managed and external. + If one specifies a location using **location** statement or use `create external table` to create table explicitly, it is an external table, else its considered a managed table. + You can read more about external vs managed tables [here](https://sparkbyexamples.com/apache-hive/difference-between-hive-internal-tables-and-external-tables/). + +- Table with primary key: + Users can choose to create a table with primary key as required. Else table is considered a non-primary keyed table. Review comment: Shall we mention here non-pk table not supported? 
and explain how `uuid` works as the implicit default pk
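
A minimal sketch of what that explanation could show, assuming (as the quoted properties table states) that `primaryKey` defaults to `uuid`. The table names below are hypothetical, and whether the implicit default still applies given note 1 in the diff (`primaryKey` required since 0.10.0) is something the PR author would need to confirm:

```sql
-- Hypothetical table names, for illustration only.

-- No primaryKey is given, so (per the quoted properties table) the default
-- 'uuid' is assumed; the schema therefore carries a column named uuid.
create table hudi_default_pk_tbl (
  uuid int,
  name string,
  price double
) using hudi;

-- Same intent with the key spelled out, which avoids relying on the default.
-- Assumption: this is equivalent to the statement above.
create table hudi_explicit_pk_tbl (
  uuid int,
  name string,
  price double
) using hudi
tblproperties (primaryKey = 'uuid');
```

Placing the two forms side by side in the docs might make it clearer to readers that a table without an explicit `primaryKey` is still keyed, just implicitly.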