[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #5238: [DOC]Add schema evolution doc for sparksql

GitBox Mon, 25 Apr 2022 03:42:21 -0700


xiarixiaoyao commented on code in PR #5238:
URL: https://github.com/apache/hudi/pull/5238#discussion_r857491914



##########
website/docs/quick-start-guide.md:
##########
@@ -1095,6 +1095,178 @@ Currently,  the result of `show partitions` is based on the filesystem table pat
 
 :::
 
+## Schema evolution
+Schema evolution allows users to easily change the current schema of a Hudi table to adapt to data that changes over time.
+As of the 0.11.0 release, Spark SQL (Spark 3.1.x and Spark 3.2.1) DDL support for schema evolution has been added and is experimental.
+
+### Schema Evolution Scenarios
+1) Columns (including nested columns) can be added, deleted, modified, and moved.
+2) Partition columns cannot be evolved.
+3) You cannot add, delete, or perform operations on nested columns of the Array type.
+
+## SparkSQL Schema Evolution and Syntax Description
+Before using schema evolution, set `spark.sql.extensions`. For Spark 3.2.1, `spark.sql.catalog.spark_catalog` also needs to be set.
+```shell
+# Spark SQL for spark 3.1.x
+spark-sql --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.12:3.1.2 \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
+
+# Spark SQL for spark 3.2.1
+spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.12:3.2.1 \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
+--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
+```
+After starting the Spark application, run `set schema.on.read.enable=true` to enable schema evolution.
+
+:::note
+Currently, schema evolution cannot be disabled once it has been enabled.
+
+
+:::
+
+### Adding Columns
+**Syntax**
+```sql
+-- add columns
+ALTER TABLE tableName ADD COLUMNS(col_spec[, col_spec ...])
+```
+**Parameter Description**
+
+| Parameter       | Description                  |
+|-----------------|------------------------------|
+| tableName       | Table name                   |
+| col_spec        | Column specifications, consisting of five fields: *col_name*, *col_type*, *nullable*, *comment*, and *col_position*.|
+
+**col_name** : name of the new column. It is mandatory. To add a sub-column to a nested column, specify the full name of the sub-column in this field.
+
+For example:
+
+1. To add sub-column col1 to a nested struct type column users struct<name: string, age: int>, set this field to users.col1.
+
+2. To add sub-column col1 to a nested map type column member map<string, struct<n: string, a: int>>, set this field to member.value.col1.
+
+**col_type** : type of the new column.
+
+**nullable** : whether the new column can be null. The value can be left empty. Currently this field is not used in Hudi.
+
+**comment** : comment of the new column. The value can be left empty.
+
+**col_position** : position where the new column is added. The value can be *FIRST* or *AFTER* origin_col.
+
+1. If it is set to *FIRST*, the new column will be added as the first column of the table.
+
+2. If it is set to *AFTER* origin_col, the new column will be added after the original column origin_col.
+
+3. The value can be left empty. *FIRST* can be used only when new sub-columns are added to nested columns. Do not use *FIRST* in top-level columns. There are no restrictions on the use of *AFTER*.
+
+**Examples**
+
+```sql
+alter table h0 add columns(ext0 string);
+alter table h0 add columns(new_col int not null comment 'add new column' after col1);
+alter table complex_table add columns(col_struct.col_name string comment 'add new column to a struct col' after col_from_col_struct);
+```
+
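+The nested-column naming rules described above can be sketched as follows. This is only an illustration: it assumes a hypothetical table `nested_demo` that contains the two columns used in the explanation, `users struct<name: string, age: int>` and `member map<string, struct<n: string, a: int>>`, and that schema evolution has already been enabled in the session.
+
+```sql
+-- hypothetical table nested_demo; add sub-column col1 to the struct column users
+alter table nested_demo add columns(users.col1 string comment 'new struct field' after name);
+-- add sub-column col1 to the struct value of the map column member
+alter table nested_demo add columns(member.value.col1 string);
+```
+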
+### Altering Columns
+**Syntax**
+```sql
+-- alter table ... alter column
+ALTER TABLE tableName ALTER [COLUMN] col_old_name TYPE column_type [COMMENT] col_comment [FIRST|AFTER] column_name
+```
+
+**Parameter Description**
+
+| Parameter       | Description                  |
+|-----------------|------------------------------|
+| tableName      | Table name.                   |
+| col_old_name   | Name of the column to be altered.|
+| column_type    | Type of the target column.|
+| col_comment    | Comment of the column.|
+| column_name    | New position to place the target column. For example, *AFTER* **column_name** indicates that the target column is placed after **column_name**.|
+
+
+**Examples**
+
+```sql
+--- Changing the column type
+ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
+
+--- Altering other attributes
+ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
+ALTER TABLE table1 ALTER COLUMN a.b.c FIRST
+ALTER TABLE table1 ALTER COLUMN a.b.c AFTER x
+ALTER TABLE table1 ALTER COLUMN a.b.c DROP NOT NULL
+```
+
+**column type change**
+
+| old_type        | new_type                        |
+|-----------------|---------------------------------|
+| int             | long/float/double/string/decimal|
+| long            | double/string/decimal           |
+| float           | double/string/decimal           |
+| double          | string/decimal                  |
+| decimal         | decimal/string                  |
+| string          | decimal/date                    |
+| date            | string                          |
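+
+As a rough illustration of a few rows of the table above, the statements below widen column types. This is a sketch only: the table `type_demo` and its columns `id int`, `price float`, and `created date` are hypothetical, and none of them is assumed to be a partition column.
+
+```sql
+-- int -> long (bigint in Spark SQL)
+ALTER TABLE type_demo ALTER COLUMN id TYPE bigint;
+-- float -> double
+ALTER TABLE type_demo ALTER COLUMN price TYPE double;
+-- date -> string
+ALTER TABLE type_demo ALTER COLUMN created TYPE string;
+```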

Review Comment:
   Thanks for the reminder, let me deal with it tomorrow.



