This is an automated email from the ASF dual-hosted git repository.
vinoyang pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new db9cb1c [HUDI-1693] Add document about HUDI Flink integration (#2681)
db9cb1c is described below
commit db9cb1c568d4180125dd654e815afa64fa2aeb2a
Author: Danny Chan <[email protected]>
AuthorDate: Wed Mar 17 14:19:50 2021 +0800
[HUDI-1693] Add document about HUDI Flink integration (#2681)
---
docs/_config.yml | 4 +-
docs/_data/navigation.yml | 12 +-
docs/_docs/0_3_migration_guide.cn.md | 2 +-
docs/_docs/0_3_migration_guide.md | 2 +-
...ide.cn.md => 1_1_spark_quick_start_guide.cn.md} | 2 +-
...art_guide.md => 1_1_spark_quick_start_guide.md} | 2 +-
docs/_docs/1_6_flink_quick_start_guide.md | 169 +++++++++++++++++++++
docs/_docs/2_2_writing_data.md | 34 +++++
docs/_docs/2_3_querying_data.md | 45 +++++-
docs/_docs/2_4_configurations.md | 53 +++++++
docs/_layouts/home.html | 2 +-
docs/_pages/contributing.cn.md | 2 +-
docs/_pages/contributing.md | 2 +-
docs/_pages/releases.md | 2 +-
14 files changed, 317 insertions(+), 16 deletions(-)
diff --git a/docs/_config.yml b/docs/_config.yml
index 6d8a6fb..270d8a2 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -12,8 +12,8 @@ version : &version "0.5.1-SNAPSHOT"
previous_docs:
- version: Latest
- en: /docs/quick-start-guide.html
- cn: /cn/docs/quick-start-guide.html
+ en: /docs/spark_quick-start-guide.html
+ cn: /cn/docs/spark_quick-start-guide.html
- version: 0.7.0
en: /docs/0.7.0-quick-start-guide.html
cn: /cn/docs/0.7.0-quick-start-guide.html
diff --git a/docs/_data/navigation.yml b/docs/_data/navigation.yml
index 4054bc8..5803a43 100644
--- a/docs/_data/navigation.yml
+++ b/docs/_data/navigation.yml
@@ -2,7 +2,7 @@
# main links
main:
- title: "Documentation"
- url: /docs/quick-start-guide.html
+ url: /docs/spark_quick-start-guide.html
- title: "Community"
url: /community.html
- title: "Blog"
@@ -20,8 +20,10 @@ docs:
children:
- title: "Overview"
url: /docs/overview.html
- - title: "Quick Start"
- url: /docs/quick-start-guide.html
+ - title: "Quick Start(Spark)"
+ url: /docs/spark_quick-start-guide.html
+ - title: "Quick Start(Flink)"
+ url: /docs/flink-quick-start-guide.html
- title: "Use Cases"
url: /docs/use_cases.html
- title: "Writing Data"
@@ -50,7 +52,7 @@ docs:
cn_main:
- title: "文档"
- url: /cn/docs/quick-start-guide.html
+ url: /cn/docs/spark_quick-start-guide.html
- title: "社区"
url: /cn/community.html
- title: "动态"
@@ -65,7 +67,7 @@ cn_docs:
- title: 入门指南
children:
- title: "快速开始"
- url: /cn/docs/quick-start-guide.html
+ url: /cn/docs/spark_quick-start-guide.html
- title: "使用案例"
url: /cn/docs/use_cases.html
- title: "演讲 & hudi 用户"
diff --git a/docs/_docs/0_3_migration_guide.cn.md b/docs/_docs/0_3_migration_guide.cn.md
index f90229a..95cab06 100644
--- a/docs/_docs/0_3_migration_guide.cn.md
+++ b/docs/_docs/0_3_migration_guide.cn.md
@@ -52,7 +52,7 @@ for partition in [list of partitions in source dataset] {
**Option 3**
Write your own custom logic of how to load an existing dataset into a Hudi
managed one. Please read about the RDD API
- [here](/cn/docs/quick-start-guide.html). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
+ [here](/cn/docs/spark_quick-start-guide.html). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
fired by via `cd hudi-cli && ./hudi-cli.sh`.
```java
diff --git a/docs/_docs/0_3_migration_guide.md b/docs/_docs/0_3_migration_guide.md
index 25c70f6..012abc0 100644
--- a/docs/_docs/0_3_migration_guide.md
+++ b/docs/_docs/0_3_migration_guide.md
@@ -51,7 +51,7 @@ for partition in [list of partitions in source table] {
**Option 3**
Write your own custom logic of how to load an existing table into a Hudi
managed one. Please read about the RDD API
- [here](/docs/quick-start-guide.html). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
+ [here](/docs/spark_quick-start-guide.html). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
fired by via `cd hudi-cli && ./hudi-cli.sh`.
```java
diff --git a/docs/_docs/1_1_quick_start_guide.cn.md b/docs/_docs/1_1_spark_quick_start_guide.cn.md
similarity index 99%
rename from docs/_docs/1_1_quick_start_guide.cn.md
rename to docs/_docs/1_1_spark_quick_start_guide.cn.md
index 9c12c20..dbdb30a 100644
--- a/docs/_docs/1_1_quick_start_guide.cn.md
+++ b/docs/_docs/1_1_spark_quick_start_guide.cn.md
@@ -1,6 +1,6 @@
---
title: "Quick-Start Guide"
-permalink: /cn/docs/quick-start-guide.html
+permalink: /cn/docs/spark_quick-start-guide.html
toc: true
last_modified_at: 2019-12-30T15:59:57-04:00
language: cn
diff --git a/docs/_docs/1_1_quick_start_guide.md b/docs/_docs/1_1_spark_quick_start_guide.md
similarity index 99%
rename from docs/_docs/1_1_quick_start_guide.md
rename to docs/_docs/1_1_spark_quick_start_guide.md
index cccf748..fced71d 100644
--- a/docs/_docs/1_1_quick_start_guide.md
+++ b/docs/_docs/1_1_spark_quick_start_guide.md
@@ -1,6 +1,6 @@
---
title: "Quick-Start Guide"
-permalink: /docs/quick-start-guide.html
+permalink: /docs/spark_quick-start-guide.html
toc: true
last_modified_at: 2019-12-30T15:59:57-04:00
---
diff --git a/docs/_docs/1_6_flink_quick_start_guide.md b/docs/_docs/1_6_flink_quick_start_guide.md
new file mode 100644
index 0000000..c16028a
--- /dev/null
+++ b/docs/_docs/1_6_flink_quick_start_guide.md
@@ -0,0 +1,169 @@
+---
+title: "Quick-Start Guide"
+permalink: /docs/flink-quick-start-guide.html
+toc: true
+last_modified_at: 2020-03-16T11:40:57+08:00
+---
+
+This guide provides a quick peek at Hudi's capabilities using the Flink SQL client. Using Flink SQL, we will walk through
+code snippets that allow you to insert and update a Hudi table of either table type:
+[Copy on Write](/docs/concepts.html#copy-on-write-table) and [Merge On Read](/docs/concepts.html#merge-on-read-table).
+After each write operation we will also show how to read the data snapshot (incremental read is already on the roadmap).
+
+## Setup
+
+We use the [Flink SQL Client](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sqlClient.html) because it's a good
+quick start tool for SQL users.
+
+### Step 1: Download the Flink jar
+Hudi works with Flink 1.11.x. You can follow the instructions [here](https://flink.apache.org/downloads.html) for setting up Flink.
+The hudi-flink-bundle jar is archived with Scala 2.11, so it's recommended to use Flink 1.11 bundled with Scala 2.11.
+
+### Step 2: Start the Flink cluster
+Start a standalone Flink cluster within your Hadoop environment.
+Before you start the cluster, we suggest configuring it as follows:
+
+- in `$FLINK_HOME/conf/flink-conf.yaml`, add the config option `taskmanager.numberOfTaskSlots: 4`
+- in `$FLINK_HOME/conf/workers`, add the item `localhost` four times so that there are 4 workers on the local cluster
+
+Now start the cluster:
+
+```bash
+# HADOOP_HOME is your hadoop root directory after unpacking the binary package.
+export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
+
+# Start the flink standalone cluster
+./bin/start-cluster.sh
+```
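The two config edits suggested above can also be scripted. This is only a sketch: `FLINK_HOME` should point at your Flink distribution root, and the scratch-directory fallback is here only so the snippet runs standalone.

```shell
# Sketch: apply the two suggested config changes before starting the cluster.
# FLINK_HOME should be your Flink distribution root; the mktemp fallback is
# only there so this sketch can run standalone.
FLINK_HOME="${FLINK_HOME:-$(mktemp -d)}"
mkdir -p "$FLINK_HOME/conf"

# 4 slots per TaskManager
echo 'taskmanager.numberOfTaskSlots: 4' >> "$FLINK_HOME/conf/flink-conf.yaml"

# four 'localhost' lines -> 4 workers on the local cluster
for _ in 1 2 3 4; do echo 'localhost' >> "$FLINK_HOME/conf/workers"; done
```

Remember to back up the distribution's config files first, since these commands append to them.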
+
+### Step 3: Start the Flink SQL client
+
+Hudi has a prepared bundle jar for Flink, which should be loaded into the Flink SQL client when it starts up.
+You can build the jar manually under the path `hudi-source-dir/packaging/hudi-flink-bundle`, or download it from the
+[Apache Official Repository](https://repo.maven.apache.org/maven2/org/apache/hudi/hudi-flink-bundle_2.11/).
+
+Now start the SQL CLI:
+
+```bash
+# HADOOP_HOME is your hadoop root directory after unpacking the binary package.
+export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
+
+./bin/sql-client.sh embedded -j .../hudi-flink-bundle_2.1?-*.*.*.jar shell
+```
+
+<div class="notice--info">
+  <h4>Please note the following: </h4>
+<ul>
+  <li>We suggest Hadoop 2.9.x+ because some object stores only gained filesystem implementations after that version</li>
+  <li>The flink-parquet and flink-avro formats are already packaged into the hudi-flink-bundle jar</li>
+</ul>
+</div>
+
+Set up the table name and base path, and operate using SQL for this guide.
+The SQL CLI executes the SQL statements one at a time.
+
+## Insert data
+
+First create a Flink Hudi table, then insert data into it using the SQL `VALUES` clause, as below.
+
+```sql
+-- sets up the result mode to tableau to show the results directly in the CLI
+set execution.result-mode=tableau;
+
+CREATE TABLE t1(
+ uuid VARCHAR(20),
+ name VARCHAR(10),
+ age INT,
+ ts TIMESTAMP(3),
+ `partition` VARCHAR(20)
+)
+PARTITIONED BY (`partition`)
+WITH (
+ 'connector' = 'hudi',
+ 'path' = 'schema://base-path',
+  'table.type' = 'MERGE_ON_READ' -- this creates a MERGE_ON_READ table; the default is COPY_ON_WRITE
+);
+
+-- insert data using values
+INSERT INTO t1 VALUES
+ ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
+ ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
+ ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
+ ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
+ ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
+ ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
+ ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
+ ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
+```
+
+## Query data
+
+```sql
+-- query from the hudi table
+select * from t1;
+```
+
+This query provides snapshot querying of the ingested data.
+Refer to [Table types and queries](/docs/concepts#table-types--queries) for more info on all table types and query types supported.
+{: .notice--info}
+
+## Update data
+
+This is similar to inserting new data.
+
+```sql
+-- this would update the record with key 'id1'
+insert into t1 values
+ ('id1','Danny',27,TIMESTAMP '1970-01-01 00:00:01','par1');
+```
+
+Since the default write operation is `upsert`, inserting rows whose keys already exist updates those records in place.
+[Querying](#query-data) the data again will now show updated records. Each write operation generates a new [commit](/docs/concepts.html)
+denoted by the timestamp. Look for changes in the `_hoodie_commit_time` and `age` fields for the same `_hoodie_record_key`s from the previous commit.
+{: .notice--info}
+
+## Streaming query
+
+Hudi's Flink integration also provides the capability to obtain a stream of records that changed since a given commit timestamp.
+This can be achieved using Hudi's streaming query, providing a start time from which changes need to be streamed.
+There is no need to specify an end time if we want all changes after the given commit (as is the common case).
+
+```sql
+CREATE TABLE t1(
+ uuid VARCHAR(20),
+ name VARCHAR(10),
+ age INT,
+ ts TIMESTAMP(3),
+ `partition` VARCHAR(20)
+)
+PARTITIONED BY (`partition`)
+WITH (
+ 'connector' = 'hudi',
+ 'path' = 'oss://vvr-daily/hudi/t1',
+ 'table.type' = 'MERGE_ON_READ',
+  'read.streaming.enabled' = 'true',  -- this option enables the streaming read
+  'read.streaming.start-commit' = '20210316134557', -- specifies the start commit instant time
+  'read.streaming.check-interval' = '4' -- specifies the check interval for finding new source commits, default 60s
+);
+
+-- Then query the table in stream mode
+select * from t1;
+```
+
+This will give all changes that happened after the `read.streaming.start-commit` commit. The unique thing about this
+feature is that it now lets you author streaming pipelines on batch or streaming data sources.
+{: .notice--info}
+
+## Delete data {#deletes}
+
+When consuming data in a streaming query, the Hudi Flink source can also accept change logs from the underlying data source,
+and can then apply the UPDATEs and DELETEs at the per-row level. You can then sync a near-real-time snapshot on Hudi for all kinds
+of RDBMS sources.
+
+## Where to go from here?
+
+We used Flink here to showcase the capabilities of Hudi. However, Hudi can support multiple table types/query types and
+Hudi tables can be queried from query engines like Hive, Spark, Flink, Presto and much more. We have put together a
+[demo video](https://www.youtube.com/watch?v=VhNgUsxdrD0) that showcases all of this on a Docker-based setup with all
+dependent systems running locally. We recommend you replicate the same setup and run the demo yourself, by following
+the steps [here](/docs/docker_demo.html) to get a taste for it. Also, if you are looking for ways to migrate your existing data
+to Hudi, refer to the [migration guide](/docs/migration_guide.html).
diff --git a/docs/_docs/2_2_writing_data.md b/docs/_docs/2_2_writing_data.md
index 6b51878..affb731 100644
--- a/docs/_docs/2_2_writing_data.md
+++ b/docs/_docs/2_2_writing_data.md
@@ -262,6 +262,40 @@ inputDF.write()
.save(basePath);
```
+## Flink SQL Writer
+The hudi-flink module defines the Flink SQL connector for both the hudi source and sink.
+There are a number of options available for the sink table:
+
+| Option Name | Required | Default | Remarks |
+| ----------- | ------- | ------- | ------- |
+| path | Y | N/A | Base path for the target hoodie table. The path would be created if it does not exist; otherwise a hudi table is expected to have been initialized there |
+| table.type | N | COPY_ON_WRITE | Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ |
+| write.operation | N | upsert | The write operation that this write should do (insert or upsert is supported) |
+| write.precombine.field | N | ts | Field used in pre-combining before the actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..) |
+| write.payload.class | N | OverwriteWithLatestAvroPayload.class | Payload class used. Override this if you want to roll your own merge logic when upserting/inserting. This will render any value set for the option ineffective |
+| write.insert.drop.duplicates | N | false | Flag to indicate whether to drop duplicates upon insert. By default insert will accept duplicates, to gain extra performance |
+| write.ignore.failed | N | true | Flag to indicate whether to ignore any non-exception error (e.g. a writestatus error) within a checkpoint batch. By default true (in favor of streaming progressing over data integrity) |
+| hoodie.datasource.write.recordkey.field | N | uuid | Record key field. Value to be used as the `recordKey` component of `HoodieKey`. The actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using dot notation, e.g.: `a.b.c` |
+| hoodie.datasource.write.keygenerator.class | N | SimpleAvroKeyGenerator.class | Key generator class that will extract the key out of the incoming record |
+| write.tasks | N | 4 | Parallelism of the tasks that do the actual write, default is 4 |
+| write.batch.size.MB | N | 128 | Batch buffer size in MB to flush data into the underlying filesystem |
+
+If the table type is MERGE_ON_READ, you can also specify the asynchronous compaction strategy through these options:
+
+| Option Name | Required | Default | Remarks |
+| ----------- | ------- | ------- | ------- |
+| compaction.async.enabled | N | true | Async compaction, enabled by default for MOR |
+| compaction.trigger.strategy | N | num_commits | Strategy to trigger compaction; options are 'num_commits': trigger compaction when N delta commits are reached; 'time_elapsed': trigger compaction when time elapsed > N seconds since the last compaction; 'num_and_time': trigger compaction when both NUM_COMMITS and TIME_ELAPSED are satisfied; 'num_or_time': trigger compaction when either NUM_COMMITS or TIME_ELAPSED is satisfied. Default is 'num_commits' |
+| compaction.delta_commits | N | 5 | Max delta commits needed to trigger compaction, default 5 commits |
+| compaction.delta_seconds | N | 3600 | Max delta seconds needed to trigger compaction, default 1 hour |
+
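For illustration, the write and compaction options above can be combined in a single sink DDL. This is only a sketch: the base path and column names are placeholders, and every option shown is described in the tables above.

```sql
CREATE TABLE hudi_sink(
  uuid VARCHAR(20),
  name VARCHAR(10),
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'schema://base-path',        -- placeholder base path
  'table.type' = 'MERGE_ON_READ',
  'write.operation' = 'upsert',         -- the default
  'write.precombine.field' = 'ts',
  'write.tasks' = '4',
  'compaction.async.enabled' = 'true',
  'compaction.trigger.strategy' = 'num_commits',
  'compaction.delta_commits' = '5'
);
```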
+You can write the data using the SQL `INSERT INTO` statement:
+```sql
+INSERT INTO hudi_table select ... from ...;
+```
+
+**Note**: INSERT OVERWRITE is not supported yet but is already on the roadmap.
+
## Key Generation
Hudi maintains hoodie keys (record key + partition path) for uniquely
identifying a particular record. Key generator class will extract these out of
incoming record. Both the tools above have configs to specify the
diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 0af3418..ed6752d 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -40,6 +40,7 @@ Following tables show whether a given query is supported on specific query engin
|**Hive**|Y|Y|
|**Spark SQL**|Y|Y|
|**Spark Datasource**|Y|Y|
+|**Flink SQL**|Y|N|
|**PrestoDB**|Y|N|
|**Impala**|Y|N|
@@ -53,6 +54,7 @@ Note that `Read Optimized` queries are not applicable for COPY_ON_WRITE tables.
|**Hive**|Y|Y|Y|
|**Spark SQL**|Y|Y|Y|
|**Spark Datasource**|Y|N|Y|
+|**Flink SQL**|Y|Y|Y|
|**PrestoDB**|Y|N|Y|
|**Impala**|N|N|Y|
@@ -165,7 +167,7 @@ hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show()
```
-For examples, refer to [Setup spark-shell in quickstart](/docs/quick-start-guide.html#setup-spark-shell).
+For examples, refer to [Setup spark-shell in quickstart](/docs/spark_quick-start-guide.html#setup-spark-shell).
Please refer to [configurations](/docs/configurations.html#spark-datasource) section, to view all datasource options.
Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing.
@@ -176,6 +178,47 @@ Additionally, `HoodieReadClient` offers the following functionality using Hudi's
| filterExists() | Filter out already existing records from the provided `RDD[HoodieRecord]`. Useful for de-duplication |
| checkExists(keys) | Check if the provided keys exist in a Hudi table |
+## Flink SQL
+Once the Flink Hudi tables have been registered in the Flink catalog, they can be queried using Flink SQL. It supports all query types across both Hudi table types,
+relying on the custom Hudi input formats, again like Hive. Typically, notebook users and Flink SQL CLI users leverage Flink SQL for querying Hudi tables. Please add the hudi-flink-bundle as described above via the `-j` flag.
+
+By default, Flink SQL will try to use its own parquet reader instead of the Hive SerDe when reading from Hive metastore parquet tables.
+
+```bash
+# HADOOP_HOME is your hadoop root directory after unpacking the binary package.
+export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
+
+./bin/sql-client.sh embedded -j .../hudi-flink-bundle_2.1?-*.*.*.jar shell
+```
+
+```sql
+-- this defines a COPY_ON_WRITE table named 't1'
+CREATE TABLE t1(
+ uuid VARCHAR(20),
+ name VARCHAR(10),
+ age INT,
+ ts TIMESTAMP(3),
+ `partition` VARCHAR(20)
+)
+PARTITIONED BY (`partition`)
+WITH (
+ 'connector' = 'hudi',
+ 'path' = 'schema://base-path'
+);
+
+-- query the data
+select * from t1 where `partition` = 'par1';
+```
+
+Flink's built-in parquet support is used for both COPY_ON_WRITE and MERGE_ON_READ tables;
+additionally, partition pruning is applied by the Flink engine internally if a partition path is specified
+in the filter. Filter push down is not supported yet (it is already on the roadmap).
+
+For MERGE_ON_READ tables, in order to query the Hudi table as a stream, you need to add the option `'read.streaming.enabled' = 'true'`.
+When querying the table, a Flink streaming pipeline starts and does not end until the user cancels the job manually.
+You can specify the start commit with the option `read.streaming.start-commit` and the source monitoring interval with the option
+`read.streaming.check-interval`.
+
## PrestoDB
PrestoDB is a popular query engine, providing interactive query performance.
PrestoDB currently supports snapshot querying on COPY_ON_WRITE tables.
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index ff25998..ec35e64 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -11,6 +11,7 @@ This page covers the different ways of configuring your job
to write/read Hudi t
At a high level, you can control behaviour at few levels.
- **[Spark Datasource Configs](#spark-datasource)** : These configs control
the Hudi Spark Datasource, providing ability to define keys/partitioning, pick
out the write operation, specify how to merge records or choosing query type to
read.
+- **[Flink SQL Configs](#flink-options)** : These configs control the Hudi Flink SQL source/sink connectors, providing the ability to define record keys, pick out the write operation, specify how to merge records, enable/disable asynchronous compaction, or choose the query type to read.
- **[WriteClient Configs](#writeclient-configs)** : Internally, the Hudi
datasource uses a RDD based `HoodieWriteClient` api to actually perform writes
to storage. These configs provide deep control over lower level aspects like
file sizing, compression, parallelism, compaction, write schema, cleaning
etc. Although Hudi provides sane defaults, from time-time these configs may
need to be tweaked to optimize for specific workloads.
- **[RecordPayload Config](#PAYLOAD_CLASS_OPT_KEY)** : This is the lowest
level of customization offered by Hudi. Record payloads define how to produce
new values to upsert based on incoming new record and
@@ -171,6 +172,58 @@ Property: `hoodie.datasource.read.end.instanttime`,
Default: latest instant (i.e
Property: `hoodie.datasource.read.schema.use.end.instanttime`, Default: false
<br/>
<span style="color:grey"> Uses end instant schema when incrementally fetched
data to. Default: users latest instant schema. </span>
+## Flink SQL Config Options {#flink-options}
+
+Flink jobs using SQL can be configured through the options in the `WITH` clause.
+The actual datasource-level configs are listed below.
+
+### Write Options
+
+| Option Name | Required | Default | Remarks |
+| ----------- | ------- | ------- | ------- |
+| `path` | Y | N/A | <span style="color:grey"> Base path for the target hoodie table. The path would be created if it does not exist; otherwise a hudi table is expected to have been initialized there </span> |
+| `table.type` | N | COPY_ON_WRITE | <span style="color:grey"> Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ </span> |
+| `write.operation` | N | upsert | <span style="color:grey"> The write operation that this write should do (insert or upsert is supported) </span> |
+| `write.precombine.field` | N | ts | <span style="color:grey"> Field used in pre-combining before the actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..) </span> |
+| `write.payload.class` | N | OverwriteWithLatestAvroPayload.class | <span style="color:grey"> Payload class used. Override this if you want to roll your own merge logic when upserting/inserting. This will render any value set for the option ineffective </span> |
+| `write.insert.drop.duplicates` | N | false | <span style="color:grey"> Flag to indicate whether to drop duplicates upon insert. By default insert will accept duplicates, to gain extra performance </span> |
+| `write.ignore.failed` | N | true | <span style="color:grey"> Flag to indicate whether to ignore any non-exception error (e.g. a writestatus error) within a checkpoint batch. By default true (in favor of streaming progressing over data integrity) </span> |
+| `hoodie.datasource.write.recordkey.field` | N | uuid | <span style="color:grey"> Record key field. Value to be used as the `recordKey` component of `HoodieKey`. The actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using dot notation, e.g.: `a.b.c` </span> |
+| `hoodie.datasource.write.keygenerator.class` | N | SimpleAvroKeyGenerator.class | <span style="color:grey"> Key generator class that will extract the key out of the incoming record </span> |
+| `write.tasks` | N | 4 | <span style="color:grey"> Parallelism of the tasks that do the actual write, default is 4 </span> |
+| `write.batch.size.MB` | N | 128 | <span style="color:grey"> Batch buffer size in MB to flush data into the underlying filesystem </span> |
+
+If the table type is MERGE_ON_READ, you can also specify the asynchronous compaction strategy through these options:
+
+| Option Name | Required | Default | Remarks |
+| ----------- | ------- | ------- | ------- |
+| `compaction.async.enabled` | N | true | <span style="color:grey"> Async compaction, enabled by default for MOR </span> |
+| `compaction.trigger.strategy` | N | num_commits | <span style="color:grey"> Strategy to trigger compaction; options are 'num_commits': trigger compaction when N delta commits are reached; 'time_elapsed': trigger compaction when time elapsed > N seconds since the last compaction; 'num_and_time': trigger compaction when both NUM_COMMITS and TIME_ELAPSED are satisfied; 'num_or_time': trigger compaction when either NUM_COMMITS or TIME_ELAPSED is satisfied. Default is 'num_commits' </span> |
+| `compaction.delta_commits` | N | 5 | <span style="color:grey"> Max delta commits needed to trigger compaction, default 5 commits </span> |
+| `compaction.delta_seconds` | N | 3600 | <span style="color:grey"> Max delta seconds needed to trigger compaction, default 1 hour </span> |
+
+### Read Options
+
+| Option Name | Required | Default | Remarks |
+| ----------- | ------- | ------- | ------- |
+| `path` | Y | N/A | <span style="color:grey"> Base path for the target hoodie table. The path would be created if it does not exist; otherwise a hudi table is expected to have been initialized there </span> |
+| `table.type` | N | COPY_ON_WRITE | <span style="color:grey"> Type of the table. COPY_ON_WRITE (or) MERGE_ON_READ </span> |
+| `read.tasks` | N | 4 | <span style="color:grey"> Parallelism of the tasks that do the actual read, default is 4 </span> |
+| `read.avro-schema.path` | N | N/A | <span style="color:grey"> Avro schema file path; the parsed schema is used for deserialization. If not specified, the avro schema is inferred from the table DDL </span> |
+| `read.avro-schema` | N | N/A | <span style="color:grey"> Avro schema string; the parsed schema is used for deserialization. If not specified, the avro schema is inferred from the table DDL </span> |
+| `hoodie.datasource.query.type` | N | snapshot | <span style="color:grey"> Decides how data files are read, in 1) snapshot mode (obtain the latest view, based on row & columnar data); 2) incremental mode (new data since an instantTime), not supported yet; 3) read optimized mode (obtain the latest view, based on columnar data). Default: snapshot </span> |
+| `hoodie.datasource.merge.type` | N | payload_combine | <span style="color:grey"> For snapshot queries on a merge-on-read table. Use this key to define how the payloads are merged, in 1) skip_merge: read the base file records plus the log file records; 2) payload_combine: read the base file records first, and for each record in the base file, check whether the key is in the log file records (combining the two records with the same key for base and log file records), then read the left log file records < [...]
+| `hoodie.datasource.hive_style_partition` | N | false | <span style="color:grey"> Whether the partition path uses Hive style, e.g. '{partition key}={partition value}', default false </span> |
+| `read.utc-timezone` | N | true | <span style="color:grey"> Use the UTC timezone or the local timezone for the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use the local timezone, but Hive 3.x uses the UTC timezone. By default true </span> |
+
+If the table type is MERGE_ON_READ, streaming read is supported through these options:
+
+| Option Name | Required | Default | Remarks |
+| ----------- | ------- | ------- | ------- |
+| `read.streaming.enabled` | N | false | <span style="color:grey"> Whether to read as a streaming source, default false </span> |
+| `read.streaming.check-interval` | N | 60 | <span style="color:grey"> Check interval in seconds for the streaming read, default 1 minute </span> |
+| `read.streaming.start-commit` | N | N/A | <span style="color:grey"> Start commit instant for the streaming read; the commit time format should be 'yyyyMMddHHmmss'. By default, reading starts from the latest instant </span> |
+
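As an illustrative sketch, the streaming read options above combine in a source DDL like the following. The path, schema, and start commit are placeholders; each option shown is documented in the tables above.

```sql
CREATE TABLE hudi_stream_source(
  uuid VARCHAR(20),
  name VARCHAR(10),
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'schema://base-path',                    -- placeholder base path
  'table.type' = 'MERGE_ON_READ',                   -- streaming read requires MOR
  'read.streaming.enabled' = 'true',
  'read.streaming.start-commit' = '20210316134557', -- placeholder instant, 'yyyyMMddHHmmss'
  'read.streaming.check-interval' = '4'             -- poll for new commits every 4 seconds
);

-- a plain SELECT then runs as an unbounded streaming query until cancelled
select * from hudi_stream_source;
```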
## WriteClient Configs {#writeclient-configs}
Jobs programming directly against the RDD level apis can build a
`HoodieWriteConfig` object and pass it in to the `HoodieWriteClient`
constructor.
diff --git a/docs/_layouts/home.html b/docs/_layouts/home.html
index a5463a1..bbbfb14 100644
--- a/docs/_layouts/home.html
+++ b/docs/_layouts/home.html
@@ -20,7 +20,7 @@ layout: home
<p class="page__lead">{{ page.excerpt }}</p>
<p>
-        <a href="/docs/quick-start-guide.html" class="btn btn--light-outline btn--large"><i class="fa fa-paper-plane"></i> Get Started</a>
+        <a href="/docs/spark_quick-start-guide.html" class="btn btn--light-outline btn--large"><i class="fa fa-paper-plane"></i> Get Started</a>
</p>
</div>
</div>
diff --git a/docs/_pages/contributing.cn.md b/docs/_pages/contributing.cn.md
index ca4b011..6c4e4e6 100644
--- a/docs/_pages/contributing.cn.md
+++ b/docs/_pages/contributing.cn.md
@@ -25,7 +25,7 @@ To contribute code, you need
To contribute, you would need to do the following
- - Fork the Hudi code on Github & then clone your own fork locally. Once cloned, we recommend building as per instructions on [quickstart](/docs/quick-start-guide.html)
+ - Fork the Hudi code on Github & then clone your own fork locally. Once cloned, we recommend building as per instructions on [spark quickstart](/docs/spark_quick-start-guide.html) or [flink quickstart](/docs/flink-quick-start-guide.html)
- [Recommended] We have embraced the code style largely based on [google
format](https://google.github.io/styleguide/javaguide.html). Please setup your
IDE with style files from
[here](https://github.com/apache/hudi/tree/master/style).
These instructions have been tested on IntelliJ.
- [Recommended] Set up the [Save Action
Plugin](https://plugins.jetbrains.com/plugin/7642-save-actions) to auto format
& organize imports on save. The Maven Compilation life-cycle will fail if there
are checkstyle violations.
diff --git a/docs/_pages/contributing.md b/docs/_pages/contributing.md
index 2d30563..c8147f3 100644
--- a/docs/_pages/contributing.md
+++ b/docs/_pages/contributing.md
@@ -24,7 +24,7 @@ To contribute code, you need
To contribute, you would need to do the following
-- Fork the Hudi code on Github & then clone your own fork locally. Once cloned, we recommend building as per instructions on [quickstart](/docs/quick-start-guide.html)
+- Fork the Hudi code on Github & then clone your own fork locally. Once cloned, we recommend building as per instructions on [spark quickstart](/docs/spark_quick-start-guide.html) or [flink quickstart](/docs/flink-quick-start-guide.html)
- \[Recommended\] We have embraced the code style largely based on [google
format](https://google.github.io/styleguide/javaguide.html). Please setup your
IDE with style files from [\<project
root\>/style/](https://github.com/apache/hudi/tree/master/style). These
instructions have been tested on IntelliJ.
diff --git a/docs/_pages/releases.md b/docs/_pages/releases.md
index 7aa98ec..d2212c5 100644
--- a/docs/_pages/releases.md
+++ b/docs/_pages/releases.md
@@ -66,7 +66,7 @@ Specifically, the `HoodieFlinkStreamer` allows for Hudi Copy-On-Write table to b
  derived/ETL pipelines similar to data [sensors](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/sensors/index.html) in Apache Airflow.
- **Insert Overwrite/Insert Overwrite Table**: We have added these two new
write operation types, predominantly to help existing batch ETL jobs, which
typically overwrite entire
tables/partitions each run. These operations are much cheaper, than having
to issue upserts, given they are bulk replacing the target table.
-  Check [here](/docs/quick-start-guide.html#insert-overwrite-table) for examples.
+  Check [here](/docs/spark_quick-start-guide.html#insert-overwrite-table) for examples.
- **Delete Partition**: For users of WriteClient/RDD level apis, we have added
an API to delete an entire partition, again without issuing deletes at the
record level.
- The current default `OverwriteWithLatestAvroPayload` will overwrite the
value in storage, even if for e.g the upsert was reissued for an older value of
the key.
Added a new `DefaultHoodieRecordPayload` and a new payload config
`hoodie.payload.ordering.field` helps specify a field, that the incoming upsert
record can be compared with