This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 7374326  [HUDI-3230] Add streaming read for flink document (#4571)
7374326 is described below

commit 7374326ccc160367c566e7052eab34f6b0ee556e
Author: Danny Chan <yuzhao....@gmail.com>
AuthorDate: Wed Jan 12 17:05:35 2022 +0800

    [HUDI-3230] Add streaming read for flink document (#4571)
---
 website/docs/flink-quick-start-guide.md            | 11 ++++++-----
 website/docs/hoodie_deltastreamer.md               | 22 ++++++++++++++++++++--
 .../version-0.10.0/flink-quick-start-guide.md      | 11 ++++++-----
 .../version-0.10.0/hoodie_deltastreamer.md         | 22 ++++++++++++++++++++--
 4 files changed, 52 insertions(+), 14 deletions(-)

diff --git a/website/docs/flink-quick-start-guide.md 
b/website/docs/flink-quick-start-guide.md
index d5dd05d..323acad 100644
--- a/website/docs/flink-quick-start-guide.md
+++ b/website/docs/flink-quick-start-guide.md
@@ -4,15 +4,16 @@ toc: true
 last_modified_at: 2020-08-12T15:19:57+08:00
 ---
 
-This guide provides a document at Hudi's capabilities using Flink SQL. We can 
feel the unique charm of Flink stream computing engine on Hudi.
-Reading this guide, you can quickly start using Flink to write to(read from) 
Hudi, have a deeper understanding of configuration and optimization:
+This guide provides instructions for the Flink Hudi integration, showing how Flink brings the power of streaming into Hudi.
+Reading this guide, you can quickly start using Flink on Hudi and learn the different modes for reading and writing Hudi with Flink:
 
 - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly 
Flink sql client to write to(read from) Hudi.
-- **Configuration** : For [Flink 
Configuration](flink_configuration#global-configurations), sets up through 
`$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through 
[Table Option](flink_configuration#table-options).
-- **Writing Data** : Flink supports different writing data use cases, such as 
[CDC Ingestion](hoodie_deltastreamer#cdc-ingestion), [Bulk 
Insert](hoodie_deltastreamer#bulk-insert), [Index 
Bootstrap](hoodie_deltastreamer#index-bootstrap), [Changelog 
Mode](hoodie_deltastreamer#changelog-mode) and [Append 
Mode](hoodie_deltastreamer#append-mode).
-- **Querying Data** : Flink supports different querying data use cases, such 
as [Incremental Query](hoodie_deltastreamer#incremental-query), [Hive 
Query](syncing_metastore#flink-setup), [Presto 
Query](query_engine_setup#prestodb).
+- **Configuration** : [Global Configuration](flink_configuration#global-configurations) is set up through `$FLINK_HOME/conf/flink-conf.yaml`. Per-job configuration is set up through [Table Options](flink_configuration#table-options).
+- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](hoodie_deltastreamer#bulk-insert), [Index Bootstrap](hoodie_deltastreamer#index-bootstrap), [Changelog Mode](hoodie_deltastreamer#changelog-mode) and [Append Mode](hoodie_deltastreamer#append-mode).
+- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](hoodie_deltastreamer#streaming-query) and [Incremental Query](hoodie_deltastreamer#incremental-query).
 - **Tuning** : For write/read tasks, this guide gives some tuning suggestions, 
such as [Memory Optimization](flink_configuration#memory-optimization) and 
[Write Rate Limit](flink_configuration#write-rate-limit).
 - **Optimization**: Offline compaction is supported [Offline 
Compaction](compaction#flink-offline-compaction).
+- **Query Engines**: Besides Flink, many other engines are integrated: [Hive 
Query](syncing_metastore#flink-setup), [Presto 
Query](query_engine_setup#prestodb).
 
 ## Quick Start
 
diff --git a/website/docs/hoodie_deltastreamer.md 
b/website/docs/hoodie_deltastreamer.md
index a979788..f212f57 100644
--- a/website/docs/hoodie_deltastreamer.md
+++ b/website/docs/hoodie_deltastreamer.md
@@ -462,6 +462,24 @@ There are many use cases that user put the full history 
data set onto the messag
 |  -----------  | -------  | ------- | ------- |
 | `write.rate.limit` | `false` | `0` | Default disable the rate limit |
 
+### Streaming Query
+By default, the hoodie table is read as a batch source, that is, the latest snapshot data set is read and returned. Turn on streaming read
+mode by setting option `read.streaming.enabled` to `true`. Set option `read.start-commit` to specify the read start offset; specify the
+value as `earliest` if you want to consume all of the historical data set.
+
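+For example, a minimal streaming read can be sketched with the following Flink SQL; the table name, schema and path here are only illustrative:
+
+```sql
+-- an illustrative MERGE_ON_READ table declared for streaming read
+CREATE TABLE hudi_table (
+  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
+  name VARCHAR(10),
+  ts   TIMESTAMP(3)
+) WITH (
+  'connector' = 'hudi',
+  'path' = 'file:///tmp/hudi_table',    -- illustrative table path
+  'table.type' = 'MERGE_ON_READ',
+  'read.streaming.enabled' = 'true',    -- turns on the streaming read
+  'read.start-commit' = 'earliest'      -- consume from the earliest commit
+);
+
+-- the query runs continuously and emits new commits as they arrive
+SELECT * FROM hudi_table;
+```
+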
+#### Options
+|  Option Name  | Required | Default | Remarks |
+|  -----------  | -------  | ------- | ------- |
+| `read.streaming.enabled` | `false` | `false` | Specify `true` to read in streaming mode |
+| `read.start-commit` | `false` | the latest commit | Start commit time in the format 'yyyyMMddHHmmss'; use `earliest` to consume from the earliest commit |
+| `read.streaming.skip_compaction` | `false` | `false` | Whether to skip compaction commits while reading, generally for two purposes: 1) to avoid consuming duplicates from the compaction instants; 2) when changelog mode is enabled, to consume only the change logs for correct semantics |
+| `clean.retain_commits` | `false` | `10` | The max number of commits to retain before cleaning. When changelog mode is enabled, tweak this option to adjust how long the change logs are kept. For example, the default strategy keeps 50 minutes of change logs if the checkpoint interval is set to 5 minutes |
+
+:::note
+When option `read.streaming.skip_compaction` is turned on and the streaming reader lags behind by more than the number of commits configured by
+`clean.retain_commits`, data loss may occur.
+:::
+
 ### Incremental Query
 There are 3 use cases for incremental query:
 1. Streaming query: specify the start commit with option `read.start-commit`;
@@ -472,8 +490,8 @@ There are 3 use cases for incremental query:
 #### Options
 |  Option Name  | Required | Default | Remarks |
 |  -----------  | -------  | ------- | ------- |
-| `write.start-commit` | `false` | the latest commit | Specify `earliest` to 
consume from the start commit |
-| `write.end-commit` | `false` | the latest commit | -- |
+| `read.start-commit` | `false` | the latest commit | Specify `earliest` to 
consume from the start commit |
+| `read.end-commit` | `false` | the latest commit | -- |
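+
+For example, a bounded incremental read could be expressed with a dynamic table options hint, assuming SQL hints are enabled in your Flink version and reusing the illustrative `hudi_table` from the streaming read sketch above (the instant times are placeholders):
+
+```sql
+-- illustrative bounded incremental read between two commit instants
+SELECT * FROM hudi_table /*+ OPTIONS('read.start-commit' = '20220101000000', 'read.end-commit' = '20220102000000') */;
+```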
 
 ## Kafka Connect Sink
 If you want to perform streaming ingestion into Hudi format similar to 
HoodieDeltaStreamer, but you don't want to depend on Spark,
diff --git a/website/versioned_docs/version-0.10.0/flink-quick-start-guide.md 
b/website/versioned_docs/version-0.10.0/flink-quick-start-guide.md
index d5dd05d..323acad 100644
--- a/website/versioned_docs/version-0.10.0/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.10.0/flink-quick-start-guide.md
@@ -4,15 +4,16 @@ toc: true
 last_modified_at: 2020-08-12T15:19:57+08:00
 ---
 
-This guide provides a document at Hudi's capabilities using Flink SQL. We can 
feel the unique charm of Flink stream computing engine on Hudi.
-Reading this guide, you can quickly start using Flink to write to(read from) 
Hudi, have a deeper understanding of configuration and optimization:
+This guide provides instructions for the Flink Hudi integration, showing how Flink brings the power of streaming into Hudi.
+Reading this guide, you can quickly start using Flink on Hudi and learn the different modes for reading and writing Hudi with Flink:
 
 - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly 
Flink sql client to write to(read from) Hudi.
-- **Configuration** : For [Flink 
Configuration](flink_configuration#global-configurations), sets up through 
`$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through 
[Table Option](flink_configuration#table-options).
-- **Writing Data** : Flink supports different writing data use cases, such as 
[CDC Ingestion](hoodie_deltastreamer#cdc-ingestion), [Bulk 
Insert](hoodie_deltastreamer#bulk-insert), [Index 
Bootstrap](hoodie_deltastreamer#index-bootstrap), [Changelog 
Mode](hoodie_deltastreamer#changelog-mode) and [Append 
Mode](hoodie_deltastreamer#append-mode).
-- **Querying Data** : Flink supports different querying data use cases, such 
as [Incremental Query](hoodie_deltastreamer#incremental-query), [Hive 
Query](syncing_metastore#flink-setup), [Presto 
Query](query_engine_setup#prestodb).
+- **Configuration** : [Global Configuration](flink_configuration#global-configurations) is set up through `$FLINK_HOME/conf/flink-conf.yaml`. Per-job configuration is set up through [Table Options](flink_configuration#table-options).
+- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](hoodie_deltastreamer#bulk-insert), [Index Bootstrap](hoodie_deltastreamer#index-bootstrap), [Changelog Mode](hoodie_deltastreamer#changelog-mode) and [Append Mode](hoodie_deltastreamer#append-mode).
+- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](hoodie_deltastreamer#streaming-query) and [Incremental Query](hoodie_deltastreamer#incremental-query).
 - **Tuning** : For write/read tasks, this guide gives some tuning suggestions, 
such as [Memory Optimization](flink_configuration#memory-optimization) and 
[Write Rate Limit](flink_configuration#write-rate-limit).
 - **Optimization**: Offline compaction is supported [Offline 
Compaction](compaction#flink-offline-compaction).
+- **Query Engines**: Besides Flink, many other engines are integrated: [Hive 
Query](syncing_metastore#flink-setup), [Presto 
Query](query_engine_setup#prestodb).
 
 ## Quick Start
 
diff --git a/website/versioned_docs/version-0.10.0/hoodie_deltastreamer.md 
b/website/versioned_docs/version-0.10.0/hoodie_deltastreamer.md
index a979788..f212f57 100644
--- a/website/versioned_docs/version-0.10.0/hoodie_deltastreamer.md
+++ b/website/versioned_docs/version-0.10.0/hoodie_deltastreamer.md
@@ -462,6 +462,24 @@ There are many use cases that user put the full history 
data set onto the messag
 |  -----------  | -------  | ------- | ------- |
 | `write.rate.limit` | `false` | `0` | Default disable the rate limit |
 
+### Streaming Query
+By default, the hoodie table is read as a batch source, that is, the latest snapshot data set is read and returned. Turn on streaming read
+mode by setting option `read.streaming.enabled` to `true`. Set option `read.start-commit` to specify the read start offset; specify the
+value as `earliest` if you want to consume all of the historical data set.
+
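+For example, a minimal streaming read can be sketched with the following Flink SQL; the table name, schema and path here are only illustrative:
+
+```sql
+-- an illustrative MERGE_ON_READ table declared for streaming read
+CREATE TABLE hudi_table (
+  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
+  name VARCHAR(10),
+  ts   TIMESTAMP(3)
+) WITH (
+  'connector' = 'hudi',
+  'path' = 'file:///tmp/hudi_table',    -- illustrative table path
+  'table.type' = 'MERGE_ON_READ',
+  'read.streaming.enabled' = 'true',    -- turns on the streaming read
+  'read.start-commit' = 'earliest'      -- consume from the earliest commit
+);
+
+-- the query runs continuously and emits new commits as they arrive
+SELECT * FROM hudi_table;
+```
+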
+#### Options
+|  Option Name  | Required | Default | Remarks |
+|  -----------  | -------  | ------- | ------- |
+| `read.streaming.enabled` | `false` | `false` | Specify `true` to read in streaming mode |
+| `read.start-commit` | `false` | the latest commit | Start commit time in the format 'yyyyMMddHHmmss'; use `earliest` to consume from the earliest commit |
+| `read.streaming.skip_compaction` | `false` | `false` | Whether to skip compaction commits while reading, generally for two purposes: 1) to avoid consuming duplicates from the compaction instants; 2) when changelog mode is enabled, to consume only the change logs for correct semantics |
+| `clean.retain_commits` | `false` | `10` | The max number of commits to retain before cleaning. When changelog mode is enabled, tweak this option to adjust how long the change logs are kept. For example, the default strategy keeps 50 minutes of change logs if the checkpoint interval is set to 5 minutes |
+
+:::note
+When option `read.streaming.skip_compaction` is turned on and the streaming reader lags behind by more than the number of commits configured by
+`clean.retain_commits`, data loss may occur.
+:::
+
 ### Incremental Query
 There are 3 use cases for incremental query:
 1. Streaming query: specify the start commit with option `read.start-commit`;
@@ -472,8 +490,8 @@ There are 3 use cases for incremental query:
 #### Options
 |  Option Name  | Required | Default | Remarks |
 |  -----------  | -------  | ------- | ------- |
-| `write.start-commit` | `false` | the latest commit | Specify `earliest` to 
consume from the start commit |
-| `write.end-commit` | `false` | the latest commit | -- |
+| `read.start-commit` | `false` | the latest commit | Specify `earliest` to 
consume from the start commit |
+| `read.end-commit` | `false` | the latest commit | -- |
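+
+For example, a bounded incremental read could be expressed with a dynamic table options hint, assuming SQL hints are enabled in your Flink version and reusing the illustrative `hudi_table` from the streaming read sketch above (the instant times are placeholders):
+
+```sql
+-- illustrative bounded incremental read between two commit instants
+SELECT * FROM hudi_table /*+ OPTIONS('read.start-commit' = '20220101000000', 'read.end-commit' = '20220102000000') */;
+```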
 
 ## Kafka Connect Sink
 If you want to perform streaming ingestion into Hudi format similar to 
HoodieDeltaStreamer, but you don't want to depend on Spark,
