This is an automated email from the ASF dual-hosted git repository.

luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
     new 501bb30c296 [doc][doc] small fixes (#929)
501bb30c296 is described below

commit 501bb30c2964192696f7c6cae1e44e71ef25c6ba
Author: Hu Yanjun <100749531+httpshir...@users.noreply.github.com>
AuthorDate: Tue Jul 30 18:13:26 2024 +0800

    [doc][doc] small fixes (#929)
---
 docs/get-starting/quick-start/doris-hudi.md | 14 +++++++-------
 docs/get-starting/quick-start/doris-paimon.md | 4 ++--
 .../current/get-starting/quick-start/doris-paimon.md | 7 +++----
 .../version-2.1/get-starting/quick-start/doris-paimon.md | 6 +++---
 .../version-3.0/get-starting/quick-start/doris-paimon.md | 6 +++---
 .../version-2.1/get-starting/quick-start/doris-hudi.md | 14 +++++++-------
 .../version-2.1/get-starting/quick-start/doris-paimon.md | 4 ++--
 .../version-3.0/get-starting/quick-start/doris-hudi.md | 14 +++++++-------
 .../version-3.0/get-starting/quick-start/doris-paimon.md | 4 ++--
 9 files changed, 36 insertions(+), 37 deletions(-)

diff --git a/docs/get-starting/quick-start/doris-hudi.md b/docs/get-starting/quick-start/doris-hudi.md
index f7426e3a3fc..b640c5e8f92 100644
--- a/docs/get-starting/quick-start/doris-hudi.md
+++ b/docs/get-starting/quick-start/doris-hudi.md
@@ -31,7 +31,7 @@ In recent versions, Apache Doris has deepened its integration with data lakes an
- Since version 0.15, Apache Doris has introduced Hive and Iceberg external tables, exploring the capabilities of combining with Apache Iceberg for data lakes.
- Starting from version 1.2, Apache Doris officially introduced the Multi-Catalog feature, enabling automatic metadata mapping and data access for various data sources, along with numerous performance optimizations for external data reading and query execution. It now fully possesses the ability to build a high-speed and user-friendly Lakehouse architecture.
-- In version 2.1, Apache Doris's Data Lakehouse architecture was significantly enhanced, improving the reading and writing capabilities of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with multiple SQL dialects, and seamless migration from existing systems to Apache Doris. For data science and large-scale data reading scenarios, Doris integrated the Arrow Flight high-speed reading interface, achieving a 100-fold increase in data transfer efficiency.
+- In version 2.1, Apache Doris' Data Lakehouse architecture was significantly enhanced, improving the reading and writing capabilities of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with multiple SQL dialects, and seamless migration from existing systems to Apache Doris. For data science and large-scale data reading scenarios, Doris integrated the Arrow Flight high-speed reading interface, achieving a 100-fold increase in data transfer efficiency.
![](/images/quick-start/lakehouse-arch.PNG)
@@ -46,12 +46,12 @@ Apache Doris has also enhanced its ability to read Apache Hudi data tables:
- Supports Time Travel
- Supports Incremental Read
-With Apache Doris's high-performance query execution and Apache Hudi's real-time data management capabilities, efficient, flexible, and cost-effective data querying and analysis can be achieved. It also provides robust data lineage, auditing, and incremental processing functionalities.
The combination of Apache Doris and Apache Hudi has been validated and promoted in real business scenarios by multiple community users: +With Apache Doris' high-performance query execution and Apache Hudi's real-time data management capabilities, efficient, flexible, and cost-effective data querying and analysis can be achieved. It also provides robust data lineage, auditing, and incremental processing functionalities. The combination of Apache Doris and Apache Hudi has been validated and promoted in real business scenarios by multiple community users: - Real-time data analysis and processing: Common scenarios such as real-time data updates and query analysis in industries like finance, advertising, and e-commerce require real-time data processing. Hudi enables real-time data updates and management while ensuring data consistency and reliability. Doris efficiently handles large-scale data query requests in real-time, meeting the demands of real-time data analysis and processing effectively when combined. -- Data lineage and auditing: For industries with high requirements for data security and accuracy like finance and healthcare, data lineage and auditing are crucial functionalities. Hudi offers Time Travel functionality for viewing historical data states, combined with Apache Doris's efficient querying capabilities, enabling quick analysis of data at any point in time for precise lineage and auditing. -- Incremental data reading and analysis: Large-scale data analysis often faces challenges of large data volumes and frequent updates. Hudi supports incremental data reading, allowing users to process only the changed data without full data updates. Additionally, Apache Doris's Incremental Read feature enhances this process, significantly improving data processing and analysis efficiency. -- Cross-data source federated queries: Many enterprises have complex data sources stored in different databases. Doris's Multi-Catalog feature supports automatic mapping and synchronization of various data sources, enabling federated queries across data sources. This greatly shortens the data flow path and enhances work efficiency for enterprises needing to retrieve and integrate data from multiple sources for analysis. +- Data lineage and auditing: For industries with high requirements for data security and accuracy like finance and healthcare, data lineage and auditing are crucial functionalities. Hudi offers Time Travel functionality for viewing historical data states, combined with Apache Doris' efficient querying capabilities, enabling quick analysis of data at any point in time for precise lineage and auditing. +- Incremental data reading and analysis: Large-scale data analysis often faces challenges of large data volumes and frequent updates. Hudi supports incremental data reading, allowing users to process only the changed data without full data updates. Additionally, Apache Doris' Incremental Read feature enhances this process, significantly improving data processing and analysis efficiency. +- Cross-data source federated queries: Many enterprises have complex data sources stored in different databases. Doris' Multi-Catalog feature supports automatic mapping and synchronization of various data sources, enabling federated queries across data sources. This greatly shortens the data flow path and enhances work efficiency for enterprises needing to retrieve and integrate data from multiple sources for analysis. 
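As a concrete illustration of the Multi-Catalog capability referenced in the bullets above, a Hudi data source is typically mounted in Doris through a Hive Metastore based catalog. The statement below is only a minimal sketch: the catalog name, metastore URI, and table name are placeholder values chosen for this example, not anything taken from the commit.

```
-- Minimal sketch: mount a Hudi source as a Doris catalog via its Hive Metastore.
-- 'hudi_demo' and the thrift URI are placeholder values.
CREATE CATALOG hudi_demo PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://hive-metastore:9083'
);

-- Tables in the external source can then be queried, and joined with data
-- from other catalogs, like any internal Doris table.
SELECT * FROM hudi_demo.default.customer_cow LIMIT 10;
```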
This article will introduce readers to how to quickly set up a test and demonstration environment for Apache Doris + Apache Hudi in a Docker environment, and demonstrate various operations to help readers get started quickly. @@ -258,9 +258,9 @@ spark-sql> select * from customer_mor timestamp as of '20240603015058442' where Data in Apache Hudi can be roughly divided into two categories - baseline data and incremental data. Baseline data is typically merged Parquet files, while incremental data refers to data increments generated by INSERT, UPDATE, or DELETE operations. Baseline data can be read directly, while incremental data needs to be read through Merge on Read. -For querying Hudi COW tables or Read Optimized queries on MOR tables, the data belongs to baseline data and can be directly read using Doris's native Parquet Reader, providing fast query responses. For incremental data, Doris needs to access Hudi's Java SDK through JNI calls. To achieve optimal query performance, Apache Doris divides the data in a query into baseline and incremental data parts and reads them using the aforementioned methods. +For querying Hudi COW tables or Read Optimized queries on MOR tables, the data belongs to baseline data and can be directly read using Doris' native Parquet Reader, providing fast query responses. For incremental data, Doris needs to access Hudi's Java SDK through JNI calls. To achieve optimal query performance, Apache Doris divides the data in a query into baseline and incremental data parts and reads them using the aforementioned methods. -To verify this optimization approach, we can use the EXPLAIN statement to see how many baseline and incremental data are present in a query example below. For a COW table, all 101 data shards are baseline data (`hudiNativeReadSplits=101/101`), so the COW table can be entirely read directly using Doris's Parquet Reader, resulting in the best query performance. For a ROW table, most data shards are baseline data (`hudiNativeReadSplits=100/101`), with one shard being incremental data, which [...] +To verify this optimization approach, we can use the EXPLAIN statement to see how many baseline and incremental data are present in a query example below. For a COW table, all 101 data shards are baseline data (`hudiNativeReadSplits=101/101`), so the COW table can be entirely read directly using Doris' Parquet Reader, resulting in the best query performance. For a ROW table, most data shards are baseline data (`hudiNativeReadSplits=100/101`), with one shard being incremental data, which [...] ``` -- COW table is read natively diff --git a/docs/get-starting/quick-start/doris-paimon.md b/docs/get-starting/quick-start/doris-paimon.md index cf1956c6e2d..27175dfe180 100644 --- a/docs/get-starting/quick-start/doris-paimon.md +++ b/docs/get-starting/quick-start/doris-paimon.md @@ -228,12 +228,12 @@ We conducted a simple test on the TPCDS 1000 dataset in Paimon (0.8) version, us ![](/images/quick-start/lakehouse-paimon-benchmark.PNG) -From the test results, it can be seen that Doris's average query performance on the standard static test set is 3-5 times that of Trino. In the future, we will optimize the Deletion Vector to further improve query efficiency in real business scenarios. +From the test results, it can be seen that Doris' average query performance on the standard static test set is 3-5 times that of Trino. In the future, we will optimize the Deletion Vector to further improve query efficiency in real business scenarios. 
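The doris-hudi.md hunk above shows a Spark SQL time-travel read on `customer_mor`; the Time Travel support mentioned earlier means the same snapshot should also be reachable from Doris. The query below is a hedged sketch that reuses the table name and timestamp from that Spark example, assuming the `FOR TIME AS OF` clause is available for the Hudi catalog; the filter on `c_custkey` is illustrative, and this is not output captured from the commit.

```
-- Sketch: read the Hudi MOR table at the snapshot used in the Spark SQL example above.
mysql> select * from customer_mor for time as of '20240603015058442' where c_custkey = 32;
```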
## Query Optimization For baseline data, after introducing the Primary Key Table Read Optimized feature in Apache Paimon version 0.6, the query engine can directly access the underlying Parquet/ORC files, significantly improving the reading efficiency of baseline data. For unmerged incremental data (data increments generated by INSERT, UPDATE, or DELETE), they can be read through Merge-on-Read. In addition, Paimon introduced the Deletion Vector feature in version 0.8, which further enhances the query engine's [...] -Apache Doris supports reading Deletion Vector through native Reader and performing Merge on Read. We demonstrate the query methods for baseline data and incremental data in a query using Doris's EXPLAIN statement. +Apache Doris supports reading Deletion Vector through native Reader and performing Merge on Read. We demonstrate the query methods for baseline data and incremental data in a query using Doris' EXPLAIN statement. ``` mysql> explain verbose select * from customer where c_nationkey < 3; diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/quick-start/doris-paimon.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/quick-start/doris-paimon.md index 68d6fba0c90..bdf306d6fa4 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/quick-start/doris-paimon.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/quick-start/doris-paimon.md @@ -156,7 +156,7 @@ Flink SQL> select * from customer order by c_custkey limit 4; ### 04 数据查询 -如下所示,Doris 集群中已经创建了名为paimon 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句: +如下所示,Doris 集群中已经创建了名为 `paimon` 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句: ``` -- 已创建,无需执行 @@ -228,7 +228,7 @@ mysql> select * from customer where c_nationkey=1 limit 2; ![](/images/quick-start/lakehouse-paimon-benchmark.PNG) -从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3 -5 倍。后续我们将针对 Deletion Vector 进行优化,进一步提升真实业务场景下的查询效率。 +从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3~5 倍。后续我们将针对 Deletion Vector 进行优化,进一步提升真实业务场景下的查询效率。 ## 查询优化 @@ -266,5 +266,4 @@ mysql> explain verbose select * from customer where c_nationkey < 3; 67 rows in set (0.23 sec) ``` -可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader 进行访问(paimonNativeReadSplits=4/4)。并且第一个分片的hasDeletionVector的属性为 true,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。 - +可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader 进行访问(`paimonNativeReadSplits=4/4`)。并且第一个分片的 `hasDeletionVector` 的属性为 `true`,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/get-starting/quick-start/doris-paimon.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/get-starting/quick-start/doris-paimon.md index c46e7531e19..a1db993d470 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/get-starting/quick-start/doris-paimon.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/get-starting/quick-start/doris-paimon.md @@ -156,7 +156,7 @@ Flink SQL> select * from customer order by c_custkey limit 4; ### 04 数据查询 -如下所示,Doris 集群中已经创建了名为paimon 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句: +如下所示,Doris 集群中已经创建了名为 `paimon` 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句: ``` -- 已创建,无需执行 @@ -228,7 +228,7 @@ mysql> select * from customer where c_nationkey=1 limit 2; ![](/images/quick-start/lakehouse-paimon-benchmark.PNG) -从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3 -5 倍。后续我们将针对 Deletion Vector 进行优化,进一步提升真实业务场景下的查询效率。 +从测试结果可以看到,Doris 
在标准静态测试集上的平均查询性能是 Trino 的 3~5 倍。后续我们将针对 Deletion Vector 进行优化,进一步提升真实业务场景下的查询效率。 ## 查询优化 @@ -266,4 +266,4 @@ mysql> explain verbose select * from customer where c_nationkey < 3; 67 rows in set (0.23 sec) ``` -可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader 进行访问(paimonNativeReadSplits=4/4)。并且第一个分片的hasDeletionVector的属性为 true,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。 \ No newline at end of file +可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader 进行访问(`paimonNativeReadSplits=4/4`)。并且第一个分片的`hasDeletionVector`的属性为`true`,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/get-starting/quick-start/doris-paimon.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/get-starting/quick-start/doris-paimon.md index c46e7531e19..a1db993d470 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/get-starting/quick-start/doris-paimon.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/get-starting/quick-start/doris-paimon.md @@ -156,7 +156,7 @@ Flink SQL> select * from customer order by c_custkey limit 4; ### 04 数据查询 -如下所示,Doris 集群中已经创建了名为paimon 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句: +如下所示,Doris 集群中已经创建了名为 `paimon` 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句: ``` -- 已创建,无需执行 @@ -228,7 +228,7 @@ mysql> select * from customer where c_nationkey=1 limit 2; ![](/images/quick-start/lakehouse-paimon-benchmark.PNG) -从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3 -5 倍。后续我们将针对 Deletion Vector 进行优化,进一步提升真实业务场景下的查询效率。 +从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3~5 倍。后续我们将针对 Deletion Vector 进行优化,进一步提升真实业务场景下的查询效率。 ## 查询优化 @@ -266,4 +266,4 @@ mysql> explain verbose select * from customer where c_nationkey < 3; 67 rows in set (0.23 sec) ``` -可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader 进行访问(paimonNativeReadSplits=4/4)。并且第一个分片的hasDeletionVector的属性为 true,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。 \ No newline at end of file +可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader 进行访问(`paimonNativeReadSplits=4/4`)。并且第一个分片的`hasDeletionVector`的属性为`true`,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。 \ No newline at end of file diff --git a/versioned_docs/version-2.1/get-starting/quick-start/doris-hudi.md b/versioned_docs/version-2.1/get-starting/quick-start/doris-hudi.md index f7426e3a3fc..b640c5e8f92 100644 --- a/versioned_docs/version-2.1/get-starting/quick-start/doris-hudi.md +++ b/versioned_docs/version-2.1/get-starting/quick-start/doris-hudi.md @@ -31,7 +31,7 @@ In recent versions, Apache Doris has deepened its integration with data lakes an - Since version 0.15, Apache Doris has introduced Hive and Iceberg external tables, exploring the capabilities of combining with Apache Iceberg for data lakes. - Starting from version 1.2, Apache Doris officially introduced the Multi-Catalog feature, enabling automatic metadata mapping and data access for various data sources, along with numerous performance optimizations for external data reading and query execution. It now fully possesses the ability to build a high-speed and user-friendly Lakehouse architecture. -- In version 2.1, Apache Doris's Data Lakehouse architecture was significantly enhanced, improving the reading and writing capabilities of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with multiple SQL dialects, and seamless migration from existing systems to Apache Doris. 
For data science and large-scale data reading scenarios, Doris integrated the Arrow Flight high-speed reading interface, achieving a 100-fold increase in data transfer efficiency. +- In version 2.1, Apache Doris' Data Lakehouse architecture was significantly enhanced, improving the reading and writing capabilities of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with multiple SQL dialects, and seamless migration from existing systems to Apache Doris. For data science and large-scale data reading scenarios, Doris integrated the Arrow Flight high-speed reading interface, achieving a 100-fold increase in data transfer efficiency. ![](/images/quick-start/lakehouse-arch.PNG) @@ -46,12 +46,12 @@ Apache Doris has also enhanced its ability to read Apache Hudi data tables: - Supports Time Travel - Supports Incremental Read -With Apache Doris's high-performance query execution and Apache Hudi's real-time data management capabilities, efficient, flexible, and cost-effective data querying and analysis can be achieved. It also provides robust data lineage, auditing, and incremental processing functionalities. The combination of Apache Doris and Apache Hudi has been validated and promoted in real business scenarios by multiple community users: +With Apache Doris' high-performance query execution and Apache Hudi's real-time data management capabilities, efficient, flexible, and cost-effective data querying and analysis can be achieved. It also provides robust data lineage, auditing, and incremental processing functionalities. The combination of Apache Doris and Apache Hudi has been validated and promoted in real business scenarios by multiple community users: - Real-time data analysis and processing: Common scenarios such as real-time data updates and query analysis in industries like finance, advertising, and e-commerce require real-time data processing. Hudi enables real-time data updates and management while ensuring data consistency and reliability. Doris efficiently handles large-scale data query requests in real-time, meeting the demands of real-time data analysis and processing effectively when combined. -- Data lineage and auditing: For industries with high requirements for data security and accuracy like finance and healthcare, data lineage and auditing are crucial functionalities. Hudi offers Time Travel functionality for viewing historical data states, combined with Apache Doris's efficient querying capabilities, enabling quick analysis of data at any point in time for precise lineage and auditing. -- Incremental data reading and analysis: Large-scale data analysis often faces challenges of large data volumes and frequent updates. Hudi supports incremental data reading, allowing users to process only the changed data without full data updates. Additionally, Apache Doris's Incremental Read feature enhances this process, significantly improving data processing and analysis efficiency. -- Cross-data source federated queries: Many enterprises have complex data sources stored in different databases. Doris's Multi-Catalog feature supports automatic mapping and synchronization of various data sources, enabling federated queries across data sources. This greatly shortens the data flow path and enhances work efficiency for enterprises needing to retrieve and integrate data from multiple sources for analysis. 
+- Data lineage and auditing: For industries with high requirements for data security and accuracy like finance and healthcare, data lineage and auditing are crucial functionalities. Hudi offers Time Travel functionality for viewing historical data states, combined with Apache Doris' efficient querying capabilities, enabling quick analysis of data at any point in time for precise lineage and auditing. +- Incremental data reading and analysis: Large-scale data analysis often faces challenges of large data volumes and frequent updates. Hudi supports incremental data reading, allowing users to process only the changed data without full data updates. Additionally, Apache Doris' Incremental Read feature enhances this process, significantly improving data processing and analysis efficiency. +- Cross-data source federated queries: Many enterprises have complex data sources stored in different databases. Doris' Multi-Catalog feature supports automatic mapping and synchronization of various data sources, enabling federated queries across data sources. This greatly shortens the data flow path and enhances work efficiency for enterprises needing to retrieve and integrate data from multiple sources for analysis. This article will introduce readers to how to quickly set up a test and demonstration environment for Apache Doris + Apache Hudi in a Docker environment, and demonstrate various operations to help readers get started quickly. @@ -258,9 +258,9 @@ spark-sql> select * from customer_mor timestamp as of '20240603015058442' where Data in Apache Hudi can be roughly divided into two categories - baseline data and incremental data. Baseline data is typically merged Parquet files, while incremental data refers to data increments generated by INSERT, UPDATE, or DELETE operations. Baseline data can be read directly, while incremental data needs to be read through Merge on Read. -For querying Hudi COW tables or Read Optimized queries on MOR tables, the data belongs to baseline data and can be directly read using Doris's native Parquet Reader, providing fast query responses. For incremental data, Doris needs to access Hudi's Java SDK through JNI calls. To achieve optimal query performance, Apache Doris divides the data in a query into baseline and incremental data parts and reads them using the aforementioned methods. +For querying Hudi COW tables or Read Optimized queries on MOR tables, the data belongs to baseline data and can be directly read using Doris' native Parquet Reader, providing fast query responses. For incremental data, Doris needs to access Hudi's Java SDK through JNI calls. To achieve optimal query performance, Apache Doris divides the data in a query into baseline and incremental data parts and reads them using the aforementioned methods. -To verify this optimization approach, we can use the EXPLAIN statement to see how many baseline and incremental data are present in a query example below. For a COW table, all 101 data shards are baseline data (`hudiNativeReadSplits=101/101`), so the COW table can be entirely read directly using Doris's Parquet Reader, resulting in the best query performance. For a ROW table, most data shards are baseline data (`hudiNativeReadSplits=100/101`), with one shard being incremental data, which [...] +To verify this optimization approach, we can use the EXPLAIN statement to see how many baseline and incremental data are present in a query example below. 
For a COW table, all 101 data shards are baseline data (`hudiNativeReadSplits=101/101`), so the COW table can be entirely read directly using Doris' Parquet Reader, resulting in the best query performance. For a ROW table, most data shards are baseline data (`hudiNativeReadSplits=100/101`), with one shard being incremental data, which [...] ``` -- COW table is read natively diff --git a/versioned_docs/version-2.1/get-starting/quick-start/doris-paimon.md b/versioned_docs/version-2.1/get-starting/quick-start/doris-paimon.md index cf1956c6e2d..27175dfe180 100644 --- a/versioned_docs/version-2.1/get-starting/quick-start/doris-paimon.md +++ b/versioned_docs/version-2.1/get-starting/quick-start/doris-paimon.md @@ -228,12 +228,12 @@ We conducted a simple test on the TPCDS 1000 dataset in Paimon (0.8) version, us ![](/images/quick-start/lakehouse-paimon-benchmark.PNG) -From the test results, it can be seen that Doris's average query performance on the standard static test set is 3-5 times that of Trino. In the future, we will optimize the Deletion Vector to further improve query efficiency in real business scenarios. +From the test results, it can be seen that Doris' average query performance on the standard static test set is 3-5 times that of Trino. In the future, we will optimize the Deletion Vector to further improve query efficiency in real business scenarios. ## Query Optimization For baseline data, after introducing the Primary Key Table Read Optimized feature in Apache Paimon version 0.6, the query engine can directly access the underlying Parquet/ORC files, significantly improving the reading efficiency of baseline data. For unmerged incremental data (data increments generated by INSERT, UPDATE, or DELETE), they can be read through Merge-on-Read. In addition, Paimon introduced the Deletion Vector feature in version 0.8, which further enhances the query engine's [...] -Apache Doris supports reading Deletion Vector through native Reader and performing Merge on Read. We demonstrate the query methods for baseline data and incremental data in a query using Doris's EXPLAIN statement. +Apache Doris supports reading Deletion Vector through native Reader and performing Merge on Read. We demonstrate the query methods for baseline data and incremental data in a query using Doris' EXPLAIN statement. ``` mysql> explain verbose select * from customer where c_nationkey < 3; diff --git a/versioned_docs/version-3.0/get-starting/quick-start/doris-hudi.md b/versioned_docs/version-3.0/get-starting/quick-start/doris-hudi.md index f7426e3a3fc..b640c5e8f92 100644 --- a/versioned_docs/version-3.0/get-starting/quick-start/doris-hudi.md +++ b/versioned_docs/version-3.0/get-starting/quick-start/doris-hudi.md @@ -31,7 +31,7 @@ In recent versions, Apache Doris has deepened its integration with data lakes an - Since version 0.15, Apache Doris has introduced Hive and Iceberg external tables, exploring the capabilities of combining with Apache Iceberg for data lakes. - Starting from version 1.2, Apache Doris officially introduced the Multi-Catalog feature, enabling automatic metadata mapping and data access for various data sources, along with numerous performance optimizations for external data reading and query execution. It now fully possesses the ability to build a high-speed and user-friendly Lakehouse architecture. 
-- In version 2.1, Apache Doris's Data Lakehouse architecture was significantly enhanced, improving the reading and writing capabilities of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with multiple SQL dialects, and seamless migration from existing systems to Apache Doris. For data science and large-scale data reading scenarios, Doris integrated the Arrow Flight high-speed reading interface, achieving a 100-fold increase in data transfer efficiency. +- In version 2.1, Apache Doris' Data Lakehouse architecture was significantly enhanced, improving the reading and writing capabilities of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with multiple SQL dialects, and seamless migration from existing systems to Apache Doris. For data science and large-scale data reading scenarios, Doris integrated the Arrow Flight high-speed reading interface, achieving a 100-fold increase in data transfer efficiency. ![](/images/quick-start/lakehouse-arch.PNG) @@ -46,12 +46,12 @@ Apache Doris has also enhanced its ability to read Apache Hudi data tables: - Supports Time Travel - Supports Incremental Read -With Apache Doris's high-performance query execution and Apache Hudi's real-time data management capabilities, efficient, flexible, and cost-effective data querying and analysis can be achieved. It also provides robust data lineage, auditing, and incremental processing functionalities. The combination of Apache Doris and Apache Hudi has been validated and promoted in real business scenarios by multiple community users: +With Apache Doris' high-performance query execution and Apache Hudi's real-time data management capabilities, efficient, flexible, and cost-effective data querying and analysis can be achieved. It also provides robust data lineage, auditing, and incremental processing functionalities. The combination of Apache Doris and Apache Hudi has been validated and promoted in real business scenarios by multiple community users: - Real-time data analysis and processing: Common scenarios such as real-time data updates and query analysis in industries like finance, advertising, and e-commerce require real-time data processing. Hudi enables real-time data updates and management while ensuring data consistency and reliability. Doris efficiently handles large-scale data query requests in real-time, meeting the demands of real-time data analysis and processing effectively when combined. -- Data lineage and auditing: For industries with high requirements for data security and accuracy like finance and healthcare, data lineage and auditing are crucial functionalities. Hudi offers Time Travel functionality for viewing historical data states, combined with Apache Doris's efficient querying capabilities, enabling quick analysis of data at any point in time for precise lineage and auditing. -- Incremental data reading and analysis: Large-scale data analysis often faces challenges of large data volumes and frequent updates. Hudi supports incremental data reading, allowing users to process only the changed data without full data updates. Additionally, Apache Doris's Incremental Read feature enhances this process, significantly improving data processing and analysis efficiency. -- Cross-data source federated queries: Many enterprises have complex data sources stored in different databases. Doris's Multi-Catalog feature supports automatic mapping and synchronization of various data sources, enabling federated queries across data sources. 
This greatly shortens the data flow path and enhances work efficiency for enterprises needing to retrieve and integrate data from multiple sources for analysis. +- Data lineage and auditing: For industries with high requirements for data security and accuracy like finance and healthcare, data lineage and auditing are crucial functionalities. Hudi offers Time Travel functionality for viewing historical data states, combined with Apache Doris' efficient querying capabilities, enabling quick analysis of data at any point in time for precise lineage and auditing. +- Incremental data reading and analysis: Large-scale data analysis often faces challenges of large data volumes and frequent updates. Hudi supports incremental data reading, allowing users to process only the changed data without full data updates. Additionally, Apache Doris' Incremental Read feature enhances this process, significantly improving data processing and analysis efficiency. +- Cross-data source federated queries: Many enterprises have complex data sources stored in different databases. Doris' Multi-Catalog feature supports automatic mapping and synchronization of various data sources, enabling federated queries across data sources. This greatly shortens the data flow path and enhances work efficiency for enterprises needing to retrieve and integrate data from multiple sources for analysis. This article will introduce readers to how to quickly set up a test and demonstration environment for Apache Doris + Apache Hudi in a Docker environment, and demonstrate various operations to help readers get started quickly. @@ -258,9 +258,9 @@ spark-sql> select * from customer_mor timestamp as of '20240603015058442' where Data in Apache Hudi can be roughly divided into two categories - baseline data and incremental data. Baseline data is typically merged Parquet files, while incremental data refers to data increments generated by INSERT, UPDATE, or DELETE operations. Baseline data can be read directly, while incremental data needs to be read through Merge on Read. -For querying Hudi COW tables or Read Optimized queries on MOR tables, the data belongs to baseline data and can be directly read using Doris's native Parquet Reader, providing fast query responses. For incremental data, Doris needs to access Hudi's Java SDK through JNI calls. To achieve optimal query performance, Apache Doris divides the data in a query into baseline and incremental data parts and reads them using the aforementioned methods. +For querying Hudi COW tables or Read Optimized queries on MOR tables, the data belongs to baseline data and can be directly read using Doris' native Parquet Reader, providing fast query responses. For incremental data, Doris needs to access Hudi's Java SDK through JNI calls. To achieve optimal query performance, Apache Doris divides the data in a query into baseline and incremental data parts and reads them using the aforementioned methods. -To verify this optimization approach, we can use the EXPLAIN statement to see how many baseline and incremental data are present in a query example below. For a COW table, all 101 data shards are baseline data (`hudiNativeReadSplits=101/101`), so the COW table can be entirely read directly using Doris's Parquet Reader, resulting in the best query performance. For a ROW table, most data shards are baseline data (`hudiNativeReadSplits=100/101`), with one shard being incremental data, which [...] 
+To verify this optimization approach, we can use the EXPLAIN statement to see how many baseline and incremental data are present in a query example below. For a COW table, all 101 data shards are baseline data (`hudiNativeReadSplits=101/101`), so the COW table can be entirely read directly using Doris' Parquet Reader, resulting in the best query performance. For a ROW table, most data shards are baseline data (`hudiNativeReadSplits=100/101`), with one shard being incremental data, which [...] ``` -- COW table is read natively diff --git a/versioned_docs/version-3.0/get-starting/quick-start/doris-paimon.md b/versioned_docs/version-3.0/get-starting/quick-start/doris-paimon.md index cf1956c6e2d..27175dfe180 100644 --- a/versioned_docs/version-3.0/get-starting/quick-start/doris-paimon.md +++ b/versioned_docs/version-3.0/get-starting/quick-start/doris-paimon.md @@ -228,12 +228,12 @@ We conducted a simple test on the TPCDS 1000 dataset in Paimon (0.8) version, us ![](/images/quick-start/lakehouse-paimon-benchmark.PNG) -From the test results, it can be seen that Doris's average query performance on the standard static test set is 3-5 times that of Trino. In the future, we will optimize the Deletion Vector to further improve query efficiency in real business scenarios. +From the test results, it can be seen that Doris' average query performance on the standard static test set is 3-5 times that of Trino. In the future, we will optimize the Deletion Vector to further improve query efficiency in real business scenarios. ## Query Optimization For baseline data, after introducing the Primary Key Table Read Optimized feature in Apache Paimon version 0.6, the query engine can directly access the underlying Parquet/ORC files, significantly improving the reading efficiency of baseline data. For unmerged incremental data (data increments generated by INSERT, UPDATE, or DELETE), they can be read through Merge-on-Read. In addition, Paimon introduced the Deletion Vector feature in version 0.8, which further enhances the query engine's [...] -Apache Doris supports reading Deletion Vector through native Reader and performing Merge on Read. We demonstrate the query methods for baseline data and incremental data in a query using Doris's EXPLAIN statement. +Apache Doris supports reading Deletion Vector through native Reader and performing Merge on Read. We demonstrate the query methods for baseline data and incremental data in a query using Doris' EXPLAIN statement. ``` mysql> explain verbose select * from customer where c_nationkey < 3; --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org