This is an automated email from the ASF dual-hosted git repository.

luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
     new 501bb30c296 [doc][doc] small fixes (#929)
501bb30c296 is described below

commit 501bb30c2964192696f7c6cae1e44e71ef25c6ba
Author: Hu Yanjun <100749531+httpshir...@users.noreply.github.com>
AuthorDate: Tue Jul 30 18:13:26 2024 +0800

    [doc][doc] small fixes (#929)
---
 docs/get-starting/quick-start/doris-hudi.md | 14 +++++++-------
 docs/get-starting/quick-start/doris-paimon.md | 4 ++--
 .../current/get-starting/quick-start/doris-paimon.md | 7 +++----
 .../version-2.1/get-starting/quick-start/doris-paimon.md | 6 +++---
 .../version-3.0/get-starting/quick-start/doris-paimon.md | 6 +++---
 .../version-2.1/get-starting/quick-start/doris-hudi.md | 14 +++++++-------
 .../version-2.1/get-starting/quick-start/doris-paimon.md | 4 ++--
 .../version-3.0/get-starting/quick-start/doris-hudi.md | 14 +++++++-------
 .../version-3.0/get-starting/quick-start/doris-paimon.md | 4 ++--
 9 files changed, 36 insertions(+), 37 deletions(-)

diff --git a/docs/get-starting/quick-start/doris-hudi.md b/docs/get-starting/quick-start/doris-hudi.md
index f7426e3a3fc..b640c5e8f92 100644
--- a/docs/get-starting/quick-start/doris-hudi.md
+++ b/docs/get-starting/quick-start/doris-hudi.md
@@ -31,7 +31,7 @@ In recent versions, Apache Doris has deepened its integration with data lakes an
- Since version 0.15, Apache Doris has introduced Hive and Iceberg external tables, exploring the capabilities of combining with Apache Iceberg for data lakes.
- Starting from version 1.2, Apache Doris officially introduced the Multi-Catalog feature, enabling automatic metadata mapping and data access for various data sources, along with numerous performance optimizations for external data reading and query execution. It now fully possesses the ability to build a high-speed and user-friendly Lakehouse architecture.
-- In version 2.1, Apache Doris's Data Lakehouse architecture was significantly enhanced, improving the reading and writing capabilities of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with multiple SQL dialects, and seamless migration from existing systems to Apache Doris. For data science and large-scale data reading scenarios, Doris integrated the Arrow Flight high-speed reading interface, achieving a 100-fold increase in data transfer efficiency.
+- In version 2.1, Apache Doris' Data Lakehouse architecture was significantly enhanced, improving the reading and writing capabilities of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with multiple SQL dialects, and seamless migration from existing systems to Apache Doris. For data science and large-scale data reading scenarios, Doris integrated the Arrow Flight high-speed reading interface, achieving a 100-fold increase in data transfer efficiency.
![](/images/quick-start/lakehouse-arch.PNG)
@@ -46,12 +46,12 @@ Apache Doris has also enhanced its ability to read Apache Hudi data tables:
- Supports Time Travel
- Supports Incremental Read
-With Apache Doris's high-performance query execution and Apache Hudi's real-time data management capabilities, efficient, flexible, and cost-effective data querying and analysis can be achieved. It also provides robust data lineage, auditing, and incremental processing functionalities.
The combination of Apache Doris and Apache Hudi has been validated and promoted in real business scenarios by multiple community users: +With Apache Doris' high-performance query execution and Apache Hudi's real-time data management capabilities, efficient, flexible, and cost-effective data querying and analysis can be achieved. It also provides robust data lineage, auditing, and incremental processing functionalities. The combination of Apache Doris and Apache Hudi has been validated and promoted in real business scenarios by multiple community users: - Real-time data analysis and processing: Common scenarios such as real-time data updates and query analysis in industries like finance, advertising, and e-commerce require real-time data processing. Hudi enables real-time data updates and management while ensuring data consistency and reliability. Doris efficiently handles large-scale data query requests in real-time, meeting the demands of real-time data analysis and processing effectively when combined. -- Data lineage and auditing: For industries with high requirements for data security and accuracy like finance and healthcare, data lineage and auditing are crucial functionalities. Hudi offers Time Travel functionality for viewing historical data states, combined with Apache Doris's efficient querying capabilities, enabling quick analysis of data at any point in time for precise lineage and auditing. -- Incremental data reading and analysis: Large-scale data analysis often faces challenges of large data volumes and frequent updates. Hudi supports incremental data reading, allowing users to process only the changed data without full data updates. Additionally, Apache Doris's Incremental Read feature enhances this process, significantly improving data processing and analysis efficiency. -- Cross-data source federated queries: Many enterprises have complex data sources stored in different databases. Doris's Multi-Catalog feature supports automatic mapping and synchronization of various data sources, enabling federated queries across data sources. This greatly shortens the data flow path and enhances work efficiency for enterprises needing to retrieve and integrate data from multiple sources for analysis. +- Data lineage and auditing: For industries with high requirements for data security and accuracy like finance and healthcare, data lineage and auditing are crucial functionalities. Hudi offers Time Travel functionality for viewing historical data states, combined with Apache Doris' efficient querying capabilities, enabling quick analysis of data at any point in time for precise lineage and auditing. +- Incremental data reading and analysis: Large-scale data analysis often faces challenges of large data volumes and frequent updates. Hudi supports incremental data reading, allowing users to process only the changed data without full data updates. Additionally, Apache Doris' Incremental Read feature enhances this process, significantly improving data processing and analysis efficiency. +- Cross-data source federated queries: Many enterprises have complex data sources stored in different databases. Doris' Multi-Catalog feature supports automatic mapping and synchronization of various data sources, enabling federated queries across data sources. This greatly shortens the data flow path and enhances work efficiency for enterprises needing to retrieve and integrate data from multiple sources for analysis. 
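As a concrete illustration of the Multi-Catalog capability referenced in the bullets above, a Hudi data source is typically mounted in Doris through a Hive Metastore based catalog. The statement below is only a minimal sketch: the catalog name, metastore URI, and table name are placeholder values chosen for this example, not anything taken from the commit.

```
-- Minimal sketch: mount a Hudi source as a Doris catalog via its Hive Metastore.
-- 'hudi_demo' and the thrift URI are placeholder values.
CREATE CATALOG hudi_demo PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://hive-metastore:9083'
);

-- Tables in the external source can then be queried, and joined with data
-- from other catalogs, like any internal Doris table.
SELECT * FROM hudi_demo.default.customer_cow LIMIT 10;
```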
This article will introduce readers to how to quickly set up a test and demonstration environment for Apache Doris + Apache Hudi in a Docker environment, and demonstrate various operations to help readers get started quickly. @@ -258,9 +258,9 @@ spark-sql> select * from customer_mor timestamp as of '20240603015058442' where Data in Apache Hudi can be roughly divided into two categories - baseline data and incremental data. Baseline data is typically merged Parquet files, while incremental data refers to data increments generated by INSERT, UPDATE, or DELETE operations. Baseline data can be read directly, while incremental data needs to be read through Merge on Read. -For querying Hudi COW tables or Read Optimized queries on MOR tables, the data belongs to baseline data and can be directly read using Doris's native Parquet Reader, providing fast query responses. For incremental data, Doris needs to access Hudi's Java SDK through JNI calls. To achieve optimal query performance, Apache Doris divides the data in a query into baseline and incremental data parts and reads them using the aforementioned methods. +For querying Hudi COW tables or Read Optimized queries on MOR tables, the data belongs to baseline data and can be directly read using Doris' native Parquet Reader, providing fast query responses. For incremental data, Doris needs to access Hudi's Java SDK through JNI calls. To achieve optimal query performance, Apache Doris divides the data in a query into baseline and incremental data parts and reads them using the aforementioned methods. -To verify this optimization approach, we can use the EXPLAIN statement to see how many baseline and incremental data are present in a query example below. For a COW table, all 101 data shards are baseline data (`hudiNativeReadSplits=101/101`), so the COW table can be entirely read directly using Doris's Parquet Reader, resulting in the best query performance. For a ROW table, most data shards are baseline data (`hudiNativeReadSplits=100/101`), with one shard being incremental data, which [...] +To verify this optimization approach, we can use the EXPLAIN statement to see how many baseline and incremental data are present in a query example below. For a COW table, all 101 data shards are baseline data (`hudiNativeReadSplits=101/101`), so the COW table can be entirely read directly using Doris' Parquet Reader, resulting in the best query performance. For a ROW table, most data shards are baseline data (`hudiNativeReadSplits=100/101`), with one shard being incremental data, which [...] ``` -- COW table is read natively diff --git a/docs/get-starting/quick-start/doris-paimon.md b/docs/get-starting/quick-start/doris-paimon.md index cf1956c6e2d..27175dfe180 100644 --- a/docs/get-starting/quick-start/doris-paimon.md +++ b/docs/get-starting/quick-start/doris-paimon.md @@ -228,12 +228,12 @@ We conducted a simple test on the TPCDS 1000 dataset in Paimon (0.8) version, us ![](/images/quick-start/lakehouse-paimon-benchmark.PNG) -From the test results, it can be seen that Doris's average query performance on the standard static test set is 3-5 times that of Trino. In the future, we will optimize the Deletion Vector to further improve query efficiency in real business scenarios. +From the test results, it can be seen that Doris' average query performance on the standard static test set is 3-5 times that of Trino. In the future, we will optimize the Deletion Vector to further improve query efficiency in real business scenarios. 
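The doris-hudi.md hunk above shows a Spark SQL time-travel read on `customer_mor`; the Time Travel support mentioned earlier means the same snapshot should also be reachable from Doris. The query below is a hedged sketch that reuses the table name and timestamp from that Spark example, assuming the `FOR TIME AS OF` clause is available for the Hudi catalog; the filter on `c_custkey` is illustrative, and this is not output captured from the commit.

```
-- Sketch: read the Hudi MOR table at the snapshot used in the Spark SQL example above.
mysql> select * from customer_mor for time as of '20240603015058442' where c_custkey = 32;
```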
## Query Optimization For baseline data, after introducing the Primary Key Table Read Optimized feature in Apache Paimon version 0.6, the query engine can directly access the underlying Parquet/ORC files, significantly improving the reading efficiency of baseline data. For unmerged incremental data (data increments generated by INSERT, UPDATE, or DELETE), they can be read through Merge-on-Read. In addition, Paimon introduced the Deletion Vector feature in version 0.8, which further enhances the query engine's [...] -Apache Doris supports reading Deletion Vector through native Reader and performing Merge on Read. We demonstrate the query methods for baseline data and incremental data in a query using Doris's EXPLAIN statement. +Apache Doris supports reading Deletion Vector through native Reader and performing Merge on Read. We demonstrate the query methods for baseline data and incremental data in a query using Doris' EXPLAIN statement. ``` mysql> explain verbose select * from customer where c_nationkey < 3; diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/quick-start/doris-paimon.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/quick-start/doris-paimon.md index 68d6fba0c90..bdf306d6fa4 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/quick-start/doris-paimon.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/quick-start/doris-paimon.md @@ -156,7 +156,7 @@ Flink SQL> select * from customer order by c_custkey limit 4; ### 04 数据查询 -如下所示,Doris 集群中已经创建了名为paimon 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句: +如下所示,Doris 集群中已经创建了名为 `paimon` 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句: ``` -- 已创建,无需执行 @@ -228,7 +228,7 @@ mysql> select * from customer where c_nationkey=1 limit 2; ![](/images/quick-start/lakehouse-paimon-benchmark.PNG) -从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3 -5 倍。后续我们将针对 Deletion Vector 进行优化,进一步提升真实业务场景下的查询效率。 +从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3~5 倍。后续我们将针对 Deletion Vector 进行优化,进一步提升真实业务场景下的查询效率。 ## 查询优化 @@ -266,5 +266,4 @@ mysql> explain verbose select * from customer where c_nationkey < 3; 67 rows in set (0.23 sec) ``` -可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader 进行访问(paimonNativeReadSplits=4/4)。并且第一个分片的hasDeletionVector的属性为 true,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。 - +可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader 进行访问(`paimonNativeReadSplits=4/4`)。并且第一个分片的 `hasDeletionVector` 的属性为 `true`,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/get-starting/quick-start/doris-paimon.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/get-starting/quick-start/doris-paimon.md index c46e7531e19..a1db993d470 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/get-starting/quick-start/doris-paimon.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/get-starting/quick-start/doris-paimon.md @@ -156,7 +156,7 @@ Flink SQL> select * from customer order by c_custkey limit 4; ### 04 数据查询 -如下所示,Doris 集群中已经创建了名为paimon 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句: +如下所示,Doris 集群中已经创建了名为 `paimon` 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句: ``` -- 已创建,无需执行 @@ -228,7 +228,7 @@ mysql> select * from customer where c_nationkey=1 limit 2; ![](/images/quick-start/lakehouse-paimon-benchmark.PNG) -从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3 -5 倍。后续我们将针对 Deletion Vector 进行优化,进一步提升真实业务场景下的查询效率。 +从测试结果可以看到,Doris 
在标准静态测试集上的平均查询性能是 Trino 的 3~5 倍。后续我们将针对 Deletion Vector 进行优化,进一步提升真实业务场景下的查询效率。 ## 查询优化 @@ -266,4 +266,4 @@ mysql> explain verbose select * from customer where c_nationkey < 3; 67 rows in set (0.23 sec) ``` -可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader 进行访问(paimonNativeReadSplits=4/4)。并且第一个分片的hasDeletionVector的属性为 true,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。 \ No newline at end of file +可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader 进行访问(`paimonNativeReadSplits=4/4`)。并且第一个分片的`hasDeletionVector`的属性为`true`,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/get-starting/quick-start/doris-paimon.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/get-starting/quick-start/doris-paimon.md index c46e7531e19..a1db993d470 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/get-starting/quick-start/doris-paimon.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/get-starting/quick-start/doris-paimon.md @@ -156,7 +156,7 @@ Flink SQL> select * from customer order by c_custkey limit 4; ### 04 数据查询 -如下所示,Doris 集群中已经创建了名为paimon 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句: +如下所示,Doris 集群中已经创建了名为 `paimon` 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句: ``` -- 已创建,无需执行 @@ -228,7 +228,7 @@ mysql> select * from customer where c_nationkey=1 limit 2; ![](/images/quick-start/lakehouse-paimon-benchmark.PNG) -从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3 -5 倍。后续我们将针对 Deletion Vector 进行优化,进一步提升真实业务场景下的查询效率。 +从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3~5 倍。后续我们将针对 Deletion Vector 进行优化,进一步提升真实业务场景下的查询效率。 ## 查询优化 @@ -266,4 +266,4 @@ mysql> explain verbose select * from customer where c_nationkey < 3; 67 rows in set (0.23 sec) ``` -可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader 进行访问(paimonNativeReadSplits=4/4)。并且第一个分片的hasDeletionVector的属性为 true,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。 \ No newline at end of file +可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader 进行访问(`paimonNativeReadSplits=4/4`)。并且第一个分片的`hasDeletionVector`的属性为`true`,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。 \ No newline at end of file diff --git a/versioned_docs/version-2.1/get-starting/quick-start/doris-hudi.md b/versioned_docs/version-2.1/get-starting/quick-start/doris-hudi.md index f7426e3a3fc..b640c5e8f92 100644 --- a/versioned_docs/version-2.1/get-starting/quick-start/doris-hudi.md +++ b/versioned_docs/version-2.1/get-starting/quick-start/doris-hudi.md @@ -31,7 +31,7 @@ In recent versions, Apache Doris has deepened its integration with data lakes an - Since version 0.15, Apache Doris has introduced Hive and Iceberg external tables, exploring the capabilities of combining with Apache Iceberg for data lakes. - Starting from version 1.2, Apache Doris officially introduced the Multi-Catalog feature, enabling automatic metadata mapping and data access for various data sources, along with numerous performance optimizations for external data reading and query execution. It now fully possesses the ability to build a high-speed and user-friendly Lakehouse architecture. -- In version 2.1, Apache Doris's Data Lakehouse architecture was significantly enhanced, improving the reading and writing capabilities of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with multiple SQL dialects, and seamless migration from existing systems to Apache Doris. 
For data science and large-scale data reading scenarios, Doris integrated the Arrow Flight high-speed reading interface, achieving a 100-fold increase in data transfer efficiency. +- In version 2.1, Apache Doris' Data Lakehouse architecture was significantly enhanced, improving the reading and writing capabilities of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with multiple SQL dialects, and seamless migration from existing systems to Apache Doris. For data science and large-scale data reading scenarios, Doris integrated the Arrow Flight high-speed reading interface, achieving a 100-fold increase in data transfer efficiency. ![](/images/quick-start/lakehouse-arch.PNG) @@ -46,12 +46,12 @@ Apache Doris has also enhanced its ability to read Apache Hudi data tables: - Supports Time Travel - Supports Incremental Read -With Apache Doris's high-performance query execution and Apache Hudi's real-time data management capabilities, efficient, flexible, and cost-effective data querying and analysis can be achieved. It also provides robust data lineage, auditing, and incremental processing functionalities. The combination of Apache Doris and Apache Hudi has been validated and promoted in real business scenarios by multiple community users: +With Apache Doris' high-performance query execution and Apache Hudi's real-time data management capabilities, efficient, flexible, and cost-effective data querying and analysis can be achieved. It also provides robust data lineage, auditing, and incremental processing functionalities. The combination of Apache Doris and Apache Hudi has been validated and promoted in real business scenarios by multiple community users: - Real-time data analysis and processing: Common scenarios such as real-time data updates and query analysis in industries like finance, advertising, and e-commerce require real-time data processing. Hudi enables real-time data updates and management while ensuring data consistency and reliability. Doris efficiently handles large-scale data query requests in real-time, meeting the demands of real-time data analysis and processing effectively when combined. -- Data lineage and auditing: For industries with high requirements for data security and accuracy like finance and healthcare, data lineage and auditing are crucial functionalities. Hudi offers Time Travel functionality for viewing historical data states, combined with Apache Doris's efficient querying capabilities, enabling quick analysis of data at any point in time for precise lineage and auditing. -- Incremental data reading and analysis: Large-scale data analysis often faces challenges of large data volumes and frequent updates. Hudi supports incremental data reading, allowing users to process only the changed data without full data updates. Additionally, Apache Doris's Incremental Read feature enhances this process, significantly improving data processing and analysis efficiency. -- Cross-data source federated queries: Many enterprises have complex data sources stored in different databases. Doris's Multi-Catalog feature supports automatic mapping and synchronization of various data sources, enabling federated queries across data sources. This greatly shortens the data flow path and enhances work efficiency for enterprises needing to retrieve and integrate data from multiple sources for analysis. 
+- Data lineage and auditing: For industries with high requirements for data security and accuracy like finance and healthcare, data lineage and auditing are crucial functionalities. Hudi offers Time Travel functionality for viewing historical data states, combined with Apache Doris' efficient querying capabilities, enabling quick analysis of data at any point in time for precise lineage and auditing. +- Incremental data reading and analysis: Large-scale data analysis often faces challenges of large data volumes and frequent updates. Hudi supports incremental data reading, allowing users to process only the changed data without full data updates. Additionally, Apache Doris' Incremental Read feature enhances this process, significantly improving data processing and analysis efficiency. +- Cross-data source federated queries: Many enterprises have complex data sources stored in different databases. Doris' Multi-Catalog feature supports automatic mapping and synchronization of various data sources, enabling federated queries across data sources. This greatly shortens the data flow path and enhances work efficiency for enterprises needing to retrieve and integrate data from multiple sources for analysis. This article will introduce readers to how to quickly set up a test and demonstration environment for Apache Doris + Apache Hudi in a Docker environment, and demonstrate various operations to help readers get started quickly. @@ -258,9 +258,9 @@ spark-sql> select * from customer_mor timestamp as of '20240603015058442' where Data in Apache Hudi can be roughly divided into two categories - baseline data and incremental data. Baseline data is typically merged Parquet files, while incremental data refers to data increments generated by INSERT, UPDATE, or DELETE operations. Baseline data can be read directly, while incremental data needs to be read through Merge on Read. -For querying Hudi COW tables or Read Optimized queries on MOR tables, the data belongs to baseline data and can be directly read using Doris's native Parquet Reader, providing fast query responses. For incremental data, Doris needs to access Hudi's Java SDK through JNI calls. To achieve optimal query performance, Apache Doris divides the data in a query into baseline and incremental data parts and reads them using the aforementioned methods. +For querying Hudi COW tables or Read Optimized queries on MOR tables, the data belongs to baseline data and can be directly read using Doris' native Parquet Reader, providing fast query responses. For incremental data, Doris needs to access Hudi's Java SDK through JNI calls. To achieve optimal query performance, Apache Doris divides the data in a query into baseline and incremental data parts and reads them using the aforementioned methods. -To verify this optimization approach, we can use the EXPLAIN statement to see how many baseline and incremental data are present in a query example below. For a COW table, all 101 data shards are baseline data (`hudiNativeReadSplits=101/101`), so the COW table can be entirely read directly using Doris's Parquet Reader, resulting in the best query performance. For a ROW table, most data shards are baseline data (`hudiNativeReadSplits=100/101`), with one shard being incremental data, which [...] +To verify this optimization approach, we can use the EXPLAIN statement to see how many baseline and incremental data are present in a query example below. 
For a COW table, all 101 data shards are baseline data (`hudiNativeReadSplits=101/101`), so the COW table can be entirely read directly using Doris' Parquet Reader, resulting in the best query performance. For a ROW table, most data shards are baseline data (`hudiNativeReadSplits=100/101`), with one shard being incremental data, which [...] ``` -- COW table is read natively diff --git a/versioned_docs/version-2.1/get-starting/quick-start/doris-paimon.md b/versioned_docs/version-2.1/get-starting/quick-start/doris-paimon.md index cf1956c6e2d..27175dfe180 100644 --- a/versioned_docs/version-2.1/get-starting/quick-start/doris-paimon.md +++ b/versioned_docs/version-2.1/get-starting/quick-start/doris-paimon.md @@ -228,12 +228,12 @@ We conducted a simple test on the TPCDS 1000 dataset in Paimon (0.8) version, us ![](/images/quick-start/lakehouse-paimon-benchmark.PNG) -From the test results, it can be seen that Doris's average query performance on the standard static test set is 3-5 times that of Trino. In the future, we will optimize the Deletion Vector to further improve query efficiency in real business scenarios. +From the test results, it can be seen that Doris' average query performance on the standard static test set is 3-5 times that of Trino. In the future, we will optimize the Deletion Vector to further improve query efficiency in real business scenarios. ## Query Optimization For baseline data, after introducing the Primary Key Table Read Optimized feature in Apache Paimon version 0.6, the query engine can directly access the underlying Parquet/ORC files, significantly improving the reading efficiency of baseline data. For unmerged incremental data (data increments generated by INSERT, UPDATE, or DELETE), they can be read through Merge-on-Read. In addition, Paimon introduced the Deletion Vector feature in version 0.8, which further enhances the query engine's [...] -Apache Doris supports reading Deletion Vector through native Reader and performing Merge on Read. We demonstrate the query methods for baseline data and incremental data in a query using Doris's EXPLAIN statement. +Apache Doris supports reading Deletion Vector through native Reader and performing Merge on Read. We demonstrate the query methods for baseline data and incremental data in a query using Doris' EXPLAIN statement. ``` mysql> explain verbose select * from customer where c_nationkey < 3; diff --git a/versioned_docs/version-3.0/get-starting/quick-start/doris-hudi.md b/versioned_docs/version-3.0/get-starting/quick-start/doris-hudi.md index f7426e3a3fc..b640c5e8f92 100644 --- a/versioned_docs/version-3.0/get-starting/quick-start/doris-hudi.md +++ b/versioned_docs/version-3.0/get-starting/quick-start/doris-hudi.md @@ -31,7 +31,7 @@ In recent versions, Apache Doris has deepened its integration with data lakes an - Since version 0.15, Apache Doris has introduced Hive and Iceberg external tables, exploring the capabilities of combining with Apache Iceberg for data lakes. - Starting from version 1.2, Apache Doris officially introduced the Multi-Catalog feature, enabling automatic metadata mapping and data access for various data sources, along with numerous performance optimizations for external data reading and query execution. It now fully possesses the ability to build a high-speed and user-friendly Lakehouse architecture. 
-- In version 2.1, Apache Doris's Data Lakehouse architecture was significantly enhanced, improving the reading and writing capabilities of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with multiple SQL dialects, and seamless migration from existing systems to Apache Doris. For data science and large-scale data reading scenarios, Doris integrated the Arrow Flight high-speed reading interface, achieving a 100-fold increase in data transfer efficiency. +- In version 2.1, Apache Doris' Data Lakehouse architecture was significantly enhanced, improving the reading and writing capabilities of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with multiple SQL dialects, and seamless migration from existing systems to Apache Doris. For data science and large-scale data reading scenarios, Doris integrated the Arrow Flight high-speed reading interface, achieving a 100-fold increase in data transfer efficiency. ![](/images/quick-start/lakehouse-arch.PNG) @@ -46,12 +46,12 @@ Apache Doris has also enhanced its ability to read Apache Hudi data tables: - Supports Time Travel - Supports Incremental Read -With Apache Doris's high-performance query execution and Apache Hudi's real-time data management capabilities, efficient, flexible, and cost-effective data querying and analysis can be achieved. It also provides robust data lineage, auditing, and incremental processing functionalities. The combination of Apache Doris and Apache Hudi has been validated and promoted in real business scenarios by multiple community users: +With Apache Doris' high-performance query execution and Apache Hudi's real-time data management capabilities, efficient, flexible, and cost-effective data querying and analysis can be achieved. It also provides robust data lineage, auditing, and incremental processing functionalities. The combination of Apache Doris and Apache Hudi has been validated and promoted in real business scenarios by multiple community users: - Real-time data analysis and processing: Common scenarios such as real-time data updates and query analysis in industries like finance, advertising, and e-commerce require real-time data processing. Hudi enables real-time data updates and management while ensuring data consistency and reliability. Doris efficiently handles large-scale data query requests in real-time, meeting the demands of real-time data analysis and processing effectively when combined. -- Data lineage and auditing: For industries with high requirements for data security and accuracy like finance and healthcare, data lineage and auditing are crucial functionalities. Hudi offers Time Travel functionality for viewing historical data states, combined with Apache Doris's efficient querying capabilities, enabling quick analysis of data at any point in time for precise lineage and auditing. -- Incremental data reading and analysis: Large-scale data analysis often faces challenges of large data volumes and frequent updates. Hudi supports incremental data reading, allowing users to process only the changed data without full data updates. Additionally, Apache Doris's Incremental Read feature enhances this process, significantly improving data processing and analysis efficiency. -- Cross-data source federated queries: Many enterprises have complex data sources stored in different databases. Doris's Multi-Catalog feature supports automatic mapping and synchronization of various data sources, enabling federated queries across data sources. 
This greatly shortens the data flow path and enhances work efficiency for enterprises needing to retrieve and integrate data from multiple sources for analysis. +- Data lineage and auditing: For industries with high requirements for data security and accuracy like finance and healthcare, data lineage and auditing are crucial functionalities. Hudi offers Time Travel functionality for viewing historical data states, combined with Apache Doris' efficient querying capabilities, enabling quick analysis of data at any point in time for precise lineage and auditing. +- Incremental data reading and analysis: Large-scale data analysis often faces challenges of large data volumes and frequent updates. Hudi supports incremental data reading, allowing users to process only the changed data without full data updates. Additionally, Apache Doris' Incremental Read feature enhances this process, significantly improving data processing and analysis efficiency. +- Cross-data source federated queries: Many enterprises have complex data sources stored in different databases. Doris' Multi-Catalog feature supports automatic mapping and synchronization of various data sources, enabling federated queries across data sources. This greatly shortens the data flow path and enhances work efficiency for enterprises needing to retrieve and integrate data from multiple sources for analysis. This article will introduce readers to how to quickly set up a test and demonstration environment for Apache Doris + Apache Hudi in a Docker environment, and demonstrate various operations to help readers get started quickly. @@ -258,9 +258,9 @@ spark-sql> select * from customer_mor timestamp as of '20240603015058442' where Data in Apache Hudi can be roughly divided into two categories - baseline data and incremental data. Baseline data is typically merged Parquet files, while incremental data refers to data increments generated by INSERT, UPDATE, or DELETE operations. Baseline data can be read directly, while incremental data needs to be read through Merge on Read. -For querying Hudi COW tables or Read Optimized queries on MOR tables, the data belongs to baseline data and can be directly read using Doris's native Parquet Reader, providing fast query responses. For incremental data, Doris needs to access Hudi's Java SDK through JNI calls. To achieve optimal query performance, Apache Doris divides the data in a query into baseline and incremental data parts and reads them using the aforementioned methods. +For querying Hudi COW tables or Read Optimized queries on MOR tables, the data belongs to baseline data and can be directly read using Doris' native Parquet Reader, providing fast query responses. For incremental data, Doris needs to access Hudi's Java SDK through JNI calls. To achieve optimal query performance, Apache Doris divides the data in a query into baseline and incremental data parts and reads them using the aforementioned methods. -To verify this optimization approach, we can use the EXPLAIN statement to see how many baseline and incremental data are present in a query example below. For a COW table, all 101 data shards are baseline data (`hudiNativeReadSplits=101/101`), so the COW table can be entirely read directly using Doris's Parquet Reader, resulting in the best query performance. For a ROW table, most data shards are baseline data (`hudiNativeReadSplits=100/101`), with one shard being incremental data, which [...] 
+To verify this optimization approach, we can use the EXPLAIN statement to see how many baseline and incremental data are present in a query example below. For a COW table, all 101 data shards are baseline data (`hudiNativeReadSplits=101/101`), so the COW table can be entirely read directly using Doris' Parquet Reader, resulting in the best query performance. For a ROW table, most data shards are baseline data (`hudiNativeReadSplits=100/101`), with one shard being incremental data, which [...] ``` -- COW table is read natively diff --git a/versioned_docs/version-3.0/get-starting/quick-start/doris-paimon.md b/versioned_docs/version-3.0/get-starting/quick-start/doris-paimon.md index cf1956c6e2d..27175dfe180 100644 --- a/versioned_docs/version-3.0/get-starting/quick-start/doris-paimon.md +++ b/versioned_docs/version-3.0/get-starting/quick-start/doris-paimon.md @@ -228,12 +228,12 @@ We conducted a simple test on the TPCDS 1000 dataset in Paimon (0.8) version, us ![](/images/quick-start/lakehouse-paimon-benchmark.PNG) -From the test results, it can be seen that Doris's average query performance on the standard static test set is 3-5 times that of Trino. In the future, we will optimize the Deletion Vector to further improve query efficiency in real business scenarios. +From the test results, it can be seen that Doris' average query performance on the standard static test set is 3-5 times that of Trino. In the future, we will optimize the Deletion Vector to further improve query efficiency in real business scenarios. ## Query Optimization For baseline data, after introducing the Primary Key Table Read Optimized feature in Apache Paimon version 0.6, the query engine can directly access the underlying Parquet/ORC files, significantly improving the reading efficiency of baseline data. For unmerged incremental data (data increments generated by INSERT, UPDATE, or DELETE), they can be read through Merge-on-Read. In addition, Paimon introduced the Deletion Vector feature in version 0.8, which further enhances the query engine's [...] -Apache Doris supports reading Deletion Vector through native Reader and performing Merge on Read. We demonstrate the query methods for baseline data and incremental data in a query using Doris's EXPLAIN statement. +Apache Doris supports reading Deletion Vector through native Reader and performing Merge on Read. We demonstrate the query methods for baseline data and incremental data in a query using Doris' EXPLAIN statement. ``` mysql> explain verbose select * from customer where c_nationkey < 3; --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org