(doris-website) branch master updated: [blog](update) Update Blog : How Inverted Index Accelerates Text Searches by 40 Times (#400)

luzhijing Wed, 31 Jan 2024 22:32:58 -0800

This is an automated email from the ASF dual-hosted git repository.

luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git



The following commit(s) were added to refs/heads/master by this push:
     new 5d8477b2520 [blog](update) Update Blog : How Inverted Index 
Accelerates Text Searches by 40 Times (#400)
5d8477b2520 is described below

commit 5d8477b252002669da6ca8c4c5a63e8e79216bf6
Author: KassieZ <[email protected]>
AuthorDate: Thu Feb 1 14:32:48 2024 +0800

    [blog](update) Update Blog : How Inverted Index Accelerates Text Searches 
by 40 Times (#400)
---
 ...al-time-data-warehouse-based-on-apache-doris.md |   4 +-
 ...ta-reporting-tagging-and-data-lake-analytics.md |   2 +-
 ...expect-from-apache-doris-as-a-data-warehouse.md |   2 -
 ...erates-text-searches-by-40-time-apache-doris.md | 505 +++++++++++++++++++++
 blog/release-note-2.0.4.md                         |   6 +-
 src/components/recent-blogs/recent-blogs.data.ts   |   2 +-
 src/constant/newsletter.data.ts                    |  23 +-
 .../data-storage-structure-in-apache-doris.png     | Bin 0 -> 92143 bytes
 ...index-accelerates-text-searches-by-40-times.png | Bin 0 -> 186259 bytes
 static/images/illustration-of-inverted-index.png   | Bin 0 -> 116559 bytes
 .../images/illustration-of-ngram-bloomfilter.png   | Bin 0 -> 138100 bytes
 static/images/sample-dataset.png                   | Bin 0 -> 209350 bytes
 12 files changed, 527 insertions(+), 17 deletions(-)

diff --git 
a/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris.md
 
b/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris.md
index 9d1f66a63c2..8a8b0d7c8c2 100644
--- 
a/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris.md
+++ 
b/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris.md
@@ -1,12 +1,12 @@
 ---
 {
-    'title': "The financial sector's choice: fast, secure, and highly 
available real-time data warehousing based on Apache Doris",
+    'title': "Financial data warehousing: fast, secure, and highly available 
with Apache Doris",
     'summary': "A whole-journey guide for financial users looking for fast 
data processing performance, data security, and high service availability.",
     'date': '2024-01-08',
     'author': 'Apache Doris',
     'tags': ['Best Practice'],
     'picked': "true",
-    'order': "2",
+    'order': "3",
     "image": 
'/images/apache-doris-a-fast-secure-and-highly-available-real-time-data-warehouse.png'
 }
 
diff --git 
a/blog/apache-doris-speeds-up-data-reporting-tagging-and-data-lake-analytics.md 
b/blog/apache-doris-speeds-up-data-reporting-tagging-and-data-lake-analytics.md
index 7e58434d80b..3baf2b75bc2 100644
--- 
a/blog/apache-doris-speeds-up-data-reporting-tagging-and-data-lake-analytics.md
+++ 
b/blog/apache-doris-speeds-up-data-reporting-tagging-and-data-lake-analytics.md
@@ -6,7 +6,7 @@
     'author': 'Apache Doris',
     'tags': ['Best Practice'],
     'picked': "true",
-    'order': "3",
+    'order': "4",
     "image": 
'/images/apache-doris-speeds-up-data-reporting-data-lake-analytics.jpg'
 }
 
diff --git 
a/blog/apache-doris-summit-asia-2023-what-can-you-expect-from-apache-doris-as-a-data-warehouse.md
 
b/blog/apache-doris-summit-asia-2023-what-can-you-expect-from-apache-doris-as-a-data-warehouse.md
index 1d60c27f704..89ae8c16ec7 100644
--- 
a/blog/apache-doris-summit-asia-2023-what-can-you-expect-from-apache-doris-as-a-data-warehouse.md
+++ 
b/blog/apache-doris-summit-asia-2023-what-can-you-expect-from-apache-doris-as-a-data-warehouse.md
@@ -5,8 +5,6 @@
     'date': '2023-11-10',
     'author': 'Apache Doris',
     'tags': ['Top News'],
-    'picked': "true",
-    'order': "4",
     "image": '/images/doris-summit-asia.png'
 }
 
diff --git 
a/blog/inverted-index-accelerates-text-searches-by-40-time-apache-doris.md 
b/blog/inverted-index-accelerates-text-searches-by-40-time-apache-doris.md
new file mode 100644
index 00000000000..7c001930895
--- /dev/null
+++ b/blog/inverted-index-accelerates-text-searches-by-40-time-apache-doris.md
@@ -0,0 +1,505 @@
+---
+{
+    'title': 'A deep dive into inverted index: how it speeds up text searches 
by 40 times',
+    'summary': "As an open-source real-time data warehouse, Apache Doris 
provides a rich choice of indexes to speed up data scanning and filtering.This 
post is a deep dive into inverted index and NGram BloomFilter index, providing 
a hands-on guide to applying them for various queries.",
+    'date': '2024-02-01',
+    'author': 'Apache Doris',
+    'tags': ['Tech Sharing'],
+    'picked': "true",
+    'order': "1",
+    "image": 
'/images/how-inverted-index-accelerates-text-searches-by-40-times.png'
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+As an open-source real-time data warehouse, [Apache 
Doris](https://doris.apache.org) provides a rich choice of indexes to speed up 
data scanning and filtering. Based on user involvement, they can be divided 
into built-in smart indexes and user-created indexes. The former is 
automatically generated by Apache Doris on data ingestion, such as ZoneMap 
index and prefix index, while the latter is the index users choose for various 
use cases, including inverted index and NGram BloomFilter index.
+
+This post is a deep dive into inverted index and NGram BloomFilter index, 
providing a hands-on guide to applying them for various queries.
+
+## Sample dataset
+
+The test dataset comprises about 130 million Amazon customer reviews. It is a 
few Snappy-compressed Parquet files with a total size of 37GB. These are a few 
samples:
+
+![img](../static/images/sample-dataset.png)
+
+Each row includes 15 columns including `customer_id`, `review_id`, 
`product_id`, `product_category`, `star_rating`, `review_headline`, and 
`review_body`. 
+
+A lot of these columns can be accelerated by indexes based on their 
structures. For example, `customer_id` is a high-cardinality numerical field 
while `product_id` is a low-cardinality fixed-length text field, and 
`product_title` and `review_body` are short and long text fields, respectively.
+
+Queries on these columns can be roughly divided into two types:
+
+- **Text searches**: searches for certain contents in the `review_body` field.
+- **Non-primary key column queries**: query reviews about certain `product_id` 
or from certain `customer_id`.
+
+These are also the main threads of this article. I will present to you how 
indexes can speed up these queries.
+
+## Prerequisites
+
+For a quick run, here we use a single-node cluster (1 frontend, 1 backend).
+
+1. Deploy Apache Doris: refer to [Quick 
Start](https://doris.apache.org/docs/get-starting/quick-start/)
+2. Create a table using the following statements: 
+
+```SQL
+CREATE TABLE `amazon_reviews` (  
+  `review_date` int(11) NULL,  
+  `marketplace` varchar(20) NULL,  
+  `customer_id` bigint(20) NULL,  
+  `review_id` varchar(40) NULL,
+  `product_id` varchar(10) NULL,
+  `product_parent` bigint(20) NULL,
+  `product_title` varchar(500) NULL,
+  `product_category` varchar(50) NULL,
+  `star_rating` smallint(6) NULL,
+  `helpful_votes` int(11) NULL,
+  `total_votes` int(11) NULL,
+  `vine` boolean NULL,
+  `verified_purchase` boolean NULL,
+  `review_headline` varchar(500) NULL,
+  `review_body` string NULL
+) ENGINE=OLAP
+DUPLICATE KEY(`review_date`)
+COMMENT 'OLAP'
+DISTRIBUTED BY HASH(`review_date`) BUCKETS 16
+PROPERTIES (
+"replication_allocation" = "tag.location.default: 1",
+"compression" = "ZSTD"
+);
+```
+
+3.  Download datasets: Snappy-compressed Parquet files with a total size of 
37GB
+   - 
[amazon_reviews_2010](https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2010.snappy.parquet)
+   - 
[amazon_reviews_2011](https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2011.snappy.parquet)
+   - 
[amazon_reviews_2012](https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2012.snappy.parquet)
+   - 
[amazon_reviews_2013](https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2013.snappy.parquet)
+   - 
[amazon_reviews_2014](https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2014.snappy.parquet)
 
+   - 
[amazon_reviews_2015](https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet)
+4. Execute the following commands to load the datasets
+
+```Bash
+curl --location-trusted -u root: -T amazon_reviews_2010.snappy.parquet -H 
"format:parquet" 
http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load
+curl --location-trusted -u root: -T amazon_reviews_2011.snappy.parquet -H 
"format:parquet" 
http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load
+curl --location-trusted -u root: -T amazon_reviews_2012.snappy.parquet -H 
"format:parquet" 
http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load
+curl --location-trusted -u root: -T amazon_reviews_2013.snappy.parquet -H 
"format:parquet" 
http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load
+curl --location-trusted -u root: -T amazon_reviews_2014.snappy.parquet -H 
"format:parquet" 
http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load
+curl --location-trusted -u root: -T amazon_reviews_2015.snappy.parquet -H 
"format:parquet" 
http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load
+```
+
+5. Check and verify: After the above steps, execute the following statements 
in the MySQL client to check and see the size of the dataset. It can be seen 
from below that 135589433 rows are loaded and they take up 25.873GB in Apache 
Doris, which is 30% smaller than the original Parquet files. 
+
+```SQL
+mysql> SELECT COUNT() FROM amazon_reviews;
++-----------+
+| count(*)  |
++-----------+
+| 135589433 |
++-----------+
+1 row in set (0.02 sec)
+mysql> SHOW DATA FROM amazon_reviews;
++----------------+----------------+-----------+--------------+-----------+------------+
+| TableName      | IndexName      | Size      | ReplicaCount | RowCount  | 
RemoteSize |
++----------------+----------------+-----------+--------------+-----------+------------+
+| amazon_reviews | amazon_reviews | 25.873 GB | 16           | 135589433 | 
0.000      |
+|                | Total          | 25.873 GB | 16           |           | 
0.000      |
++----------------+----------------+-----------+--------------+-----------+------------+
+2 rows in set (0.00 sec)
+```
+
+## Accelerate text searches
+
+### No index
+
+Now let's try running text searches on the `review_body` field. Specifically, 
we're trying to retrieve the top 5 products whose reviews include the keywords 
"is super awesome". The results should be sorted in descending order based on 
the number of reviews. Each result should include the product ID, a randomly 
selected product title, the average star rating, and the total number of 
reviews. 
+
+This is the query statement:
+
+```SQL
+SELECT
+    product_id,
+    any(product_title),
+    AVG(star_rating) AS rating,
+    COUNT() AS count
+FROM
+    amazon_reviews
+WHERE
+    review_body LIKE '%is super awesome%'
+GROUP BY
+    product_id
+ORDER BY
+    count DESC,
+    rating DESC,
+    product_id
+LIMIT 5;
+```
+
+Since the `review_body` field contains lengthy reviews, such text searches can 
be time-consuming. Without enabling any indexes, it took **7.6 seconds** to 
return the results: 
+
+```SQL
++------------+------------------------------------------+--------------------+-------+
+| product_id | any_value(product_title)                 | rating             | 
count |
++------------+------------------------------------------+--------------------+-------+
+| B00992CF6W | Minecraft                                | 4.8235294117647056 | 
   17 |
+| B009UX2YAC | Subway Surfers                           | 4.7777777777777777 | 
    9 |
+| B00DJFIMW6 | Minion Rush: Despicable Me Official Game |              4.875 | 
    8 |
+| B0086700CM | Temple Run                               |                  5 | 
    6 |
+| B00KWVZ750 | Angry Birds Epic RPG                     |                  5 | 
    6 |
++------------+------------------------------------------+--------------------+-------+
+5 rows in set (7.60 sec)
+```
+
+### Ngram BloomFilter index
+
+Now, let's try accelerating such text searches using the Ngram BloomFilter 
index.
+
+- `gram_size`: the value of "N" in "Ngram", representing the length of 
consecutive characters. In the snippet below, `"gram_size"="10"` means that the 
texts will be divided into a number of 10-character strings, which are the 
basis of the Ngram BloomFilter index.
+- `bf_size`: the size of the BloomFilter in bytes. `"bf_size"="10240"` 
indicates that the BloomFilter occupies 10240 bytes of space.
+
+```SQL
+ALTER TABLE amazon_reviews ADD INDEX review_body_ngram_idx(review_body) USING 
NGRAM_BF PROPERTIES("gram_size"="10", "bf_size"="10240");
+```
+
+This time, the query is finished within **0.93 seconds**. That means Ngram 
BloomFilter brings a speedup of **8 times**.
+
+```SQL
++------------+------------------------------------------+--------------------+-------+
+| product_id | any_value(product_title)                 | rating             | 
count |
++------------+------------------------------------------+--------------------+-------+
+| B00992CF6W | Minecraft                                | 4.8235294117647056 | 
   17 |
+| B009UX2YAC | Subway Surfers                           | 4.7777777777777777 | 
    9 |
+| B00DJFIMW6 | Minion Rush: Despicable Me Official Game |              4.875 | 
    8 |
+| B0086700CM | Temple Run                               |                  5 | 
    6 |
+| B00KWVZ750 | Angry Birds Epic RPG                     |                  5 | 
    6 |
++------------+------------------------------------------+--------------------+-------+
+5 rows in set (0.93 sec)
+```
+
+**So how does Ngram BloomFilter do the magic?** The way it works can be 
explained in two parts.
+
+- **Ngram tokenization**: When `gram_size=5`, the phrase "hello world" is 
split into ["hello", "ello ", "llo w", "lo wo", "o wor", " worl", "world"]. 
These sub-strings are then hashed and added to a BloomFilter of the `bf_size`. 
Since data in Apache Doris is stored by page, the BloomFilters are generated 
also by page. 
+- **Query acceleration**: For example, to query the word "hello" in the texts, 
"hello" is tokenized and compared with the BloomFilters of each page. If the 
BloomFilter detects a potential match (there might be false positives) in a 
page, that page is loaded for further matching. Otherwise, that page is 
skipped. 
+
+By skipping the irrelevant pages, the BloomFilter index reduces unnecessary 
data scanning and thus greatly reduces query latency.
+
+![img](../static/images/data-storage-structure-in-apache-doris.png)
+
+<div style={{textAlign:'center'}}> Data storage structure in Apache Doris 
</div >
+
+![img](../static/images/illustration-of-ngram-bloomfilter.png)
+
+<div style={{textAlign:'center'}}> Illustration of Ngram BloomFilter </div >
+
+**How to find the optimal parameter configurations for Ngram BloomFilter?**
+
+`gram_size` determines the matching efficiency, while `bf_size` impacts the 
false positive rate. Typically, a large `bf_size` reduces the false positive 
rate but also requires more storage space. Thus, we suggest that you configure 
these two parameters based on these two factors: 
+
+1. Text length:
+   - For short texts (words or phrases), a small `gram_size` (2~4) and a small 
`bf_size` are recommended.
+   - For long texts (sentences or paragraphs), a large `gram_size` (5~10) and 
a large `bf_size` work better.
+2. Query pattern: 
+   - If the queries often involve phrases or complete words, a large 
`gram_size` will be more efficient.
+   - For fuzzy matching or diverse queries, a small `gram_size` allows more 
flexible matching.
+
+### Inverted index
+
+[Inverted 
index](https://doris.apache.org/blog/Building-A-Log-Analytics-Solution-10-Times-More-Cost-Effective-Than-Elasticsearch)
 is another way to accelerate text searches. Creating inverted index is simple:
+
+1.  **Add inverted index**: Refer to the snippet below to create inverted 
index for the `review_body` column of the `amazon_reviews` table. Inverted 
index supports phrase searching, in which the order of the tokenized words will 
affect the search results.
+2. **Add inverted index for historical data**: You can also create inverted 
index for historical data.
+
+```SQL
+ALTER TABLE amazon_reviews ADD INDEX review_body_inverted_idx(`review_body`) 
+    USING INVERTED PROPERTIES("parser" = "english","support_phrase" = "true"); 
+BUILD INDEX review_body_inverted_idx ON amazon_reviews;
+```
+
+3. **Check and verify**: You can check and see the created indexes using the 
following statement:
+
+```SQL
+mysql> show BUILD INDEX WHERE TableName="amazon_reviews";
++-------+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+| JobId | TableName      | PartitionName  | AlterInvertedIndexes               
                                                                                
               | CreateTime              | FinishTime              | 
TransactionId | State    | Msg  | Progress |
++-------+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+| 10152 | amazon_reviews | amazon_reviews | [ADD INDEX 
review_body_inverted_idx (
+review_body
+) USING INVERTED PROPERTIES("parser" = "english", "support_phrase" = "true")], 
 | 2024-01-23 15:42:28.658 | 2024-01-23 15:48:42.990 | 11            | FINISHED 
|      | NULL     |
++-------+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+1 row in set (0.00 sec)
+```
+
+If you want to see how tokenization works, you can test with the `TOKENIZE` 
function. Just input the text that needs to be tokenized and the parameters: 
+
+```SQL
+mysql> SELECT TOKENIZE('I can honestly give the shipment and package 100%, it 
came in time that it was supposed to with no hasels, and the book was in 
PERFECT condition.
+super awesome buy, and excellent for my college classs', '"parser" = 
"english","support_phrase" = "true"');
++---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| tokenize('I can honestly give the shipment and package 100%, it came in time 
that it was supposed to with no hasels, and the book was in PERFECT condition. 
super awesome buy, and excellent for my college classs', '"parser" = 
"english","support_phrase" = "true"')                                           
   |
++---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ["i", "can", "honestly", "give", "the", "shipment", "and", "package", "100", 
"it", "came", "in", "time", "that", "it", "was", "supposed", "to", "with", 
"no", "hasels", "and", "the", "book", "was", "in", "perfect", "condition", 
"super", "awesome", "buy", "and", "excellent", "for", "my", "college", 
"classs"] |
++---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+1 row in set (0.05 sec)
+```
+
+With inverted index, now we retrieve customer reviews containing "is super 
awesome" using `MATCH_PHRASE`. 
+
+```SQL
+SELECT
+    product_id,
+    any(product_title),
+    AVG(star_rating) AS rating,
+    COUNT() AS count
+FROM
+    amazon_reviews
+WHERE
+    review_body MATCH_PHRASE 'is super awesome'
+GROUP BY
+    product_id
+
+ORDER BY
+    count DESC,
+    rating DESC,
+    product_id
+LIMIT 5;
+```
+
+The clause `review_body MATCH_PHRASE 'is super awesome'` searches for text 
fragments in the `review_body` column that contains all three keywords "is", 
"super", and "awesome" in that exact order, with no other words in between. 
+
+The `MATCH` query is case-insensitive, which is also what sets it apart from 
the `LIKE` query. The `MATCH` query is more efficient in large datasets.
+
+Results show that inverted index has decreased the query latency to **0.19 
seconds**, bringing a **4-time performance increase** compared to the Ngram 
BloomFilter index, and **a nearly 40-time increase** compared to having no 
indexes at all.
+
+```SQL
++------------+------------------------------------------+-------------------+-------+
+| product_id | any_value(product_title)                 | rating            | 
count |
++------------+------------------------------------------+-------------------+-------+
+| B00992CF6W | Minecraft                                | 4.833333333333333 |  
  18 |
+| B009UX2YAC | Subway Surfers                           |               4.7 |  
  10 |
+| B00DJFIMW6 | Minion Rush: Despicable Me Official Game |                 5 |  
   7 |
+| B0086700CM | Temple Run                               |                 5 |  
   6 |
+| B00KWVZ750 | Angry Birds Epic RPG                     |                 5 |  
   6 |
++------------+------------------------------------------+-------------------+-------+
+5 rows in set (0.19 sec)
+```
+
+**How does inverted index make it possible?**
+
+Inverted index splits the texts into words and maps each word to a row number. 
Then the tokenized words are sorted alphabetically and and a skip list index is 
created. When executing queries of specific words, the system locates the row 
numbers in this orderly mapping using the skip list index and binary search 
methods. Based on the row numbers, the system retrieves the entire data record. 
+
+This approach avoids line-by-line matching and reduces computational 
complexity from O(n) to O(logn). That's how inverted index speeds up queries on 
large datasets. 
+
+![img](../static/images/illustration-of-inverted-index.png)
+
+*Illustration of Inverted Index*
+
+To provide a deeper understanding of inverted index, I will start from its 
read/write logic. In Doris, logically, inverted index is applied at the column 
level of a table. However, from a physical storage and implementation 
perspective, it is actually built on data files. 
+
+- **Writing**: When data is written to a data file, it is also synchronously 
written to the inverted index file, and the row numbers are matched.
+- **Query**: In a query, if the `WHERE` condition involves a column for which 
an inverted index has been built, Doris will go directly to the index file and 
returns the corresponding row numbers. Then, based on the row numbers, it skips 
the irrelevant pages and rows and only reads the target rows. 
+
+In short, inverted index enables high-speed text searches by mapping, and its 
implementation relies on the coordination of data files and index files.
+
+## Accelerate non-primary key column queries
+
+To showcase the impact of inverted index on non-primary key column queries, 
let's try some multi-dimensional queries.
+
+### No index
+
+Retrieve the review from Customer ID 13916588 about Product ID B002DMK1R0. 
Without indexes, the system has to scan the entire table. The query is finished 
within **1.81 seconds**.
+
+```SQL
+mysql> SELECT product_title,review_headline,review_body,star_rating 
+FROM amazon_reviews 
+WHERE product_id='B002DMK1R0' AND customer_id=13916588;
++-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
+| product_title                                                   | 
review_headline      | review_body                                              
                                                                   | 
star_rating |
++-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
+| Magellan Maestro 4700 4.7-Inch Bluetooth Portable GPS Navigator | Nice 
Features But... | This is a great GPS. Gets you where you are going. Don't 
forget to buy the seperate (grr!) cord for the traffic kit though! |           
4 |
++-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
+1 row in set (1.81 sec)
+```
+
+### Inverted index
+
+This query is executed in a different way than what is said above, because the 
system does not have to tokenize the `product_id` and `customer_id`, but 
creates a Value→RowID inverted index table. 
+
+First of all, create inverted index via the following statement: 
+
+```SQL
+ALTER TABLE amazon_reviews ADD INDEX product_id_inverted_idx(product_id) USING 
INVERTED ;
+ALTER TABLE amazon_reviews ADD INDEX customer_id_inverted_idx(customer_id) 
USING INVERTED ;
+BUILD INDEX product_id_inverted_idx ON amazon_reviews;
+BUILD INDEX customer_id_inverted_idx ON amazon_reviews;
+```
+
+With inverted index, the same query is finished within **0.06 seconds**. That 
represents a **30-time** higher speed compared to the previous 1.81 seconds.
+
+```SQL
+mysql> SELECT product_title,review_headline,review_body,star_rating FROM 
amazon_reviews WHERE product_id='B002DMK1R0' AND customer_id='13916588';
++-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
+| product_title                                                   | 
review_headline      | review_body                                              
                                                                   | 
star_rating |
++-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
+| Magellan Maestro 4700 4.7-Inch Bluetooth Portable GPS Navigator | Nice 
Features But... | This is a great GPS. Gets you where you are going. Don't 
forget to buy the seperate (grr!) cord for the traffic kit though! |           
4 |
++-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
+1 row in set (0.06 sec)
+```
+
+### Profile
+
+This is an excerpt of the SegmentIterator Profile, from which you can tell why 
inverted index accelerates query execution.
+
+(Note that if you need to check the Profile of a query, make sure you have 
executed `SET enable_profile=true;` in the MySQL client before you execute the 
query. Then you can check the Profile at 
*http://FE_IP:FE_HTTP_PORT/QueryProfile*)
+
+```YAML
+SegmentIterator:
+  - FirstReadSeekCount: 0
+  - FirstReadSeekTime: 0ns
+  - FirstReadTime: 13.119ms
+  - IOTimer: 19.537ms
+  - InvertedIndexQueryTime: 11.583ms
+  - RawRowsRead: 1
+  - RowsConditionsFiltered: 0
+  - RowsInvertedIndexFiltered: 16.907403M (16907403)
+  - RowsShortCircuitPredInput: 0
+  - RowsVectorPredFiltered: 0
+  - RowsVectorPredInput: 0
+  - ShortPredEvalTime: 0ns
+  - TotalPagesNum: 27
+  - UncompressedBytesRead: 3.71 MB
+  - VectorPredEvalTime: 0ns
+```
+
+`RowsInvertedIndexFiltered: 16.907403M (16907403)` and `RawRowsRead: 1` means 
that the inverted index has filtered out 16907403 rows and only reads 1 row 
(the target row). `FirstReadTime: 13.119ms` means that it takes 13.119 ms to 
read the page where the target row is located, and `InvertedIndexQueryTime: 
11.583ms` means that the system **filters out 16907403 rows within only 11.58 
ms**. 
+
+For comparision, this is the SegmentIterator Profile when no index is used:
+
+```YAML
+SegmentIterator:
+  - FirstReadSeekCount: 9.374K (9374)
+  - FirstReadSeekTime: 400.522ms
+  - FirstReadTime: 3s144ms
+  - IOTimer: 2s564ms
+  - InvertedIndexQueryTime: 0ns
+  - RawRowsRead: 16.680706M (16680706)
+  - RowsConditionsFiltered: 226.698K (226698)
+  - RowsInvertedIndexFiltered: 0
+  - RowsShortCircuitPredInput: 1
+  - RowsVectorPredFiltered: 16.680705M (16680705)
+  - RowsVectorPredInput: 16.680706M (16680706)
+  - RowsZonemapFiltered: 226.698K (226698)
+  - ShortPredEvalTime: 2.723ms
+  - TotalPagesNum: 5.421K (5421)
+  - UncompressedBytesRead: 277.05 MB
+  - VectorPredEvalTime: 8.114ms
+```
+
+Without inverted index, it takes 3.14s to load 16680706 rows (`FirstReadTime: 
3s144ms`). Then, the system conducts filtering by Predicate Evaluate and 
screens out 16680705 rows. The conditional filtering process only takes less 
than 10ms, making original data loading the most time-consuming task.
+
+To sum up, inverted index increases query execution efficiency by enabling 
quick retrieval of the target rows and thus reducing unnecessary data loading.
+
+## Accelerate low-cardinality text column queries
+
+So inverted index is a big accelerator for queries on high-cardinality text 
columns, but that might raise a concern: For low-cardinality columns, will too 
many indexes bring excessive overheads and undermine query performance? 
+
+The answer is: no. Let me show you why and how. The following example uses 
`product_category` as the predicate column for filtering. 
+
+```SQL
+mysql> SELECT COUNT(DISTINCT product_category) FROM amazon_reviews ;
++----------------------------------+
+| count(DISTINCT product_category) |
++----------------------------------+
+|                               43 |
++----------------------------------+
+1 row in set (0.57 sec)
+```
+
+As is shown, the `product_category` column has only 43 distinct categories, 
making it a typical low-cardinality text column. Now, let's add inverted index 
to it.
+
+```SQL
+ALTER TABLE amazon_reviews ADD INDEX 
product_category_inverted_idx(`product_category`) USING INVERTED;
+BUILD INDEX product_category_inverted_idx ON amazon_reviews;
+```
+
+After adding inverted index, run the following SQL query to retrieve the top 3 
products with the most reviews in the "Mobile_Electronics" product category. 
+
+```SQL
+SELECT 
+    product_id,
+    product_title,
+    AVG(star_rating) AS rating,
+    any(review_body),
+    any(review_headline),
+    COUNT(*) AS count 
+FROM 
+    amazon_reviews 
+WHERE 
+    product_category = 'Mobile_Electronics' 
+GROUP BY 
+    product_title, product_id 
+ORDER BY 
+    count DESC 
+LIMIT 10;
+```
+
+With inverted index, the query takes 1.54s to finish.
+
+```SQL
++------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 [...]
+| product_id | product_title                                                   
                                                                                
                                                       | rating             | 
any_value(review_body)                                                          
                                                                                
                                                                                
                [...]
++------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 [...]
+| B00J46XO9U | iXCC Lightning Cable 3ft, iPhone charger, for iPhone X, 8, 8 
Plus, 7, 7 Plus, 6s, 6s Plus, 6, 6 Plus, SE 5s 5c 5, iPad Air 2 Pro, iPad mini 
2 3 4, iPad 4th Gen [Apple MFi Certified](Black and White) | 4.3766233766233764 
| Great cable and works well. Exact fit as Apple cable. I would recommend this 
to anyone who is looking to save money and for a quality cable.                 
                                                                                
                   [...]
+| B004911E9M | Wall AC Charger USB Sync Data Cable for iPhone 4, 3GS, and iPod 
                                                                                
                                                       | 2.4281805745554035 | A 
total waste of money for me because I needed it for a iPhone 4.  The plug will 
only go in upside down and thus won't work at all.                              
                                                                                
               [...]
+| B002D4IHYM | New Trent Easypak 7000mAh Portable Triple USB Port External 
Battery Charger/Power Pack for Smartphones, Tablets and more (w/built-in USB 
cable)                                                        | 
4.5216095380029806 | I bought this product based on the reviews that i read and 
i am very glad that i did. I did have a problem with the product charging my 
itouch after i received it but i emailed the company and they corrected the 
problem immediately. VERY GOOD customer ser [...]
++------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 [...]
+3 rows in set (1.54 sec)
+```
+
+Now let's try again without enabling inverted index. The same query takes 1.8s 
to finish. (You can simply disable inverted index by executing `set 
enable_inverted_index_query=false;` in the MySQL client.)
+
+```SQL
++------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 [...]
+| product_id | product_title                                                   
                                                                                
                                                       | rating             | 
any_value(review_body)                                                          
                                                                                
                                                                                
                [...]
++------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 [...]
+| B00J46XO9U | iXCC Lightning Cable 3ft, iPhone charger, for iPhone X, 8, 8 
Plus, 7, 7 Plus, 6s, 6s Plus, 6, 6 Plus, SE 5s 5c 5, iPad Air 2 Pro, iPad mini 
2 3 4, iPad 4th Gen [Apple MFi Certified](Black and White) | 4.3766233766233764 
| These cables are great. They feel quality, and best of all, they work as they 
should. I have no issues with them whatsoever and will be buying more when 
needed.                                                                         
                       [...]
+| B004911E9M | Wall AC Charger USB Sync Data Cable for iPhone 4, 3GS, and iPod 
                                                                                
                                                       | 2.4281805745554035 | I 
ordered two of these chargers for an Iphone 4. Then I started experiencing 
weird behavior from the touch screen. It would select the wrong area of the 
screen, or it would refuse to scroll beyond a certain point and jump back up to 
the top of the page. T [...]
+| B002D4IHYM | New Trent Easypak 7000mAh Portable Triple USB Port External 
Battery Charger/Power Pack for Smartphones, Tablets and more (w/built-in USB 
cable)                                                        | 
4.5216095380029806 | I received this in the mail 4 days ago, and after charging 
it for 6 hours, I've been using it as the sole source for recharging my 3Gs to 
see how long it would work.  I use my Iphone A LOT every day and usually by the 
time I get home it's down to 50% or l [...]
++------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 [...]
+3 rows in set (1.80 sec)
+```
+
+To sum up, inverted index can bring a 15% speedup for queries on 
low-cardinality columns. So it is not only harmless but also beneficial to 
low-cardinality data filtering.
+
+In addition, Apache Doris adopts effective dictionary encoding and compression 
for low-cardinality columns. It also utilizes built-in indexes like ZoneMap for 
filtering. Thus, it can deliver ideal query performance even without inverted 
indexes.
+
+## Conclusion
+
+Inverted index in Apache Doris optimizes data filtering based on the predicate 
column (the `WHERE` clause in SQL queries). It reduces unnecessary data 
scanning and significantly increases query speed on high-cardinality columns 
and guarantees no negative effects on low-cardinality columns. It supports 
lightweight index management, including ADD/DROP INDEX and BUILD INDEX. It can 
be easily enabled or disabled via `enable_inverted_index_query=true/false`. 
+
+Inverted index and NGram BloomFilter index apply to different scenarios. This 
is how you decide which one is the optimal choice: 
+
+- **Non-primary key column queries**: These cases often involve widely 
scattered values and a low hit rate. **Inverted index** can work in conjunction 
with the built-in smart indexes in Doris to accelerate these queries. It has 
well-established support for scalar data types including characters, numerics, 
and datetime.
+- **Text searches on short texts**: If the dataset includes short texts that 
are highly diverse, **NGram BloomFilter** will be an effective choice for fuzzy 
matching (`LIKE`) on short texts. If the short texts are very similar (with 
lots of identical content), **inverted index** will be more efficient because 
it ensures a smaller dictionary and faster retrieval of the row numbers. 
+- **Text searches on long texts**: Inverted index is a better choice for long 
texts. Compared to brutal-force string matching, it largely reduces CPU 
resource consumption. 
+
+Inverted index has been available in Apache Doris for almost a year and stood 
the test of many users in their production environment with massive data. In 
future versions of Apache Doris regarding inverted index, we plan to add 
support for:
+
+- **Self-defined tokenization**: provides a user-defined tokenizer to fit in 
different use cases.
+- **More data types**: Users will be able to create inverted index for complex 
data types including Array and Map.
+
+If you encounter any issues while trying it out in Apache Doris or would like 
to know more details, join our 
[Slack](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog)
 community and talk to us!
\ No newline at end of file
diff --git a/blog/release-note-2.0.4.md b/blog/release-note-2.0.4.md
index 45e060ace6c..cac579c5dd3 100644
--- a/blog/release-note-2.0.4.md
+++ b/blog/release-note-2.0.4.md
@@ -6,7 +6,7 @@
     'author': 'Apache Doris',
     'tags': ['Release Notes'],
     'picked': "true",
-    'order': "1",
+    'order': "2",
     "image": '/images/2.0.4.png'
 }
 ---
@@ -37,10 +37,10 @@ Thanks to our community users and developers, about 333 
improvements and bug fix
 
 ## Behavior change
 - More reasonable and accurate precision and scale inference for decimal data 
type
-  [[improvement](decimal) use new way for decimal arithmetic precision 
promotion(#27787) #28034](https://github.com/apache/doris/pull/28034)
+  - [[improvement](decimal) use new way for decimal arithmetic precision 
promotion](https://github.com/apache/doris/pull/28034)
 
 - Support drop policy for user or role
-  [[fix](polixy)support drop policy for user or role 
#29488](https://github.com/apache/doris/pull/29488)
+  - [[fix](polixy)support drop policy for user or 
role](https://github.com/apache/doris/pull/29488)
 
 ## New features
 
diff --git a/src/components/recent-blogs/recent-blogs.data.ts 
b/src/components/recent-blogs/recent-blogs.data.ts
index 3300bf6d376..216f2e392bd 100644
--- a/src/components/recent-blogs/recent-blogs.data.ts
+++ b/src/components/recent-blogs/recent-blogs.data.ts
@@ -1,6 +1,6 @@
 export const RECENT_BLOGS_POSTS = [
     {
-        label: `The financial sector's choice: fast, secure, and highly 
available real-time data warehousing based on Apache Doris`,
+        label: `Financial data warehousing: fast, secure, and highly available 
with Apache Doris`,
         link: 
'https://doris.apache.org/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris',
     },
     {
diff --git a/src/constant/newsletter.data.ts b/src/constant/newsletter.data.ts
index 22abaff1be8..6f940f4a08f 100644
--- a/src/constant/newsletter.data.ts
+++ b/src/constant/newsletter.data.ts
@@ -1,4 +1,11 @@
 export const NEWSLETTER_DATA = [
+    {
+        tags: ['Tech Sharing'],
+        title: "A deep dive into inverted index: how it speeds up text 
searches by 40 times",
+        content: `This post is a deep dive into inverted index and NGram 
BloomFilter index, providing a hands-on guide to applying them for various 
queries.`,
+        to: 
'/blog/inverted-index-accelerates-text-searches-by-40-time-apache-doris',
+        image: 'how-inverted-index-accelerates-text-searches-by-40-times.png',
+    },
     {
         tags: ['Release Notes'],
         title: "Apache Doris 2.0.4 just released",
@@ -20,14 +27,14 @@ export const NEWSLETTER_DATA = [
         to: 
'/blog/apache-doris-speeds-up-data-reporting-tagging-and-data-lake-analytics',
         image: 'apache-doris-speeds-up-data-reporting-data-lake-analytics.jpg',
     },
-    {
-        tags: ['Top News'],
-        title: 'Apache Doris Summit Asia 2023: what can you expect from Apache 
Doris as a data warehouse?',
-        content:
-            'When it is cranberry and pumpkin season, we had the unforgettable 
Apache Doris Summit Asia 2023 with our remarkable committers, users, and 
community partners, to honor what we have achieved in the past year, and 
provide a preview of where we are going next.',
-        to: 
'/blog/apache-doris-summit-asia-2023-what-can-you-expect-from-apache-doris-as-a-data-warehouse',
-        image: 'newsletter-1.png',
-    },
+    // {
+    //     tags: ['Top News'],
+    //     title: 'Apache Doris Summit Asia 2023: what can you expect from 
Apache Doris as a data warehouse?',
+    //     content:
+    //         'When it is cranberry and pumpkin season, we had the 
unforgettable Apache Doris Summit Asia 2023 with our remarkable committers, 
users, and community partners, to honor what we have achieved in the past year, 
and provide a preview of where we are going next.',
+    //     to: 
'/blog/apache-doris-summit-asia-2023-what-can-you-expect-from-apache-doris-as-a-data-warehouse',
+    //     image: 'newsletter-1.png',
+    // },
     // {
     //     tags: ['Tech Sharing'],
     //     title: 'Introduction to Apache Doris: a next-generation real-time 
data warehouse',
diff --git a/static/images/data-storage-structure-in-apache-doris.png 
b/static/images/data-storage-structure-in-apache-doris.png
new file mode 100644
index 00000000000..071f0c69471
Binary files /dev/null and 
b/static/images/data-storage-structure-in-apache-doris.png differ
diff --git 
a/static/images/how-inverted-index-accelerates-text-searches-by-40-times.png 
b/static/images/how-inverted-index-accelerates-text-searches-by-40-times.png
new file mode 100644
index 00000000000..3b53b99e7d7
Binary files /dev/null and 
b/static/images/how-inverted-index-accelerates-text-searches-by-40-times.png 
differ
diff --git a/static/images/illustration-of-inverted-index.png 
b/static/images/illustration-of-inverted-index.png
new file mode 100644
index 00000000000..672bfb15b29
Binary files /dev/null and b/static/images/illustration-of-inverted-index.png 
differ
diff --git a/static/images/illustration-of-ngram-bloomfilter.png 
b/static/images/illustration-of-ngram-bloomfilter.png
new file mode 100644
index 00000000000..ec6873c067a
Binary files /dev/null and 
b/static/images/illustration-of-ngram-bloomfilter.png differ
diff --git a/static/images/sample-dataset.png b/static/images/sample-dataset.png
new file mode 100644
index 00000000000..5eb9c6e00f3
Binary files /dev/null and b/static/images/sample-dataset.png differ


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(doris-website) branch master updated: [blog](update) Update Blog : How Inverted Index Accelerates Text Searches by 40 Times (#400)

Reply via email to