This is an automated email from the ASF dual-hosted git repository. jiafengzheng pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/incubator-doris-website.git
The following commit(s) were added to refs/heads/master by this push:
     new 39fe56efe8f remove

39fe56efe8f is described below

commit 39fe56efe8f8961190e6f1568bce5d63469ceef2
Author: jiafeng.zhang <zhang...@gmail.com>
AuthorDate: Fri Jun 10 17:08:16 2022 +0800

    remove

    remove
---
 .../doris-storage-reader-compaction.md             | 228 ----------------
 .../DorisInternals/doris-storage-struct-design.md  | 303 ---------------------
 .../DorisInternals/doris-storage-writer-delete.md  | 150 ----------
 3 files changed, 681 deletions(-)

diff --git a/blogs/en/DorisInternals/doris-storage-reader-compaction.md b/blogs/en/DorisInternals/doris-storage-reader-compaction.md
deleted file mode 100644
index 7672ad38eb8..00000000000
--- a/blogs/en/DorisInternals/doris-storage-reader-compaction.md
+++ /dev/null
@@ -1,228 +0,0 @@
---
{
    "title": "Apache Doris Storage Layer Design, Part 3: Read Process and Compaction Analysis",
    "description": "This article introduces in detail the read process inside the Doris system, as well as the Compaction process that merges Rowset versions in the background.",
    "date": "2022-05-20",
    "metaTitle": "Apache Doris Storage Layer Design, Part 3: Read Process and Compaction Analysis",
    "isArticle": true,
    "language": "en",
    "author": "ApacheDoris",
    "layout": "Article",
    "sidebar": false,
    "categories": "DorisInternals",
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Apache Doris Storage Layer Design, Part 3: Read Process and Compaction Analysis

## 1 Overall introduction

Doris is an interactive SQL data warehouse based on the MPP architecture, mainly used for near real-time reporting and multi-dimensional analysis. Doris's efficient import and query performance is inseparable from the sophisticated design of its storage structure.

This article analyzes the implementation principles of the storage layer by reading the code of the Doris BE module, and explains the core technology behind Doris's efficient write and query capabilities, covering Doris's columnar storage design, index design, data read and write processes, the Compaction process, and related functions.

This article introduces in detail the read process inside the Doris system, as well as the Compaction process that merges Rowset versions in the background.

## 2 Read process

### 2.1 Overall read process

The read process is the reverse of the write process, but it is considerably more complicated, mainly because of the large number of read optimizations.
The entire read process is divided into two stages: the init process, and the process of obtaining the next_block data blocks. The specific process is shown in the following figure:

The hierarchical relationship is as follows (a minimal code sketch of this layering appears at the end of section 2.2):

- OlapScanner encapsulates the overall read operation on one tablet's data.
- Reader processes the read parameters and provides differentiated read handling for the three data models.
- CollectIterator holds the multiple RowsetReaders of the tablet, which are ordered by version. CollectIterator merges these RowsetReaders into one unified iterator and provides the merge comparator.
- RowsetReader is responsible for reading one Rowset.
- RowwiseIterator provides a unified iterator over all the Segments in one Rowset. The merge strategy here can be Merge or Union, depending on whether the data is sorted.
- SegmentIterator corresponds to the read of one Segment. It computes the row numbers to read from the query conditions and the indexes, seeks to the corresponding pages, and reads the data. After condition filtering, a bitmap of the accessible rows is generated; BitmapRangeIterator is a separate iterator that walks this bitmap by ranges.
- ColumnIterator provides a unified iterator over one column's data and indexes. ColumnReader and the various IndexReaders correspond to the reading of the actual data and index information.

### 2.2 The main process of the read Init stage

The execution flow of the initialization stage is as follows:

#### 2.2.1 OlapScanner query parameter construction

Find the RowsetReaders that need to be read according to the version specified by the query (this relies on the rowset_graph version path map of version management to obtain the shortest path covering the queried version range), then:

1. Set the query information, including \_tablet, the read type reader_type=READER_QUERY, whether to aggregate, and \_version (from 0 to the specified version);
2. Set the query condition information, including the filter fields and the is_nulls field;
3. Set the returned column information;
4. Set the key_ranges of the query (an array of key ranges, which can be filtered by the short key index);
5. Initialize the Reader object.

#### 2.2.2 Reader's Init process

1. Initialize the conditions query-condition object;
2. Initialize the bloomFilter column set (eq and in conditions on columns with a BloomFilter index);
3. Initialize the delete_handler, which contains all the delete information of the tablet, including the versions and the corresponding arrays of delete conditions;
4. Initialize the columns to be read and returned to the upper layer, including the returned columns and the columns referenced by the condition object;
5. Initialize the RowCursor row-cursor objects corresponding to the start key and end key of key_ranges;
6. Build the RowsetReaders and the CollectIterator from the constructed information: the Rowset objects are initialized, and each RowsetReader is added to the CollectIterator;
7. Call CollectIterator to fetch the current row (actually the first row here), which starts the read process and performs the first read.
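The layering above is essentially a stack of nested iterators. The following is a minimal C++ sketch of that layering; the class names mirror the description, but all signatures and bodies are simplified stand-ins for illustration, not the real BE code.

```cpp
#include <memory>
#include <vector>

struct RowBlock {};  // stand-in for a batch of rows

// Common shape shared by the layers described above (simplified).
class RowIterator {
public:
    virtual ~RowIterator() = default;
    virtual bool next(RowBlock* out) = 0;  // false when exhausted
};

// One segment: resolves row ranges via the indexes, then reads pages.
class SegmentIterator : public RowIterator {
public:
    bool next(RowBlock* /*out*/) override { /* read next row range */ return false; }
};

// One rowset: unions (ordered segments) or merges (unordered) its segments.
class RowsetReader : public RowIterator {
    std::vector<std::unique_ptr<SegmentIterator>> segs_;
public:
    bool next(RowBlock* /*out*/) override { /* delegate to RowwiseIterator */ return false; }
};

// Whole tablet: merges all RowsetReaders in key/version order.
class CollectIterator : public RowIterator {
    std::vector<std::unique_ptr<RowsetReader>> rowsets_;
public:
    bool next(RowBlock* /*out*/) override { /* heap-based merge */ return false; }
};
```

Because every layer exposes the same next interface, the Reader can treat a tablet, a rowset, and a segment uniformly.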
#### 2.2.3 Init process of RowsetReader

Build the SegmentIterators, and filter out the delete conditions in delete_handler whose versions are smaller than the current Rowset version;

Build a RowwiseIterator (an aggregating iterator over the SegmentIterators) and add the SegmentIterators to be read to it. When all segments are ordered overall, the sequential Union iterator is used; otherwise, the merging Merge iterator is used.

#### 2.2.4 Init process of SegmentIterator

1. Initialize the ReadableBlock, the object used to read the current Segment file and perform the actual file reads;
2. Initialize \_row_bitmap, a bitmap structure that stores the row numbers filtered by the indexes;
3. Build a ColumnIterator for each column that needs to be read;
4. If a column has a BitmapIndex, initialize its BitmapIndexIterator;
5. Filter data by the SortkeyIndex. When the query has key_ranges, obtain the row-number range of the matching data through each key_range: according to the upper and lower keys of the key_range, find the corresponding row numbers upper_rowid and lower_rowid through the Segment's SortkeyIndex, and merge the resulting RowRanges into \_row_bitmap;
6. Filter data by the various column indexes, using both the query conditions and the delete conditions:
   - According to the query conditions, use the bitmap index to filter the condition columns that have one, and intersect the resulting row-number lists with \_row_bitmap. Because this filtering is exact, the handled conditions are removed from the Condition object.
   - Use the BloomFilter index to filter data according to the equality conditions (eq, in, is) in the query. For each Page, judge whether the condition can hit the Page, and intersect the row-number ranges of the surviving Pages with \_row_bitmap.
   - Use the ZoneMapIndex to filter data according to the query conditions and the delete conditions, finding the Pages that can intersect the conditions by checking each Page's entry in the ZoneMap. The row-number ranges of the matching Pages are intersected with \_row_bitmap.
7. Use \_row_bitmap to construct a BitmapRangeIterator for the subsequent data reads.

### 2.3 The main process of the read next stage

The execution flow of the next stage is as follows:

#### 2.3.1 Reader reads next_row_with_aggregation

The reader prefetches one row, recorded as the current row. When next is called, the current row is returned, and the following row is prefetched as the new current row.

The reader's read logic is divided into three cases according to the data model (a simplified sketch of these loops follows this section):

Under \_dup_key_next_row (duplicate key model, detailed data), the current row is returned, and then CollectIterator is read directly for the next row, which becomes the new current row.

Under \_agg_key_next_row (aggregate key model), after CollectIterator fetches the next row, the reader checks whether that row has the same key as the current row: if so, it aggregates the row into the current result and continues with the following row in a loop; if not, it returns the accumulated aggregation result and updates the current row.

Under \_unique_key_next_row (unique key model), the logic is the same as for the aggregate key model, with one difference: because the delete operation is supported, the reader checks whether the aggregated current row is marked as a deleted row. Deleted rows are discarded, and the loop continues until a row that is not deleted is found and returned.
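The following is a minimal sketch of the aggregate-key and unique-key loops described above, assuming a toy one-key, one-value row and using SUM as a stand-in for the per-column aggregation function; the real reader works on row batches and applies each column's configured aggregation.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Toy row: one key column, one value column, plus a delete mark.
struct Row { int64_t key; int64_t val; bool delete_mark = false; };

// Sketch of _agg_key_next_row / _unique_key_next_row over a key-ordered stream.
class AggReader {
public:
    AggReader(std::vector<Row> rows, bool unique_key_model)
        : rows_(std::move(rows)), unique_(unique_key_model) {}

    std::optional<Row> next() {
        while (pos_ < rows_.size()) {
            Row acc = rows_[pos_++];
            // Fold all following rows that share the same key (SUM stands in
            // for the real per-column aggregation function).
            while (pos_ < rows_.size() && rows_[pos_].key == acc.key) {
                acc.val += rows_[pos_].val;
                acc.delete_mark = rows_[pos_].delete_mark;  // latest version wins
                ++pos_;
            }
            // Unique key model: skip keys whose final state is "deleted".
            if (unique_ && acc.delete_mark) continue;
            return acc;
        }
        return std::nullopt;
    }

private:
    std::vector<Row> rows_;
    bool unique_;
    size_t pos_ = 0;
};
```

With unique_key_model set, a key whose latest version carries a delete mark is skipped entirely, matching the behavior described for \_unique_key_next_row.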
#### 2.3.2 CollectIterator reads next

CollectIterator uses a heap to maintain the set of RowsetReaders to be read. The comparison rule is: order by the key of each RowsetReader's current row; when the keys are equal, compare the Rowset versions. (A minimal sketch of this k-way merge appears at the end of section 2.3.)

1. CollectIterator pops the topmost RowsetReader from the heap;
2. The popped RowsetReader reads a new row as its current row and is pushed back into the heap for comparison. During this read, the RowsetReader's nextBlock is called to read by RowBlock. (If the fetched block is a page with partial deletes, the current row is also filtered by the delete conditions.)
3. The current row of the RowsetReader now at the top of the heap is returned as the current row.

#### 2.3.3 RowsetReader reads next

RowsetReader directly reads the next_batch of its RowwiseIterator.

RowwiseIterator aggregates the SegmentIterators. When the Segments in the Rowset are ordered overall, the iterator returns rows directly in Union mode; when they are unordered, it returns rows in Merge mode. RowwiseIterator likewise returns the row of the SegmentIterator at the top of the merge order, calling each SegmentIterator's next_batch to fetch the data.

#### 2.3.4 SegmentIterator reads next_batch

1. According to the BitmapRangeIterator constructed in the init stage, use next_range to take out one row-number range (range_from, range_to) to read each time;
2. First read the data of the condition columns for rows range_from through range_to, as follows:
   - Call seek_to_ordinal on each condition column's ColumnIterator, positioning each column's read position current_rowid to the SegmentIterator's cur_rowid. This aligns to the corresponding data page by binary-searching the ordinal_index.
   - Read the condition columns' data and apply the conditions once more (this time as an exact filter).
3. Then read the data of the non-condition columns, put it into the RowBlock, and return the RowBlock.
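The merge performed by CollectIterator in 2.3.2 (and the Merge mode of RowwiseIterator in 2.3.3) is a standard k-way merge over key-ordered child streams. Below is a minimal sketch with std::priority_queue, assuming rows compare by (key, version); the Child type stands in for a RowsetReader together with its prefetched current row.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

struct Row { int64_t key; int64_t version; };

// One key-ordered child stream (a RowsetReader in the real code).
struct Child {
    std::vector<Row> rows;
    size_t pos = 0;
    bool exhausted() const { return pos >= rows.size(); }
    const Row& cur() const { return rows[pos]; }
};

// K-way merge over the children, ordered by (key, version) of each current row.
std::vector<Row> collect_merge(std::vector<Child>& children) {
    auto greater = [](const Child* a, const Child* b) {
        if (a->cur().key != b->cur().key) return a->cur().key > b->cur().key;
        return a->cur().version > b->cur().version;
    };
    std::priority_queue<Child*, std::vector<Child*>, decltype(greater)> heap(greater);
    for (auto& c : children) if (!c.exhausted()) heap.push(&c);

    std::vector<Row> out;
    while (!heap.empty()) {
        Child* top = heap.top();                // child owning the next row in order
        heap.pop();
        out.push_back(top->cur());
        ++top->pos;                             // advance its current row
        if (!top->exhausted()) heap.push(top);  // re-enter the heap for comparison
    }
    return out;
}
```

Popping a child, advancing its current row, and pushing it back is exactly the cycle described in 2.3.2.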
## 3 Compaction process

### 3.1 Overall introduction of Compaction

Doris improves the read performance of incrementally written Rowset files through Compaction. The version information of a Rowset contains two fields, first and second, which represent the merged version range of the Rowset; the first and second versions of an unmerged cumulative rowset are equal. During Compaction, adjacent Rowsets are merged to generate a new Rowset, and the first and second fields of the version information are merged into one larger version. On the other hand, [...]

As shown in the figure above, there are two types of Compaction tasks: base compaction and cumulative compaction. The cumulative_point is the key that divides the two strategies.

It can be understood this way: to the right of cumulative_point are the incremental Rowsets that have never been merged, whose first and second versions are equal; to the left of cumulative_point are the merged Rowsets, whose first version is not equal to the second version.

The base compaction and cumulative compaction task processes are basically the same; the only difference lies in the logic of selecting the InputRowsets to be merged.

### 3.2 Detailed process of Compaction

The overall process of a Compaction merge is shown in the following figure:

#### 3.2.1 Calculate cumulative_point

Select the set of InputRowsets to be merged by the compaction (a sketch of these selection policies follows the lists below):

Base compaction selection conditions:

1. When there are more than 5 non-cumulative rowsets, merge all non-cumulative rowsets;
2. When the disk-usage ratio between the base rowset (the one whose version first is 0) and the other non-cumulative rowsets is less than 10:3, merge all non-cumulative rowsets;
3. Otherwise, do not merge.

Cumulative compaction selection conditions:

1. When the number of segments in the selected Rowset set is greater than or equal to 5 and less than or equal to 1000 (configurable), merge;
2. When the number of output Rowsets is less than 5, but a delete condition's version is greater than a Rowset's second version, merge (so that Rowsets affected by deletes are merged in quickly);
3. When both the accumulated base compaction time and the accumulated cumulative compaction time are greater than 1 day, merge;
4. Otherwise, do not merge.
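A minimal sketch of the two selection policies above, with the thresholds (5 rowsets, the 10:3 disk-usage ratio, 5 to 1000 segments) hard-coded for illustration; the real BE reads these from configuration and also applies the delete-version and 1-day conditions, which are omitted here.

```cpp
#include <cstdint>
#include <vector>

struct Rowset {
    int64_t first, second;   // version range [first, second]
    int64_t bytes;           // disk usage
    int64_t num_segments;
};

// Base compaction: merge all rowsets below cumulative_point when there are
// many of them, or when the base rowset no longer dominates them in size.
bool should_base_compact(const std::vector<Rowset>& non_cumulative) {
    if (non_cumulative.size() > 5) return true;
    int64_t base_bytes = 0, other_bytes = 0;
    for (const auto& rs : non_cumulative) {
        (rs.first == 0 ? base_bytes : other_bytes) += rs.bytes;
    }
    // base : others falling below 10 : 3 triggers a merge
    return other_bytes > 0 && base_bytes * 3 < other_bytes * 10;
}

// Cumulative compaction: merge when the candidates carry enough segments,
// bounded above to cap the task size.
bool should_cumulative_compact(const std::vector<Rowset>& candidates) {
    int64_t segments = 0;
    for (const auto& rs : candidates) segments += rs.num_segments;
    return segments >= 5 && segments <= 1000;
}
```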
#### 3.2.2 Execute compaction

Compaction execution can basically be understood as a read process plus a write process: a Reader is opened over the InputRowsets to be merged, records are read from it through next_row_with_aggregation and written to the output RowsetWriter, producing a new OutputRowset. The version of this Rowset is the full range of the InputRowsets' versions.

#### 3.2.3 Update cumulative_point

Update cumulative_point, and pass the OutputRowset produced by the cumulative compaction on to the subsequent base compaction process.

After Compaction, rows that have the same key but were scattered across different Rowsets in the aggregate key and unique key models are merged, achieving the effect of pre-computation. At the same time, the number of Rowset files is reduced, which improves query efficiency.

## 4 Summary

This article introduced the read-related processes of the Doris storage layer in detail.

The read path relies on the complete columnar storage implementation. For OLAP wide-table scenarios (reading many rows and few columns), it can quickly scan and filter based on the various indexes (including short key, bloom filter, zone map, and bitmap), skipping large amounts of data scanning, and applies optimizations such as late materialization for data analysis in various scenarios; the Compaction execution process is also optimiz [...]

diff --git a/blogs/en/DorisInternals/doris-storage-struct-design.md b/blogs/en/DorisInternals/doris-storage-struct-design.md
deleted file mode 100644
index 7c3e501c9fa..00000000000
--- a/blogs/en/DorisInternals/doris-storage-struct-design.md
+++ /dev/null
@@ -1,303 +0,0 @@
---
{
    "title": "Apache Doris Storage Layer Design, Part 1: Storage Structure Analysis",
    "description": "This article analyzes the implementation principles of the Doris BE storage layer and explains the core technology behind Doris's efficient write and query capabilities, covering Doris's columnar storage design, index design, data read and write processes, the Compaction process, version management of Tablets and Rowsets, data backup, and related functions.",
    "date": "2022-05-20",
    "metaTitle": "Apache Doris Storage Layer Design, Part 1: Storage Structure Analysis",
    "isArticle": true,
    "language": "en",
    "author": "ApacheDoris",
    "layout": "Article",
    "sidebar": false,
    "categories": "DorisInternals",
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Apache Doris Storage Layer Design, Part 1: Storage Structure Analysis

## 1 Overall introduction

Doris is an interactive SQL data warehouse based on the MPP architecture, mainly used for near real-time reporting and multi-dimensional analysis. Doris's efficient import and query performance is inseparable from the sophisticated design of its storage structure.

This article analyzes the implementation principles of the storage layer by reading the code of the Doris BE module, and explains the core technology behind Doris's efficient write and query capabilities, covering Doris's columnar storage design, index design, data read and write processes, the Compaction process, version management of Tablets and Rowsets, data backup, and related functions.

This article introduces the storage layer structure of the Segment V2 format, including rich features such as ordered storage, sparse indexes, prefix indexes, bitmap indexes, and BloomFilters, which provide extremely fast query capabilities for various complex scenarios.

## 2 Design goals

- Bulk import, few updates
- The vast majority of requests are reads
- Wide-table scenarios: read many rows, few columns
- Non-transactional scenarios
- Good scalability

## 3 Storage file format

### 3.1 Storage directory structure

The storage layer manages its data through the storage_root_path configuration, and multiple paths can be configured. The next level of the storage directory is organized by buckets, and the specific tablets are stored under the bucket directories, in subdirectories named by tablet_id.

Segment files are stored in the tablet_id directory and managed by SchemaHash. There can be multiple Segment files; they are generally split by size, 256MB by default. The segment v2 file naming rule is: ${rowset_id}_${segment_id}.dat.
The specific storage directory format is shown in the following figure:

### 3.2 Segment v2 file structure

The overall file format of a Segment is divided into three parts, the data region, the index region, and the footer, as shown in the following figure:

- Data Region: stores the data of each column; the data here is loaded page by page, on demand.
- Index Region: Doris stores the index data of each column together in the Index Region. This data is loaded at column granularity, so it is stored separately from the column data.
- Footer information:
  - SegmentFooterPB: defines the metadata of the file;
  - a 4-byte checksum of the FooterPB content;
  - 4 bytes holding the length of the FileFooterPB message, used to read the FileFooterPB.

The following sections introduce the storage format of each part.

## 4 Footer information

The Footer segment sits at the end of the file. It stores the overall structure of the file, including the locations of the data region and the index region, and consists of 4 parts: SegmentFooterPB, CheckSum, Length, and MAGIC CODE.

The SegmentFooterPB data structure is as follows:

SegmentFooterPB is stored in PB format and mainly includes the meta information of the columns, the meta information of the indexes, the short key index information of the segment, and the total number of rows.

### 4.1 Column meta information

- ColumnId: the ordinal of the column in the schema
- UniqueId: a globally unique id
- Type: the type information of the column
- Length: the length information of the column
- Encoding: the encoding format
- Compression: the compression format
- Dict PagePointer: dictionary information

### 4.2 Column index meta information

- OrdinalIndex: stores the sparse index meta information of the column.
- ZoneMapIndex: stores the meta information of the ZoneMap index, including the maximum value, the minimum value, whether null values exist, and whether non-null values exist. SegmentZoneMap stores the global ZoneMap information, and PageZoneMaps stores the statistics of each page.
- BitMapIndex: stores the meta information of the BitMap index, including the BitMap type, the dictionary data, and the BitMap data.
- BloomFilterIndex: stores the BloomFilter index information.

To prevent the index data itself from growing too large, ZoneMapIndex, BitMapIndex, and BloomFilterIndex adopt two-level Page management, corresponding to the IndexColumnMeta structure. When the index fits in one Page, that Page directly stores the index data, i.e. a one-level structure is used; when it does not fit, the index data is written into new Pages, and the Root Page stores the address information of those data Pages.
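Given the four footer parts above, a reader can locate the SegmentFooterPB by reading backwards from the end of the file. The sketch below assumes the tail byte order FooterPB, checksum(4), length(4), magic(4), a little-endian encoding, and the magic value "D0R1"; these details are assumptions for illustration, not verified against the BE code.

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Assumed tail layout: ... | FooterPB | checksum(4) | length(4) | magic(4)
bool read_segment_footer(const std::string& path, std::vector<char>* footer_pb) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return false;
    const int64_t file_size = static_cast<int64_t>(in.tellg());
    if (file_size < 12) return false;

    char tail[12];
    in.seekg(file_size - 12);
    in.read(tail, 12);

    uint32_t checksum = 0, length = 0;
    std::memcpy(&checksum, tail, 4);     // checksum of the FooterPB content
    std::memcpy(&length, tail + 4, 4);   // size of the serialized FooterPB
    if (std::memcmp(tail + 8, "D0R1", 4) != 0) return false;  // magic (assumed value)

    if (file_size < 12 + static_cast<int64_t>(length)) return false;
    footer_pb->resize(length);
    in.seekg(file_size - 12 - length);
    in.read(footer_pb->data(), length);  // caller parses this as SegmentFooterPB
    // (checksum verification omitted in this sketch)
    return true;
}
```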
## 5 Ordinal Index

The Ordinal Index maps row numbers to the physical addresses of the Column Data Pages. It aligns the column-stored data by row and can be understood as a first-level index: when the other indexes look up data, the Ordinal Index is used to find the location of the data Page. It is therefore introduced first.

In a segment, data is always stored in the sorted order of the keys (AGGREGATE KEY, UNIQ KEY, and DUPLICATE KEY); that is, the key ordering determines the physical layout of the data, including the physical order of the column data. When writing data, the Column Data Pages are managed by the Ordinal index, which records each Column Data Page's position offset, size, and the row number of its first data item, namely its ordinal. In this way, each colu [...]

### 5.1 Storage structure

The Ordinal index meta information of each column is stored in the OrdinalIndexMeta of the SegmentFooterPB. The specific structure is shown in the following figure:

OrdinalIndexMeta stores the root page address of the index data. An optimization is applied here: when the index data has only one page, the address can point directly to that only data page; when one page is not enough, it points to a second-level index page of the OrdinalIndex type, and each entry of the index data records a Column Data Page's offset, size, and ordinal row number. The Ordinal index index granularity [...]

## 6 Column data storage

### 6.1 Data Page storage structure

A DataPage is mainly divided into two parts: the Data part and the Page Footer.

The Data part stores the column data of the current Page. When null values are allowed, a separate null-value Bitmap is stored, recording the row numbers of the null values through bool-typed RLE encoding.

The Page Footer contains Type (the Page type), UncompressedSize (the size of the uncompressed data), FirstOrdinal (the RowId of the first row of the current Page), NumValues (the number of rows in the current Page), and NullMapSize (the size of the NullBitmap).

### 6.2 Data compression

Different encodings are used for different field types. The default correspondences between types and encodings are as follows:

The data is compressed in LZ4F format by default.

## 7 Short Key Index

### 7.1 Storage structure

The Short Key Index is a prefix index for quickly finding data by a given prefix of columns, based on the key ordering (AGGREGATE KEY, UNIQ KEY, and DUPLICATE KEY). It is also a sparse index: during data writing, an index item is generated every fixed number of rows. The index granularity is 1024 rows by default and is configurable. The process is shown in the following figure:

KeyBytes stores the index item data, and OffsetBytes stores each index item's offset within KeyBytes.

### 7.2 Index generation rules

The Short Key Index uses the first 36 bytes of a row as its prefix index entry. The prefix is simply truncated when a VARCHAR type is encountered (a small sketch of this rule follows the examples below).

### 7.3 Application cases

(1) For the following table structure, the prefix index is user_id (8 Bytes) + age (4 Bytes) + message (prefix, 24 Bytes).

(2) For the following table structure, the prefix index is user_name (20 Bytes). Even though 36 bytes are not reached, because a VARCHAR is encountered, the prefix is truncated directly and does not continue further.

When the query condition is a prefix of the prefix index, the query can be greatly accelerated. For example, on the first table, we execute the following query:

```sql
SELECT * FROM table WHERE user_id=1829239 and age=20;
```

This query will be much more efficient than the following one:

```sql
SELECT * FROM table WHERE age=20;
```

Therefore, when creating a table, choosing the correct column order can greatly improve query efficiency.
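A minimal sketch of the 36-byte prefix rule in 7.2, with a simplified column model; how the BE actually serializes fixed-length and VARCHAR key columns is more involved than this.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct ColumnValue {
    std::string bytes;   // the encoded column value
    bool is_varchar;
};

// Build one short key entry: concatenate leading key columns up to 36 bytes,
// truncating (and stopping entirely) at the first VARCHAR column.
std::string build_short_key(const std::vector<ColumnValue>& key_columns,
                            size_t max_bytes = 36) {
    std::string prefix;
    for (const auto& col : key_columns) {
        size_t room = max_bytes - prefix.size();
        if (room == 0) break;
        if (col.is_varchar) {
            // VARCHAR: take at most the remaining room, then stop.
            prefix.append(col.bytes, 0, std::min(room, col.bytes.size()));
            break;
        }
        if (col.bytes.size() > room) {   // fixed-length column does not fit
            prefix.append(col.bytes, 0, room);
            break;
        }
        prefix.append(col.bytes);
    }
    return prefix;
}
```

For table (1), user_id (8 bytes) and age (4 bytes) fit whole and message contributes the remaining 24 bytes; for table (2), the VARCHAR user_name stops the prefix at 20 bytes.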
## 8 ZoneMap Index

The ZoneMap index stores statistics for the Segment and for each Page of each column. These statistics help speed up queries and reduce the amount of scanned data. The statistics are Min (the minimum value), Max (the maximum value), HasNull (whether null values exist), and HasNotNull (whether non-null values exist).

### 8.1 Storage structure

The index storage structure of the ZoneMap is shown in the following figure:

In the SegmentFooterPB structure, each column's index metadata ColumnIndexMeta stores the ZoneMapIndex data of that column. ZoneMapIndex has two parts, SegmentZoneMap and PageZoneMaps: SegmentZoneMap stores the global ZoneMap index information of the current Segment, and PageZoneMaps stores the ZoneMap index information of each Data Page.

PageZoneMaps stores the Page information of the index data in an IndexedColumnMeta structure. In the current implementation there is no compression, and the encoding is Plain. The OrdinalIndexPage in IndexedColumnMeta points to the offset and size of the root page of the index data. The two-level Page optimization also applies here: when there is only one DataPage, OrdinalIndexMeta points directly to this DataPage; when there are multiple DataPages, Ordi [...]

### 8.2 Index generation rules

Doris enables the ZoneMap index for the key columns by default; when the table model is DUPLICATE, the ZoneMap index is enabled for all fields. When column data is written to a Page, the data is compared automatically, and the ZoneMap index information of the current Segment and of the current Page is maintained continuously.

### 8.3 Application cases

During a query, fields filtered by range conditions use the ZoneMap statistics to select the data ranges to scan. For example, in case 1, with a filter on the age field, the query statement is as follows:

```sql
SELECT * FROM table WHERE age > 20 and age < 1000
```

If the Short Key Index is not hit, the ZoneMap index is used to find the ordinal range of the data that should be scanned according to the age conditions, reducing the number of Pages to scan.

## 9 BloomFilter

Doris provides a BloomFilter index for fields with high cardinality that cannot use the Short Key Index.

### 9.1 Storage structure

The storage structure of the BloomFilter is shown in the following figure:

BloomFilterIndex stores the Hash strategy, the Hash algorithm, and the Page information of the corresponding BloomFilter data. The Hash algorithm is HASH_MURMUR3, the Hash strategy is the BlockSplitBloomFilter block implementation, and the expected false positive rate fpp is 0.05 by default (configurable).

The storage of the data Pages of the BloomFilter index is similar to that of the ZoneMapIndex, including the two-level Page optimization, and is not described again here.

### 9.2 Index generation rules

BloomFilters are generated at Page granularity: when a full Page of data has been written, Doris generates the BloomFilter index data of this Page according to the Hash strategy. Currently the bloom filter does not support the tinyint/hll/float/double types; the other types are supported. To use it, specify the fields to be indexed in bloom_filter_columns in PROPERTIES.
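Conceptually, each Page's BloomFilter answers "might this value appear in this Page?". Below is a minimal page-pruning sketch with a toy double-probe bloom filter; the real BE uses HASH_MURMUR3 and the BlockSplitBloomFilter layout named above, which this sketch does not reproduce.

```cpp
#include <bitset>
#include <functional>
#include <string>
#include <vector>

// Toy bloom filter: two std::hash-derived probes into a fixed bitset.
struct PageBloomFilter {
    std::bitset<4096> bits;
    void add(const std::string& v) {
        size_t h = std::hash<std::string>{}(v);
        bits.set(h % bits.size());
        bits.set((h >> 16) % bits.size());
    }
    bool might_contain(const std::string& v) const {
        size_t h = std::hash<std::string>{}(v);
        return bits.test(h % bits.size()) && bits.test((h >> 16) % bits.size());
    }
};

// Keep only the pages whose filter might contain the predicate value
// (an eq/in condition); all other pages are skipped without being read.
std::vector<int> prune_pages(const std::vector<PageBloomFilter>& page_filters,
                             const std::string& value) {
    std::vector<int> candidate_pages;
    for (int i = 0; i < static_cast<int>(page_filters.size()); ++i) {
        if (page_filters[i].might_contain(value)) candidate_pages.push_back(i);
    }
    return candidate_pages;
}
```

A miss proves the value is absent from the Page; a hit only means the Page must still be read and filtered exactly, which is why the false positive rate matters.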
### 9.3 Application cases

When querying, the conditions are filtered on the fields with a bloom filter. When the bloom filter misses, it means the Page contains no such data, which reduces the number of Pages to scan.

Case: the schema of the table is as follows:

The query SQL is as follows:

```sql
SELECT * FROM table WHERE name = 'Zhang San'
```

Because name has high cardinality, a BloomFilter index, PROPERTIES ( "bloom_filter_columns" = "name" ), is added on the name column to improve the query performance of this SQL. At query time, the BloomFilter index can filter out a large number of Pages.

## 10 Bitmap Index

Doris also provides a BitmapIndex to speed up data queries.

### 10.1 Storage structure

The Bitmap storage format is as follows:

The meta information of the BitmapIndex is also stored in the SegmentFooterPB. The BitmapIndex includes three parts: the BitMap type, the dictionary information DictColumn, and the bitmap index data BitMapColumn. DictColumn and BitMapColumn correspond to the IndexedColumnData structure and store the Page address offsets and sizes of the dictionary data and the index data respectively. The two-level Page optimization also applies here and is not explained again.

The difference from the other index storage structures is that the DictColumn dictionary data is LZ4F-compressed, and the first value of each Data Page is stored when recording the second-level Page offsets.

### 10.2 Index generation rules

A BitMap index is created through CREATE INDEX. The Bitmap index covers the Column field of the entire Segment, rather than generating a separate copy per Page. When writing data, a map structure is maintained to record the row numbers corresponding to each key value, and Roaring bitmaps are used to encode the rowids. The main structure is as follows:

When generating the index data, the dictionary data is written first: the keys of the map structure are written into DictColumn. Then each key's Roaring-encoded rowid bitmap is written into BitMapColumn as bytes.

### 10.3 Application cases

When querying, bitmap indexes can optimize columns with low selectivity and small cardinality, such as gender, marital status, and geographic information.

Case: the schema of the table is as follows:

The query SQL is as follows:

```sql
SELECT * FROM table WHERE city in ("Beijing", "Shanghai")
```

Since city has relatively few distinct values, once the data dictionary and the bitmaps are built, matching rows can be found quickly by scanning the bitmaps. After bitmap compression, the data volume itself is small, so the entire column can be matched exactly while scanning little data.

## 11 Index query process

When querying data in a Segment, the data is first filtered through the field indexes according to the query conditions, and then read. The overall query process is as follows:

1. First, a row_bitmap is constructed according to the number of rows in the Segment, recording which rows need to be read. If no index is used, all rows need to be read.
2. When a key is used in the query condition and it follows the prefix-index rule, the ShortKey Index is filtered first, and the ordinal row range matched in the ShortKey Index is intersected into row_bitmap.
3. When a column field in the query condition has a BitMap index, the ordinal row numbers that satisfy the condition are found directly from the BitMap index and intersected with row_bitmap. Because this filtering is exact, the condition is removed from the query conditions, and this field is not filtered again by the later indexes.
4. When a column field in the query condition has a BloomFilter index and the condition is an equality (eq, in, is), the data is filtered through the BloomFilter index: all the index entries are walked, the BloomFilter of each Page is probed, and all the Pages that the condition might hit are found. The ordinal row ranges in their index information are intersected with row_bitmap.
5. When a column field in the query condition has a ZoneMap index, the data is filtered through the ZoneMap index: all the index entries are walked, and all the Pages whose ZoneMap can intersect the condition are found. The ordinal row ranges in their index information are intersected with row_bitmap.
6. After row_bitmap is generated, the specific Data Pages are found in batches through the OrdinalIndex of each Column.
7. The Column Data Pages of each column are read in batches. When reading a Page with null values, whether the current row is null is judged by the null-value bitmap; nulls are filled in directly.

A simplified sketch of this bitmap combination follows.
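The steps above amount to intersecting row sets produced by several indexes into one bitmap and then reading only those rows. A minimal sketch using std::set&lt;uint32_t&gt; as a stand-in for the Roaring bitmap the BE uses, treating each index lookup as an optional pre-computed row set; all names are illustrative.

```cpp
#include <cstdint>
#include <set>

using RowSet = std::set<uint32_t>;  // stand-in for a roaring bitmap of row ids

RowSet intersect(const RowSet& a, const RowSet& b) {
    RowSet out;
    for (uint32_t r : a) if (b.count(r)) out.insert(r);
    return out;
}

// Sketch of the row_bitmap construction in section 11. Each argument stands
// for one index lookup described above (nullptr when that index is unused).
RowSet build_row_bitmap(uint32_t num_rows,
                        const RowSet* short_key_rows,     // step 2
                        const RowSet* bitmap_index_rows,  // step 3
                        const RowSet* bloom_page_rows,    // step 4
                        const RowSet* zonemap_page_rows)  // step 5
{
    RowSet rows;
    for (uint32_t r = 0; r < num_rows; ++r) rows.insert(r);  // step 1: all rows
    if (short_key_rows)    rows = intersect(rows, *short_key_rows);
    if (bitmap_index_rows) rows = intersect(rows, *bitmap_index_rows);
    if (bloom_page_rows)   rows = intersect(rows, *bloom_page_rows);
    if (zonemap_page_rows) rows = intersect(rows, *zonemap_page_rows);
    return rows;  // steps 6-7 then read only these rows via the ordinal index
}
```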
## 12 Summary

Doris adopts a complete columnar storage structure and provides rich indexes for different query scenarios, laying a solid foundation for its efficient write and query performance. The Doris storage layer is designed to be flexible, so functions such as new indexes and enhanced data deletion can be added in the future.

diff --git a/blogs/en/DorisInternals/doris-storage-writer-delete.md b/blogs/en/DorisInternals/doris-storage-writer-delete.md
deleted file mode 100644
index e6b6efbf545..00000000000
--- a/blogs/en/DorisInternals/doris-storage-writer-delete.md
+++ /dev/null
@@ -1,150 +0,0 @@
---
{
    "title": "Apache Doris Storage Layer Design, Part 2: Write Process and Delete Process Analysis",
    "description": "This article introduces in detail the internal implementation of the Doris system during the data writing process, as well as the implementation of Doris's conditional data deletion and batch deletion by key.",
    "date": "2022-05-20",
    "metaTitle": "Apache Doris Storage Layer Design, Part 2: Write Process and Delete Process Analysis",
    "isArticle": true,
    "language": "en",
    "author": "ApacheDoris",
    "layout": "Article",
    "sidebar": false,
    "categories": "DorisInternals",
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Apache Doris Storage Layer Design, Part 2: Write Process and Delete Process Analysis

## 1 Overall introduction

Doris is an interactive SQL data warehouse based on the MPP architecture, mainly used for near real-time reporting and multi-dimensional analysis. Doris's efficient import and query performance is inseparable from the sophisticated design of its storage structure.

This article analyzes the implementation principles of the storage layer by reading the code of the Doris BE module, and explains the core technology behind Doris's efficient write and query capabilities, covering Doris's columnar storage design, index design, data read and write processes, the Compaction process, version management of Tablets and Rowsets, data backup, and related functions.

The storage layer structure described here is the Segment V2 format, including rich features such as ordered storage, sparse indexes, prefix indexes, bitmap indexes, and BloomFilters, which provide extremely fast query capabilities for various complex scenarios.

This article introduces in detail the internal implementation of the Doris system during the data writing process, as well as the implementation of Doris's conditional data deletion and batch deletion by key.

## 2 Glossary

- **FE:** Frontend, the frontend node of Doris, mainly responsible for receiving and returning client requests, metadata and cluster management, and query plan generation.
- **BE:** Backend, the backend node of Doris, mainly responsible for data storage and management, query plan execution, and other work.
- **Tablet:** the actual physical storage unit of a table. After partitioning and bucketing, a table is stored in units of Tablets in the distributed storage layer formed by the BEs. Each Tablet includes meta information and several consecutive RowSets.
- **Rowset:** the data set of one data change on a Tablet; data changes include imports, deletes, updates, etc. Rowsets are recorded by version information: each change generates a new version.
- **Version:** consists of two attributes, Start and End, and maintains the record information of data changes. It is usually used to denote the version range of a Rowset: a new import generates a Rowset whose Start and End are equal, and Compaction generates a Rowset with a version range.
- **Segment:** a data segment within a Rowset; multiple Segments form a Rowset.
- **Compaction:** the process of merging consecutive versions of Rowsets; the data is compacted during the merge.

## 3 Write process

Doris supports various write methods for different scenarios, including Broker Load for importing from external storage systems, Stream Load for synchronous import over HTTP, Routine Load for routine import, and Insert Into writes.
At the same time, the import process involves the FE module (mainly responsible for import plan generation and import task scheduling), the BE module (mainly responsible for the ETL and storage of data), and the Broker module (providing Doris with the abilit [...]

The following takes Stream Load as an example to describe Doris's overall data writing process, as shown in the following figure:

The process is as follows:

1. The FE receives the user's write request, randomly selects a BE as the Coordinator BE, and redirects the user's request to this BE.
2. The Coordinator BE receives the user's data write request, and at the same time asks the FE to generate an execution plan and to schedule and manage the import task LoadJob and the import transaction.
3. The Coordinator BE schedules the execution of the import plan and performs data validation and cleaning.
4. The data is written to the BE storage layer. In this process it is first written to memory, and after a certain amount of data accumulates, it is written to physical disk in the storage layer's data format.

This article mainly introduces the detailed process of writing data into the BE storage layer; the rest of the process is not described in detail.

### 3.1 Data distribution process

After the data is cleaned and filtered, it is sent in batches to the BE nodes of the storage layer through Open/AddBatch requests. A BE supports multiple LoadJob tasks writing concurrently; LoadChannelMgr manages these tasks and distributes the data.

The data distribution and writing process is shown in the following figure:

1. Each import task LoadJob creates a LoadChannel to execute it. The LoadChannel maintains one import channel and can write data in batches until the import completes.
2. The LoadChannel creates TabletsChannels to perform the concrete import operations. A TabletsChannel corresponds to multiple Tablets. In one batch write, the TabletsChannel distributes the data to the corresponding Tablets, and the DeltaWriters write the data into the Tablets; the real write operation begins here.

### 3.2 DeltaWriter and Memtable

DeltaWriter is mainly responsible for continuously receiving newly written batches of data and completing the data write of a single Tablet. Because the new data can be an incremental Delta part, it is called DeltaWriter.

DeltaWriter uses an LSM-tree-like structure for writing. Data is first written to a Memtable; when the Memtable is full, it is flushed asynchronously to disk to produce a Segment for persistence, and a new Memtable is created to receive subsequent data. The flush operation is performed by the MemtableFlushExecutor.

Within a Memtable, a skip list keeps the data sorted; the sort rule compares the fields in turn according to the key order of the schema. This ensures that the data within each written Segment is ordered. If the current model is not the DUP model (i.e. for the AGG and UNIQUE models), rows with the same key are also aggregated in the Memtable (a sketch of this insert-or-aggregate step follows).
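A minimal sketch of the Memtable's ordered insert-or-aggregate step, using std::map as a stand-in for the skip list and SUM as a stand-in for the aggregation function; the 100M flush threshold is the default mentioned in 3.3.2, and the DUP model, which keeps duplicate rows, is not modeled here.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy schema: composite key -> one value column.
using Key = std::vector<int64_t>;

class Memtable {
public:
    // Keys stay sorted on insert; same-key rows are aggregated (AGG/UNIQUE).
    void insert(const Key& k, int64_t value) {
        auto [it, inserted] = rows_.try_emplace(k, value);  // ordered insert
        if (!inserted) it->second += value;  // same key: fold into existing row
        bytes_ += sizeof(int64_t) * (k.size() + 1);
    }

    bool should_flush() const { return bytes_ >= 100 * 1024 * 1024; }  // ~100M default

    // Flush hands rows out in key order, so each Segment is written sorted.
    template <typename Writer>
    void flush(Writer&& write_row) {
        for (const auto& [k, v] : rows_) write_row(k, v);
        rows_.clear();
        bytes_ = 0;
    }

private:
    std::map<Key, int64_t> rows_;  // stand-in for the skip list
    size_t bytes_ = 0;
};
```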
### 3.3 Physical write

#### 3.3.1 RowsetWriter module design

Writing at the physical storage level is performed by the RowsetWriter, which is further divided into sub-modules such as SegmentWriter, ColumnWriter, PageBuilder, and IndexBuilder:

1. RowsetWriter completes the write of one import LoadJob task as a whole. One import LoadJob task generates one Rowset, and one Rowset represents one successfully imported data version. In the implementation, the RowsetWriter is responsible for completing the write of the Rowset.
2. SegmentWriter implements the Segment write. A Rowset can consist of multiple Segment files.
3. ColumnWriter is contained in SegmentWriter. A Segment file is a complete columnar structure containing each column and the related index data; the write of each column is handled by a ColumnWriter.
4. In the file storage format, data and indexes are organized by Page. ColumnWriter contains PageBuilders for generating data Pages and IndexBuilders for generating index Pages, which complete the Page writes.
5. Finally, FileWritableBlock performs the actual file reads and writes. For the file storage format, see "Apache Doris Storage Layer Design, Part 1: Storage Structure Analysis".

#### 3.3.2 RowsetWriter writing process

The overall physical write is shown in the following figure (a simplified sketch of this writer hierarchy follows the steps below):

A detailed description of the physical write process:

1. When a Memtable is full (100M by default), its data, already ordered by key, is flushed to disk and written into the RowsetWriter row by row.
2. The RowsetWriter writes the data row by row into the SegmentWriter, and maintains the SegmentWriter currently being written as well as the list of file blocks to write. Each time a Segment is completed, a file block is added.
3. The SegmentWriter writes the data row by row into each ColumnWriter, and writes to the ShortKeyIndexBuilder at the same time. ShortKeyIndexBuilder generates the index Pages of the ShortKeyIndex. For the ShortKeyIndex format, see "Apache Doris Storage Layer Design, Part 1: Storage Structure Analysis".
4. Each ColumnWriter writes the data into its PageBuilder and IndexBuilders: the PageBuilder generates the ColumnData Pages, and the IndexBuilders include OrdinalIndexBuilder (which generates the Page format of the OrdinalIndex sparse row-number index), ZoneMapIndexBuilder (the ZoneMapIndex Pages), BitMapIndexBuilder (the BitMapIndex Pages), and BloomFilterIndexBuilder (the BloomFilterIndex Pages). For details, refer to Doris [...]
5. After the data is added, the RowsetWriter performs a flush operation.
6. The flush of the SegmentWriter writes the data and indexes to disk; the disk reads and writes are performed by FileWritableBlock.
7. Each ColumnWriter writes its data and the Pages generated from the indexes to the file in sequence.
8. The SegmentWriter generates the SegmentFooter, which records the metadata of the Segment file. After completing the write, the RowsetWriter starts a new SegmentWriter and writes the next Memtable into a new Segment, until the import completes.
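A minimal sketch of the writer layering in 3.3.1 and 3.3.2, showing only the call structure: one flushed Memtable becomes one Segment, each row fans out to the per-column writers, and finishing a segment writes the columns and then the footer. The class and method names are simplified stand-ins, not the BE's actual interfaces.

```cpp
#include <cstddef>
#include <vector>

struct RowRef { /* one row handed down from the flushed Memtable */ };

class ColumnWriter {
public:
    void append(const RowRef&) { /* PageBuilder + IndexBuilders consume the cell */ }
    void flush() { /* write data pages, then index pages, to the file */ }
};

class SegmentWriter {
public:
    explicit SegmentWriter(size_t num_columns) : cols_(num_columns) {}
    void append_row(const RowRef& row) {
        for (auto& c : cols_) c.append(row);  // fan the row out per column
        // ... also feed the ShortKeyIndexBuilder every N rows
    }
    void finish() {
        for (auto& c : cols_) c.flush();
        // ... then write the SegmentFooter metadata
    }
private:
    std::vector<ColumnWriter> cols_;
};

class RowsetWriter {
public:
    void add_flushed_memtable(const std::vector<RowRef>& rows, size_t num_columns) {
        SegmentWriter seg(num_columns);  // one new segment per memtable flush
        for (const auto& r : rows) seg.append_row(r);
        seg.finish();
        ++num_segments_;                 // the rowset spans all its segments
    }
private:
    size_t num_segments_ = 0;
};
```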
### 3.4 Rowset publication

When the data import completes, the DeltaWriter publishes the newly generated Rowset. Publishing means setting this version of the Rowset to the visible state, indicating that the imported data has taken effect and can be queried. The version information indicates the order in which Rowsets take effect: one import generates one Rowset, and each successful import increases the version in order. The entire publication process is as follows:

1. The DeltaWriter gathers the current RowsetMeta metadata, including the number of rows, the number of bytes, the time, and the Segments.
2. The metadata is saved into RowsetMeta, and the import transaction is committed to the FE. The import transaction is opened by the FE to ensure that the data imported on every involved BE node takes effect at the same time.
3. After coordination, the FE issues a Publish task to make the imported Rowset version take effect; the effective version is specified in the task. Only then does the BE storage layer make this version of the Rowset visible.
4. The Rowset is added to the Tablet of the BE storage layer for management.

## 4 Delete process

There are currently two implementations of delete: the common delete type DELETE, and LOAD_DELETE.

### 4.1 DELETE execution flow

DELETE supports general delete operations, and its implementation is relatively simple: no data is actually deleted; instead, the delete conditions are recorded and stored in the Meta information. The delete conditions are merged into the Base version when Base Compaction runs. The Base version is the Tablet's first Rowset data version, from [0-x]. The specific process is as follows:

1. For a delete, the FE directly issues the delete command and the delete conditions.
2. The BE starts a local EngineBatchLoadTask, which generates a Rowset of a new version and records the delete condition information. This deletion-record Rowset differs slightly from a write Rowset: it records only the delete condition information and no actual data.
3. The FE likewise publishes the effective version. The Rowset is added to the Tablet, and the TabletMeta information is saved.

### 4.2 LOAD_DELETE execution flow

LOAD_DELETE supports deleting data under the UNIQUE KEY model by importing the keys to be deleted in batches, which makes large-scale data deletion possible. The overall idea is to add a delete-state mark to each data record; the marked keys are then compacted away during the Compaction process. Compaction is mainly responsible for merging multiple Rowset versions and is described in detail in the other articles of this series.

## 5 Summary

This article introduced the write process and the delete process of the Doris storage layer in detail: it first described Doris's overall write process, then analyzed Doris's LSM-like storage design, the in-memory data distribution and physical write processes, and the Rowset version publication process, and finally introduced the two data deletion methods supported by Doris.
--------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org