This is an automated email from the ASF dual-hosted git repository. jiafengzheng pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/incubator-doris-website.git
The following commit(s) were added to refs/heads/master by this push:
     new 39fe56efe8f remove

39fe56efe8f is described below

commit 39fe56efe8f8961190e6f1568bce5d63469ceef2
Author: jiafeng.zhang <zhang...@gmail.com>
AuthorDate: Fri Jun 10 17:08:16 2022 +0800

    remove

    remove
---
 .../doris-storage-reader-compaction.md             | 228 ----------------
 .../DorisInternals/doris-storage-struct-design.md  | 303 ---------------------
 .../DorisInternals/doris-storage-writer-delete.md  | 150 ----------
 3 files changed, 681 deletions(-)

diff --git a/blogs/en/DorisInternals/doris-storage-reader-compaction.md b/blogs/en/DorisInternals/doris-storage-reader-compaction.md
deleted file mode 100644
index 7672ad38eb8..00000000000
--- a/blogs/en/DorisInternals/doris-storage-reader-compaction.md
+++ /dev/null
@@ -1,228 +0,0 @@
---
{
    "title": "Apache Doris Storage Layer Design, Part 3: Read Process and Compaction Analysis",
    "description": "This article introduces in detail the read process inside the Doris system, as well as the Compaction process that merges Rowset versions in the background.",
    "date": "2022-05-20",
    "metaTitle": "Apache Doris Storage Layer Design, Part 3: Read Process and Compaction Analysis",
    "isArticle": true,
    "language": "en",
    "author": "ApacheDoris",
    "layout": "Article",
    "sidebar": false,
    "categories": "DorisInternals",
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Apache Doris Storage Layer Design, Part 3: Read Process and Compaction Analysis

## 1 Overall introduction

Doris is an interactive SQL data warehouse based on the MPP architecture, mainly used for near real-time reporting and multi-dimensional analysis. Doris's efficient import and query performance is inseparable from the sophisticated design of its storage structure.

This article analyzes the implementation principles of the storage layer by reading the code of the Doris BE module, and explains the core technology behind Doris's efficient write and query capabilities, covering Doris's columnar storage design, index design, data read and write processes, the Compaction process, and related functions.

This article introduces in detail the read process inside the Doris system, as well as the Compaction process that merges Rowset versions in the background.

## 2 Read process

### 2.1 Overall read process

The read process is the reverse of the write process, but it is considerably more complicated, mainly because of the large number of read optimizations.
The entire read process is divided into two stages: the init process, and the process of obtaining the next_block data blocks. The specific process is shown in the following figure:

The hierarchical relationship is as follows (a minimal code sketch of this layering appears at the end of section 2.2):

- OlapScanner encapsulates the overall read operation on one tablet's data.
- Reader processes the read parameters and provides differentiated read handling for the three data models.
- CollectIterator holds the multiple RowsetReaders of the tablet, which are ordered by version. CollectIterator merges these RowsetReaders into one unified iterator and provides the merge comparator.
- RowsetReader is responsible for reading one Rowset.
- RowwiseIterator provides a unified iterator over all the Segments in one Rowset. The merge strategy here can be Merge or Union, depending on whether the data is sorted.
- SegmentIterator corresponds to the read of one Segment. It computes the row numbers to read from the query conditions and the indexes, seeks to the corresponding pages, and reads the data. After condition filtering, a bitmap of the accessible rows is generated; BitmapRangeIterator is a separate iterator that walks this bitmap by ranges.
- ColumnIterator provides a unified iterator over one column's data and indexes. ColumnReader and the various IndexReaders correspond to the reading of the actual data and index information.

### 2.2 The main process of the read Init stage

The execution flow of the initialization stage is as follows:

#### 2.2.1 OlapScanner query parameter construction

Find the RowsetReaders that need to be read according to the version specified by the query (this relies on the rowset_graph version path map of version management to obtain the shortest path covering the queried version range), then:

1. Set the query information, including \_tablet, the read type reader_type=READER_QUERY, whether to aggregate, and \_version (from 0 to the specified version);
2. Set the query condition information, including the filter fields and the is_nulls field;
3. Set the returned column information;
4. Set the key_ranges of the query (an array of key ranges, which can be filtered by the short key index);
5. Initialize the Reader object.

#### 2.2.2 Reader's Init process

1. Initialize the conditions query-condition object;
2. Initialize the bloomFilter column set (eq and in conditions on columns with a BloomFilter index);
3. Initialize the delete_handler, which contains all the delete information of the tablet, including the versions and the corresponding arrays of delete conditions;
4. Initialize the columns to be read and returned to the upper layer, including the returned columns and the columns referenced by the condition object;
5. Initialize the RowCursor row-cursor objects corresponding to the start key and end key of key_ranges;
6. Build the RowsetReaders and the CollectIterator from the constructed information: the Rowset objects are initialized, and each RowsetReader is added to the CollectIterator;
7. Call CollectIterator to fetch the current row (actually the first row here), which starts the read process and performs the first read.
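The layering above is essentially a stack of nested iterators. The following is a minimal C++ sketch of that layering; the class names mirror the description, but all signatures and bodies are simplified stand-ins for illustration, not the real BE code.

```cpp
#include <memory>
#include <vector>

struct RowBlock {};  // stand-in for a batch of rows

// Common shape shared by the layers described above (simplified).
class RowIterator {
public:
    virtual ~RowIterator() = default;
    virtual bool next(RowBlock* out) = 0;  // false when exhausted
};

// One segment: resolves row ranges via the indexes, then reads pages.
class SegmentIterator : public RowIterator {
public:
    bool next(RowBlock* /*out*/) override { /* read next row range */ return false; }
};

// One rowset: unions (ordered segments) or merges (unordered) its segments.
class RowsetReader : public RowIterator {
    std::vector<std::unique_ptr<SegmentIterator>> segs_;
public:
    bool next(RowBlock* /*out*/) override { /* delegate to RowwiseIterator */ return false; }
};

// Whole tablet: merges all RowsetReaders in key/version order.
class CollectIterator : public RowIterator {
    std::vector<std::unique_ptr<RowsetReader>> rowsets_;
public:
    bool next(RowBlock* /*out*/) override { /* heap-based merge */ return false; }
};
```

Because every layer exposes the same next interface, the Reader can treat a tablet, a rowset, and a segment uniformly.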
#### 2.2.3 Init process of RowsetReader

Build the SegmentIterators, and filter out the delete conditions in delete_handler whose versions are smaller than the current Rowset version;

Build a RowwiseIterator (an aggregating iterator over the SegmentIterators) and add the SegmentIterators to be read to it. When all segments are ordered overall, the sequential Union iterator is used; otherwise, the merging Merge iterator is used.

#### 2.2.4 Init process of SegmentIterator

1. Initialize the ReadableBlock, the object used to read the current Segment file and perform the actual file reads;
2. Initialize \_row_bitmap, a bitmap structure that stores the row numbers filtered by the indexes;
3. Build a ColumnIterator for each column that needs to be read;
4. If a column has a BitmapIndex, initialize its BitmapIndexIterator;
5. Filter data by the SortkeyIndex. When the query has key_ranges, obtain the row-number range of the matching data through each key_range: according to the upper and lower keys of the key_range, find the corresponding row numbers upper_rowid and lower_rowid through the Segment's SortkeyIndex, and merge the resulting RowRanges into \_row_bitmap;
6. Filter data by the various column indexes, using both the query conditions and the delete conditions:
   - According to the query conditions, use the bitmap index to filter the condition columns that have one, and intersect the resulting row-number lists with \_row_bitmap. Because this filtering is exact, the handled conditions are removed from the Condition object.
   - Use the BloomFilter index to filter data according to the equality conditions (eq, in, is) in the query. For each Page, judge whether the condition can hit the Page, and intersect the row-number ranges of the surviving Pages with \_row_bitmap.
   - Use the ZoneMapIndex to filter data according to the query conditions and the delete conditions, finding the Pages that can intersect the conditions by checking each Page's entry in the ZoneMap. The row-number ranges of the matching Pages are intersected with \_row_bitmap.
7. Use \_row_bitmap to construct a BitmapRangeIterator for the subsequent data reads.

### 2.3 The main process of the read next stage

The execution flow of the next stage is as follows:

#### 2.3.1 Reader reads next_row_with_aggregation

The reader prefetches one row, recorded as the current row. When next is called, the current row is returned, and the following row is prefetched as the new current row.

The reader's read logic is divided into three cases according to the data model (a simplified sketch of these loops follows this section):

Under \_dup_key_next_row (duplicate key model, detailed data), the current row is returned, and then CollectIterator is read directly for the next row, which becomes the new current row.

Under \_agg_key_next_row (aggregate key model), after CollectIterator fetches the next row, the reader checks whether that row has the same key as the current row: if so, it aggregates the row into the current result and continues with the following row in a loop; if not, it returns the accumulated aggregation result and updates the current row.

Under \_unique_key_next_row (unique key model), the logic is the same as for the aggregate key model, with one difference: because the delete operation is supported, the reader checks whether the aggregated current row is marked as a deleted row. Deleted rows are discarded, and the loop continues until a row that is not deleted is found and returned.
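The following is a minimal sketch of the aggregate-key and unique-key loops described above, assuming a toy one-key, one-value row and using SUM as a stand-in for the per-column aggregation function; the real reader works on row batches and applies each column's configured aggregation.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Toy row: one key column, one value column, plus a delete mark.
struct Row { int64_t key; int64_t val; bool delete_mark = false; };

// Sketch of _agg_key_next_row / _unique_key_next_row over a key-ordered stream.
class AggReader {
public:
    AggReader(std::vector<Row> rows, bool unique_key_model)
        : rows_(std::move(rows)), unique_(unique_key_model) {}

    std::optional<Row> next() {
        while (pos_ < rows_.size()) {
            Row acc = rows_[pos_++];
            // Fold all following rows that share the same key (SUM stands in
            // for the real per-column aggregation function).
            while (pos_ < rows_.size() && rows_[pos_].key == acc.key) {
                acc.val += rows_[pos_].val;
                acc.delete_mark = rows_[pos_].delete_mark;  // latest version wins
                ++pos_;
            }
            // Unique key model: skip keys whose final state is "deleted".
            if (unique_ && acc.delete_mark) continue;
            return acc;
        }
        return std::nullopt;
    }

private:
    std::vector<Row> rows_;
    bool unique_;
    size_t pos_ = 0;
};
```

With unique_key_model set, a key whose latest version carries a delete mark is skipped entirely, matching the behavior described for \_unique_key_next_row.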
#### 2.3.2 CollectIterator reads next

CollectIterator uses a heap to maintain the set of RowsetReaders to be read. The comparison rule is: order by the key of each RowsetReader's current row; when the keys are equal, compare the Rowset versions. (A minimal sketch of this k-way merge appears at the end of section 2.3.)

1. CollectIterator pops the topmost RowsetReader from the heap;
2. The popped RowsetReader reads a new row as its current row and is pushed back into the heap for comparison. During this read, the RowsetReader's nextBlock is called to read by RowBlock. (If the fetched block is a page with partial deletes, the current row is also filtered by the delete conditions.)
3. The current row of the RowsetReader now at the top of the heap is returned as the current row.

#### 2.3.3 RowsetReader reads next

RowsetReader directly reads the next_batch of its RowwiseIterator.

RowwiseIterator aggregates the SegmentIterators. When the Segments in the Rowset are ordered overall, the iterator returns rows directly in Union mode; when they are unordered, it returns rows in Merge mode. RowwiseIterator likewise returns the row of the SegmentIterator at the top of the merge order, calling each SegmentIterator's next_batch to fetch the data.

#### 2.3.4 SegmentIterator reads next_batch

1. According to the BitmapRangeIterator constructed in the init stage, use next_range to take out one row-number range (range_from, range_to) to read each time;
2. First read the data of the condition columns for rows range_from through range_to, as follows:
   - Call seek_to_ordinal on each condition column's ColumnIterator, positioning each column's read position current_rowid to the SegmentIterator's cur_rowid. This aligns to the corresponding data page by binary-searching the ordinal_index.
   - Read the condition columns' data and apply the conditions once more (this time as an exact filter).
3. Then read the data of the non-condition columns, put it into the RowBlock, and return the RowBlock.
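The merge performed by CollectIterator in 2.3.2 (and the Merge mode of RowwiseIterator in 2.3.3) is a standard k-way merge over key-ordered child streams. Below is a minimal sketch with std::priority_queue, assuming rows compare by (key, version); the Child type stands in for a RowsetReader together with its prefetched current row.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

struct Row { int64_t key; int64_t version; };

// One key-ordered child stream (a RowsetReader in the real code).
struct Child {
    std::vector<Row> rows;
    size_t pos = 0;
    bool exhausted() const { return pos >= rows.size(); }
    const Row& cur() const { return rows[pos]; }
};

// K-way merge over the children, ordered by (key, version) of each current row.
std::vector<Row> collect_merge(std::vector<Child>& children) {
    auto greater = [](const Child* a, const Child* b) {
        if (a->cur().key != b->cur().key) return a->cur().key > b->cur().key;
        return a->cur().version > b->cur().version;
    };
    std::priority_queue<Child*, std::vector<Child*>, decltype(greater)> heap(greater);
    for (auto& c : children) if (!c.exhausted()) heap.push(&c);

    std::vector<Row> out;
    while (!heap.empty()) {
        Child* top = heap.top();                // child owning the next row in order
        heap.pop();
        out.push_back(top->cur());
        ++top->pos;                             // advance its current row
        if (!top->exhausted()) heap.push(top);  // re-enter the heap for comparison
    }
    return out;
}
```

Popping a child, advancing its current row, and pushing it back is exactly the cycle described in 2.3.2.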
## 3 Compaction process

### 3.1 Overall introduction of Compaction

Doris improves the read performance of incrementally written Rowset files through Compaction. The version information of a Rowset contains two fields, first and second, which represent the merged version range of the Rowset; the first and second versions of an unmerged cumulative rowset are equal. During Compaction, adjacent Rowsets are merged to generate a new Rowset, and the first and second fields of the version information are merged into one larger version. On the other hand, [...]

As shown in the figure above, there are two types of Compaction tasks: base compaction and cumulative compaction. The cumulative_point is the key that divides the two strategies.

It can be understood this way: to the right of cumulative_point are the incremental Rowsets that have never been merged, whose first and second versions are equal; to the left of cumulative_point are the merged Rowsets, whose first version is not equal to the second version.

The base compaction and cumulative compaction task processes are basically the same; the only difference lies in the logic of selecting the InputRowsets to be merged.

### 3.2 Detailed process of Compaction

The overall process of a Compaction merge is shown in the following figure:

#### 3.2.1 Calculate cumulative_point

Select the set of InputRowsets to be merged by the compaction (a sketch of these selection policies follows the lists below):

Base compaction selection conditions:

1. When there are more than 5 non-cumulative rowsets, merge all non-cumulative rowsets;
2. When the disk-usage ratio between the base rowset (the one whose version first is 0) and the other non-cumulative rowsets is less than 10:3, merge all non-cumulative rowsets;
3. Otherwise, do not merge.

Cumulative compaction selection conditions:

1. When the number of segments in the selected Rowset set is greater than or equal to 5 and less than or equal to 1000 (configurable), merge;
2. When the number of output Rowsets is less than 5, but a delete condition's version is greater than a Rowset's second version, merge (so that Rowsets affected by deletes are merged in quickly);
3. When both the accumulated base compaction time and the accumulated cumulative compaction time are greater than 1 day, merge;
4. Otherwise, do not merge.
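A minimal sketch of the two selection policies above, with the thresholds (5 rowsets, the 10:3 disk-usage ratio, 5 to 1000 segments) hard-coded for illustration; the real BE reads these from configuration and also applies the delete-version and 1-day conditions, which are omitted here.

```cpp
#include <cstdint>
#include <vector>

struct Rowset {
    int64_t first, second;   // version range [first, second]
    int64_t bytes;           // disk usage
    int64_t num_segments;
};

// Base compaction: merge all rowsets below cumulative_point when there are
// many of them, or when the base rowset no longer dominates them in size.
bool should_base_compact(const std::vector<Rowset>& non_cumulative) {
    if (non_cumulative.size() > 5) return true;
    int64_t base_bytes = 0, other_bytes = 0;
    for (const auto& rs : non_cumulative) {
        (rs.first == 0 ? base_bytes : other_bytes) += rs.bytes;
    }
    // base : others falling below 10 : 3 triggers a merge
    return other_bytes > 0 && base_bytes * 3 < other_bytes * 10;
}

// Cumulative compaction: merge when the candidates carry enough segments,
// bounded above to cap the task size.
bool should_cumulative_compact(const std::vector<Rowset>& candidates) {
    int64_t segments = 0;
    for (const auto& rs : candidates) segments += rs.num_segments;
    return segments >= 5 && segments <= 1000;
}
```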
#### 3.2.2 Execute compaction

Compaction execution can basically be understood as a read process plus a write process: a Reader is opened over the InputRowsets to be merged, records are read from it through next_row_with_aggregation and written to the output RowsetWriter, producing a new OutputRowset. The version of this Rowset is the full range of the InputRowsets' versions.

#### 3.2.3 Update cumulative_point

Update cumulative_point, and pass the OutputRowset produced by the cumulative compaction on to the subsequent base compaction process.

After Compaction, rows that have the same key but were scattered across different Rowsets in the aggregate key and unique key models are merged, achieving the effect of pre-computation. At the same time, the number of Rowset files is reduced, which improves query efficiency.

## 4 Summary

This article introduced the read-related processes of the Doris storage layer in detail.

The read path relies on the complete columnar storage implementation. For OLAP wide-table scenarios (reading many rows and few columns), it can quickly scan and filter based on the various indexes (including short key, bloom filter, zone map, and bitmap), skipping large amounts of data scanning, and applies optimizations such as late materialization for data analysis in various scenarios; the Compaction execution process is also optimiz [...]

diff --git a/blogs/en/DorisInternals/doris-storage-struct-design.md b/blogs/en/DorisInternals/doris-storage-struct-design.md
deleted file mode 100644
index 7c3e501c9fa..00000000000
--- a/blogs/en/DorisInternals/doris-storage-struct-design.md
+++ /dev/null
@@ -1,303 +0,0 @@
---
{
    "title": "Apache Doris Storage Layer Design, Part 1: Storage Structure Analysis",
    "description": "This article analyzes the implementation principles of the Doris BE storage layer and explains the core technology behind Doris's efficient write and query capabilities, covering Doris's columnar storage design, index design, data read and write processes, the Compaction process, version management of Tablets and Rowsets, data backup, and related functions.",
    "date": "2022-05-20",
    "metaTitle": "Apache Doris Storage Layer Design, Part 1: Storage Structure Analysis",
    "isArticle": true,
    "language": "en",
    "author": "ApacheDoris",
    "layout": "Article",
    "sidebar": false,
    "categories": "DorisInternals",
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Apache Doris Storage Layer Design, Part 1: Storage Structure Analysis

## 1 Overall introduction

Doris is an interactive SQL data warehouse based on the MPP architecture, mainly used for near real-time reporting and multi-dimensional analysis. Doris's efficient import and query performance is inseparable from the sophisticated design of its storage structure.

This article analyzes the implementation principles of the storage layer by reading the code of the Doris BE module, and explains the core technology behind Doris's efficient write and query capabilities, covering Doris's columnar storage design, index design, data read and write processes, the Compaction process, version management of Tablets and Rowsets, data backup, and related functions.

This article introduces the storage layer structure of the Segment V2 format, including rich features such as ordered storage, sparse indexes, prefix indexes, bitmap indexes, and BloomFilters, which provide extremely fast query capabilities for various complex scenarios.

## 2 Design goals

- Bulk import, few updates
- The vast majority of requests are reads
- Wide-table scenarios: read many rows, few columns
- Non-transactional scenarios
- Good scalability

## 3 Storage file format

### 3.1 Storage directory structure

The storage layer manages its data through the storage_root_path configuration, and multiple paths can be configured. The next level of the storage directory is organized by buckets, and the specific tablets are stored under the bucket directories, in subdirectories named by tablet_id.

Segment files are stored in the tablet_id directory and managed by SchemaHash. There can be multiple Segment files; they are generally split by size, 256MB by default. The segment v2 file naming rule is: ${rowset_id}_${segment_id}.dat.
The specific storage directory format is shown in the following figure:

### 3.2 Segment v2 file structure

The overall file format of a Segment is divided into three parts, the data region, the index region, and the footer, as shown in the following figure:

- Data Region: stores the data of each column; the data here is loaded page by page, on demand.
- Index Region: Doris stores the index data of each column together in the Index Region. This data is loaded at column granularity, so it is stored separately from the column data.
- Footer information:
  - SegmentFooterPB: defines the metadata of the file;
  - a 4-byte checksum of the FooterPB content;
  - 4 bytes holding the length of the FileFooterPB message, used to read the FileFooterPB.

The following sections introduce the storage format of each part.

## 4 Footer information

The Footer segment sits at the end of the file. It stores the overall structure of the file, including the locations of the data region and the index region, and consists of 4 parts: SegmentFooterPB, CheckSum, Length, and MAGIC CODE.

The SegmentFooterPB data structure is as follows:

SegmentFooterPB is stored in PB format and mainly includes the meta information of the columns, the meta information of the indexes, the short key index information of the segment, and the total number of rows.

### 4.1 Column meta information

- ColumnId: the ordinal of the column in the schema
- UniqueId: a globally unique id
- Type: the type information of the column
- Length: the length information of the column
- Encoding: the encoding format
- Compression: the compression format
- Dict PagePointer: dictionary information

### 4.2 Column index meta information

- OrdinalIndex: stores the sparse index meta information of the column.
- ZoneMapIndex: stores the meta information of the ZoneMap index, including the maximum value, the minimum value, whether null values exist, and whether non-null values exist. SegmentZoneMap stores the global ZoneMap information, and PageZoneMaps stores the statistics of each page.
- BitMapIndex: stores the meta information of the BitMap index, including the BitMap type, the dictionary data, and the BitMap data.
- BloomFilterIndex: stores the BloomFilter index information.

To prevent the index data itself from growing too large, ZoneMapIndex, BitMapIndex, and BloomFilterIndex adopt two-level Page management, corresponding to the IndexColumnMeta structure. When the index fits in one Page, that Page directly stores the index data, i.e. a one-level structure is used; when it does not fit, the index data is written into new Pages, and the Root Page stores the address information of those data Pages.
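Given the four footer parts above, a reader can locate the SegmentFooterPB by reading backwards from the end of the file. The sketch below assumes the tail byte order FooterPB, checksum(4), length(4), magic(4), a little-endian encoding, and the magic value "D0R1"; these details are assumptions for illustration, not verified against the BE code.

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Assumed tail layout: ... | FooterPB | checksum(4) | length(4) | magic(4)
bool read_segment_footer(const std::string& path, std::vector<char>* footer_pb) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return false;
    const int64_t file_size = static_cast<int64_t>(in.tellg());
    if (file_size < 12) return false;

    char tail[12];
    in.seekg(file_size - 12);
    in.read(tail, 12);

    uint32_t checksum = 0, length = 0;
    std::memcpy(&checksum, tail, 4);     // checksum of the FooterPB content
    std::memcpy(&length, tail + 4, 4);   // size of the serialized FooterPB
    if (std::memcmp(tail + 8, "D0R1", 4) != 0) return false;  // magic (assumed value)

    if (file_size < 12 + static_cast<int64_t>(length)) return false;
    footer_pb->resize(length);
    in.seekg(file_size - 12 - length);
    in.read(footer_pb->data(), length);  // caller parses this as SegmentFooterPB
    // (checksum verification omitted in this sketch)
    return true;
}
```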
## 5 Ordinal Index

The Ordinal Index maps row numbers to the physical addresses of the Column Data Pages. It aligns the column-stored data by row and can be understood as a first-level index: when the other indexes look up data, the Ordinal Index is used to find the location of the data Page. It is therefore introduced first.

In a segment, data is always stored in the sorted order of the keys (AGGREGATE KEY, UNIQ KEY, and DUPLICATE KEY); that is, the key ordering determines the physical layout of the data, including the physical order of the column data. When writing data, the Column Data Pages are managed by the Ordinal index, which records each Column Data Page's position offset, size, and the row number of its first data item, namely its ordinal. In this way, each colu [...]

### 5.1 Storage structure

The Ordinal index meta information of each column is stored in the OrdinalIndexMeta of the SegmentFooterPB. The specific structure is shown in the following figure:

OrdinalIndexMeta stores the root page address of the index data. An optimization is applied here: when the index data has only one page, the address can point directly to that only data page; when one page is not enough, it points to a second-level index page of the OrdinalIndex type, and each entry of the index data records a Column Data Page's offset, size, and ordinal row number. The Ordinal index index granularity [...]

## 6 Column data storage

### 6.1 Data Page storage structure

A DataPage is mainly divided into two parts: the Data part and the Page Footer.

The Data part stores the column data of the current Page. When null values are allowed, a separate null-value Bitmap is stored, recording the row numbers of the null values through bool-typed RLE encoding.

The Page Footer contains Type (the Page type), UncompressedSize (the size of the uncompressed data), FirstOrdinal (the RowId of the first row of the current Page), NumValues (the number of rows in the current Page), and NullMapSize (the size of the NullBitmap).

### 6.2 Data compression

Different encodings are used for different field types. The default correspondences between types and encodings are as follows:

The data is compressed in LZ4F format by default.

## 7 Short Key Index

### 7.1 Storage structure

The Short Key Index is a prefix index for quickly finding data by a given prefix of columns, based on the key ordering (AGGREGATE KEY, UNIQ KEY, and DUPLICATE KEY). It is also a sparse index: during data writing, an index item is generated every fixed number of rows. The index granularity is 1024 rows by default and is configurable. The process is shown in the following figure:

KeyBytes stores the index item data, and OffsetBytes stores each index item's offset within KeyBytes.

### 7.2 Index generation rules

The Short Key Index uses the first 36 bytes of a row as its prefix index entry. The prefix is simply truncated when a VARCHAR type is encountered (a small sketch of this rule follows the examples below).

### 7.3 Application cases

(1) For the following table structure, the prefix index is user_id (8 Bytes) + age (4 Bytes) + message (prefix, 24 Bytes).

(2) For the following table structure, the prefix index is user_name (20 Bytes). Even though 36 bytes are not reached, because a VARCHAR is encountered, the prefix is truncated directly and does not continue further.

When the query condition is a prefix of the prefix index, the query can be greatly accelerated. For example, on the first table, we execute the following query:

```sql
SELECT * FROM table WHERE user_id=1829239 and age=20;
```

This query will be much more efficient than the following one:

```sql
SELECT * FROM table WHERE age=20;
```

Therefore, when creating a table, choosing the correct column order can greatly improve query efficiency.
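A minimal sketch of the 36-byte prefix rule in 7.2, with a simplified column model; how the BE actually serializes fixed-length and VARCHAR key columns is more involved than this.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct ColumnValue {
    std::string bytes;   // the encoded column value
    bool is_varchar;
};

// Build one short key entry: concatenate leading key columns up to 36 bytes,
// truncating (and stopping entirely) at the first VARCHAR column.
std::string build_short_key(const std::vector<ColumnValue>& key_columns,
                            size_t max_bytes = 36) {
    std::string prefix;
    for (const auto& col : key_columns) {
        size_t room = max_bytes - prefix.size();
        if (room == 0) break;
        if (col.is_varchar) {
            // VARCHAR: take at most the remaining room, then stop.
            prefix.append(col.bytes, 0, std::min(room, col.bytes.size()));
            break;
        }
        if (col.bytes.size() > room) {   // fixed-length column does not fit
            prefix.append(col.bytes, 0, room);
            break;
        }
        prefix.append(col.bytes);
    }
    return prefix;
}
```

For table (1), user_id (8 bytes) and age (4 bytes) fit whole and message contributes the remaining 24 bytes; for table (2), the VARCHAR user_name stops the prefix at 20 bytes.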
## 8 ZoneMap Index

The ZoneMap index stores statistics for the Segment and for each Page of each column. These statistics help speed up queries and reduce the amount of scanned data. The statistics are Min (the minimum value), Max (the maximum value), HasNull (whether null values exist), and HasNotNull (whether non-null values exist).

### 8.1 Storage structure

The index storage structure of the ZoneMap is shown in the following figure:

In the SegmentFooterPB structure, each column's index metadata ColumnIndexMeta stores the ZoneMapIndex data of that column. ZoneMapIndex has two parts, SegmentZoneMap and PageZoneMaps: SegmentZoneMap stores the global ZoneMap index information of the current Segment, and PageZoneMaps stores the ZoneMap index information of each Data Page.

PageZoneMaps stores the Page information of the index data in an IndexedColumnMeta structure. In the current implementation there is no compression, and the encoding is Plain. The OrdinalIndexPage in IndexedColumnMeta points to the offset and size of the root page of the index data. The two-level Page optimization also applies here: when there is only one DataPage, OrdinalIndexMeta points directly to this DataPage; when there are multiple DataPages, Ordi [...]

### 8.2 Index generation rules

Doris enables the ZoneMap index for the key columns by default; when the table model is DUPLICATE, the ZoneMap index is enabled for all fields. When column data is written to a Page, the data is compared automatically, and the ZoneMap index information of the current Segment and of the current Page is maintained continuously.

### 8.3 Application cases

During a query, fields filtered by range conditions use the ZoneMap statistics to select the data ranges to scan. For example, in case 1, with a filter on the age field, the query statement is as follows:

```sql
SELECT * FROM table WHERE age > 20 and age < 1000
```

If the Short Key Index is not hit, the ZoneMap index is used to find the ordinal range of the data that should be scanned according to the age conditions, reducing the number of Pages to scan.

## 9 BloomFilter

Doris provides a BloomFilter index for fields with high cardinality that cannot use the Short Key Index.

### 9.1 Storage structure

The storage structure of the BloomFilter is shown in the following figure:

BloomFilterIndex stores the Hash strategy, the Hash algorithm, and the Page information of the corresponding BloomFilter data. The Hash algorithm is HASH_MURMUR3, the Hash strategy is the BlockSplitBloomFilter block implementation, and the expected false positive rate fpp is 0.05 by default (configurable).

The storage of the data Pages of the BloomFilter index is similar to that of the ZoneMapIndex, including the two-level Page optimization, and is not described again here.

### 9.2 Index generation rules

BloomFilters are generated at Page granularity: when a full Page of data has been written, Doris generates the BloomFilter index data of this Page according to the Hash strategy. Currently the bloom filter does not support the tinyint/hll/float/double types; the other types are supported. To use it, specify the fields to be indexed in bloom_filter_columns in PROPERTIES.
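Conceptually, each Page's BloomFilter answers "might this value appear in this Page?". Below is a minimal page-pruning sketch with a toy double-probe bloom filter; the real BE uses HASH_MURMUR3 and the BlockSplitBloomFilter layout named above, which this sketch does not reproduce.

```cpp
#include <bitset>
#include <functional>
#include <string>
#include <vector>

// Toy bloom filter: two std::hash-derived probes into a fixed bitset.
struct PageBloomFilter {
    std::bitset<4096> bits;
    void add(const std::string& v) {
        size_t h = std::hash<std::string>{}(v);
        bits.set(h % bits.size());
        bits.set((h >> 16) % bits.size());
    }
    bool might_contain(const std::string& v) const {
        size_t h = std::hash<std::string>{}(v);
        return bits.test(h % bits.size()) && bits.test((h >> 16) % bits.size());
    }
};

// Keep only the pages whose filter might contain the predicate value
// (an eq/in condition); all other pages are skipped without being read.
std::vector<int> prune_pages(const std::vector<PageBloomFilter>& page_filters,
                             const std::string& value) {
    std::vector<int> candidate_pages;
    for (int i = 0; i < static_cast<int>(page_filters.size()); ++i) {
        if (page_filters[i].might_contain(value)) candidate_pages.push_back(i);
    }
    return candidate_pages;
}
```

A miss proves the value is absent from the Page; a hit only means the Page must still be read and filtered exactly, which is why the false positive rate matters.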
### 9.3 Application cases

When querying, the conditions are filtered on the fields with a bloom filter. When the bloom filter misses, it means the Page contains no such data, which reduces the number of Pages to scan.

Case: the schema of the table is as follows:

The query SQL is as follows:

```sql
SELECT * FROM table WHERE name = 'Zhang San'
```

Because name has high cardinality, a BloomFilter index, PROPERTIES ( "bloom_filter_columns" = "name" ), is added on the name column to improve the query performance of this SQL. At query time, the BloomFilter index can filter out a large number of Pages.

## 10 Bitmap Index

Doris also provides a BitmapIndex to speed up data queries.

### 10.1 Storage structure

The Bitmap storage format is as follows:

The meta information of the BitmapIndex is also stored in the SegmentFooterPB. The BitmapIndex includes three parts: the BitMap type, the dictionary information DictColumn, and the bitmap index data BitMapColumn. DictColumn and BitMapColumn correspond to the IndexedColumnData structure and store the Page address offsets and sizes of the dictionary data and the index data respectively. The two-level Page optimization also applies here and is not explained again.

The difference from the other index storage structures is that the DictColumn dictionary data is LZ4F-compressed, and the first value of each Data Page is stored when recording the second-level Page offsets.

### 10.2 Index generation rules

A BitMap index is created through CREATE INDEX. The Bitmap index covers the Column field of the entire Segment, rather than generating a separate copy per Page. When writing data, a map structure is maintained to record the row numbers corresponding to each key value, and Roaring bitmaps are used to encode the rowids. The main structure is as follows:

When generating the index data, the dictionary data is written first: the keys of the map structure are written into DictColumn. Then each key's Roaring-encoded rowid bitmap is written into BitMapColumn as bytes.

### 10.3 Application cases

When querying, bitmap indexes can optimize columns with low selectivity and small cardinality, such as gender, marital status, and geographic information.

Case: the schema of the table is as follows:

The query SQL is as follows:

```sql
SELECT * FROM table WHERE city in ("Beijing", "Shanghai")
```

Since city has relatively few distinct values, once the data dictionary and the bitmaps are built, matching rows can be found quickly by scanning the bitmaps. After bitmap compression, the data volume itself is small, so the entire column can be matched exactly while scanning little data.

## 11 Index query process

When querying data in a Segment, the data is first filtered through the field indexes according to the query conditions, and then read. The overall query process is as follows:

1. First, a row_bitmap is constructed according to the number of rows in the Segment, recording which rows need to be read. If no index is used, all rows need to be read.
2. When a key is used in the query condition and it follows the prefix-index rule, the ShortKey Index is filtered first, and the ordinal row range matched in the ShortKey Index is intersected into row_bitmap.
3. When a column field in the query condition has a BitMap index, the ordinal row numbers that satisfy the condition are found directly from the BitMap index and intersected with row_bitmap. Because this filtering is exact, the condition is removed from the query conditions, and this field is not filtered again by the later indexes.
4. When a column field in the query condition has a BloomFilter index and the condition is an equality (eq, in, is), the data is filtered through the BloomFilter index: all the index entries are walked, the BloomFilter of each Page is probed, and all the Pages that the condition might hit are found. The ordinal row ranges in their index information are intersected with row_bitmap.
5. When a column field in the query condition has a ZoneMap index, the data is filtered through the ZoneMap index: all the index entries are walked, and all the Pages whose ZoneMap can intersect the condition are found. The ordinal row ranges in their index information are intersected with row_bitmap.
6. After row_bitmap is generated, the specific Data Pages are found in batches through the OrdinalIndex of each Column.
7. The Column Data Pages of each column are read in batches. When reading a Page with null values, whether the current row is null is judged by the null-value bitmap; nulls are filled in directly.

A simplified sketch of this bitmap combination follows.
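The steps above amount to intersecting row sets produced by several indexes into one bitmap and then reading only those rows. A minimal sketch using std::set&lt;uint32_t&gt; as a stand-in for the Roaring bitmap the BE uses, treating each index lookup as an optional pre-computed row set; all names are illustrative.

```cpp
#include <cstdint>
#include <set>

using RowSet = std::set<uint32_t>;  // stand-in for a roaring bitmap of row ids

RowSet intersect(const RowSet& a, const RowSet& b) {
    RowSet out;
    for (uint32_t r : a) if (b.count(r)) out.insert(r);
    return out;
}

// Sketch of the row_bitmap construction in section 11. Each argument stands
// for one index lookup described above (nullptr when that index is unused).
RowSet build_row_bitmap(uint32_t num_rows,
                        const RowSet* short_key_rows,     // step 2
                        const RowSet* bitmap_index_rows,  // step 3
                        const RowSet* bloom_page_rows,    // step 4
                        const RowSet* zonemap_page_rows)  // step 5
{
    RowSet rows;
    for (uint32_t r = 0; r < num_rows; ++r) rows.insert(r);  // step 1: all rows
    if (short_key_rows)    rows = intersect(rows, *short_key_rows);
    if (bitmap_index_rows) rows = intersect(rows, *bitmap_index_rows);
    if (bloom_page_rows)   rows = intersect(rows, *bloom_page_rows);
    if (zonemap_page_rows) rows = intersect(rows, *zonemap_page_rows);
    return rows;  // steps 6-7 then read only these rows via the ordinal index
}
```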
## 12 Summary

Doris adopts a complete columnar storage structure and provides rich indexes for different query scenarios, laying a solid foundation for its efficient write and query performance. The Doris storage layer is designed to be flexible, so functions such as new indexes and enhanced data deletion can be added in the future.

diff --git a/blogs/en/DorisInternals/doris-storage-writer-delete.md b/blogs/en/DorisInternals/doris-storage-writer-delete.md
deleted file mode 100644
index e6b6efbf545..00000000000
--- a/blogs/en/DorisInternals/doris-storage-writer-delete.md
+++ /dev/null
@@ -1,150 +0,0 @@
---
{
    "title": "Apache Doris Storage Layer Design, Part 2: Write Process and Delete Process Analysis",
    "description": "This article introduces in detail the internal implementation of the Doris system during the data writing process, as well as the implementation of Doris's conditional data deletion and batch deletion by key.",
    "date": "2022-05-20",
    "metaTitle": "Apache Doris Storage Layer Design, Part 2: Write Process and Delete Process Analysis",
    "isArticle": true,
    "language": "en",
    "author": "ApacheDoris",
    "layout": "Article",
    "sidebar": false,
    "categories": "DorisInternals",
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Apache Doris Storage Layer Design, Part 2: Write Process and Delete Process Analysis

## 1 Overall introduction

Doris is an interactive SQL data warehouse based on the MPP architecture, mainly used for near real-time reporting and multi-dimensional analysis. Doris's efficient import and query performance is inseparable from the sophisticated design of its storage structure.

This article analyzes the implementation principles of the storage layer by reading the code of the Doris BE module, and explains the core technology behind Doris's efficient write and query capabilities, covering Doris's columnar storage design, index design, data read and write processes, the Compaction process, version management of Tablets and Rowsets, data backup, and related functions.

The storage layer structure described here is the Segment V2 format, including rich features such as ordered storage, sparse indexes, prefix indexes, bitmap indexes, and BloomFilters, which provide extremely fast query capabilities for various complex scenarios.

This article introduces in detail the internal implementation of the Doris system during the data writing process, as well as the implementation of Doris's conditional data deletion and batch deletion by key.

## 2 Glossary

- **FE:** Frontend, the frontend node of Doris, mainly responsible for receiving and returning client requests, metadata and cluster management, and query plan generation.
- **BE:** Backend, the backend node of Doris, mainly responsible for data storage and management, query plan execution, and other work.
- **Tablet:** the actual physical storage unit of a table. After partitioning and bucketing, a table is stored in units of Tablets in the distributed storage layer formed by the BEs. Each Tablet includes meta information and several consecutive RowSets.
- **Rowset:** the data set of one data change on a Tablet; data changes include imports, deletes, updates, etc. Rowsets are recorded by version information: each change generates a new version.
- **Version:** consists of two attributes, Start and End, and maintains the record information of data changes. It is usually used to denote the version range of a Rowset: a new import generates a Rowset whose Start and End are equal, and Compaction generates a Rowset with a version range.
- **Segment:** a data segment within a Rowset; multiple Segments form a Rowset.
- **Compaction:** the process of merging consecutive versions of Rowsets; the data is compacted during the merge.

## 3 Write process

Doris supports various write methods for different scenarios, including Broker Load for importing from external storage systems, Stream Load for synchronous import over HTTP, Routine Load for routine import, and Insert Into writes.
At the same time, the import process involves the FE module (mainly responsible for import plan generation and import task scheduling), the BE module (mainly responsible for the ETL and storage of data), and the Broker module (providing Doris with the abilit [...]

The following takes Stream Load as an example to describe Doris's overall data writing process, as shown in the following figure:

The process is as follows:

1. The FE receives the user's write request, randomly selects a BE as the Coordinator BE, and redirects the user's request to this BE.
2. The Coordinator BE receives the user's data write request, and at the same time asks the FE to generate an execution plan and to schedule and manage the import task LoadJob and the import transaction.
3. The Coordinator BE schedules the execution of the import plan and performs data validation and cleaning.
4. The data is written to the BE storage layer. In this process it is first written to memory, and after a certain amount of data accumulates, it is written to physical disk in the storage layer's data format.

This article mainly introduces the detailed process of writing data into the BE storage layer; the rest of the process is not described in detail.

### 3.1 Data distribution process

After the data is cleaned and filtered, it is sent in batches to the BE nodes of the storage layer through Open/AddBatch requests. A BE supports multiple LoadJob tasks writing concurrently; LoadChannelMgr manages these tasks and distributes the data.

The data distribution and writing process is shown in the following figure:

1. Each import task LoadJob creates a LoadChannel to execute it. The LoadChannel maintains one import channel and can write data in batches until the import completes.
2. The LoadChannel creates TabletsChannels to perform the concrete import operations. A TabletsChannel corresponds to multiple Tablets. In one batch write, the TabletsChannel distributes the data to the corresponding Tablets, and the DeltaWriters write the data into the Tablets; the real write operation begins here.

### 3.2 DeltaWriter and Memtable

DeltaWriter is mainly responsible for continuously receiving newly written batches of data and completing the data write of a single Tablet. Because the new data can be an incremental Delta part, it is called DeltaWriter.

DeltaWriter uses an LSM-tree-like structure for writing. Data is first written to a Memtable; when the Memtable is full, it is flushed asynchronously to disk to produce a Segment for persistence, and a new Memtable is created to receive subsequent data. The flush operation is performed by the MemtableFlushExecutor.

Within a Memtable, a skip list keeps the data sorted; the sort rule compares the fields in turn according to the key order of the schema. This ensures that the data within each written Segment is ordered. If the current model is not the DUP model (i.e. for the AGG and UNIQUE models), rows with the same key are also aggregated in the Memtable (a sketch of this insert-or-aggregate step follows).
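A minimal sketch of the Memtable's ordered insert-or-aggregate step, using std::map as a stand-in for the skip list and SUM as a stand-in for the aggregation function; the 100M flush threshold is the default mentioned in 3.3.2, and the DUP model, which keeps duplicate rows, is not modeled here.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy schema: composite key -> one value column.
using Key = std::vector<int64_t>;

class Memtable {
public:
    // Keys stay sorted on insert; same-key rows are aggregated (AGG/UNIQUE).
    void insert(const Key& k, int64_t value) {
        auto [it, inserted] = rows_.try_emplace(k, value);  // ordered insert
        if (!inserted) it->second += value;  // same key: fold into existing row
        bytes_ += sizeof(int64_t) * (k.size() + 1);
    }

    bool should_flush() const { return bytes_ >= 100 * 1024 * 1024; }  // ~100M default

    // Flush hands rows out in key order, so each Segment is written sorted.
    template <typename Writer>
    void flush(Writer&& write_row) {
        for (const auto& [k, v] : rows_) write_row(k, v);
        rows_.clear();
        bytes_ = 0;
    }

private:
    std::map<Key, int64_t> rows_;  // stand-in for the skip list
    size_t bytes_ = 0;
};
```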
### 3.3 Physical write

#### 3.3.1 RowsetWriter module design

Writing at the physical storage level is performed by the RowsetWriter, which is further divided into sub-modules such as SegmentWriter, ColumnWriter, PageBuilder, and IndexBuilder:

1. RowsetWriter completes the write of one import LoadJob task as a whole. One import LoadJob task generates one Rowset, and one Rowset represents one successfully imported data version. In the implementation, the RowsetWriter is responsible for completing the write of the Rowset.
2. SegmentWriter implements the Segment write. A Rowset can consist of multiple Segment files.
3. ColumnWriter is contained in SegmentWriter. A Segment file is a complete columnar structure containing each column and the related index data; the write of each column is handled by a ColumnWriter.
4. In the file storage format, data and indexes are organized by Page. ColumnWriter contains PageBuilders for generating data Pages and IndexBuilders for generating index Pages, which complete the Page writes.
5. Finally, FileWritableBlock performs the actual file reads and writes. For the file storage format, see "Apache Doris Storage Layer Design, Part 1: Storage Structure Analysis".

#### 3.3.2 RowsetWriter writing process

The overall physical write is shown in the following figure (a simplified sketch of this writer hierarchy follows the steps below):

A detailed description of the physical write process:

1. When a Memtable is full (100M by default), its data, already ordered by key, is flushed to disk and written into the RowsetWriter row by row.
2. The RowsetWriter writes the data row by row into the SegmentWriter, and maintains the SegmentWriter currently being written as well as the list of file blocks to write. Each time a Segment is completed, a file block is added.
3. The SegmentWriter writes the data row by row into each ColumnWriter, and writes to the ShortKeyIndexBuilder at the same time. ShortKeyIndexBuilder generates the index Pages of the ShortKeyIndex. For the ShortKeyIndex format, see "Apache Doris Storage Layer Design, Part 1: Storage Structure Analysis".
4. Each ColumnWriter writes the data into its PageBuilder and IndexBuilders: the PageBuilder generates the ColumnData Pages, and the IndexBuilders include OrdinalIndexBuilder (which generates the Page format of the OrdinalIndex sparse row-number index), ZoneMapIndexBuilder (the ZoneMapIndex Pages), BitMapIndexBuilder (the BitMapIndex Pages), and BloomFilterIndexBuilder (the BloomFilterIndex Pages). For details, refer to Doris [...]
5. After the data is added, the RowsetWriter performs a flush operation.
6. The flush of the SegmentWriter writes the data and indexes to disk; the disk reads and writes are performed by FileWritableBlock.
7. Each ColumnWriter writes its data and the Pages generated from the indexes to the file in sequence.
8. The SegmentWriter generates the SegmentFooter, which records the metadata of the Segment file. After completing the write, the RowsetWriter starts a new SegmentWriter and writes the next Memtable into a new Segment, until the import completes.
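A minimal sketch of the writer layering in 3.3.1 and 3.3.2, showing only the call structure: one flushed Memtable becomes one Segment, each row fans out to the per-column writers, and finishing a segment writes the columns and then the footer. The class and method names are simplified stand-ins, not the BE's actual interfaces.

```cpp
#include <cstddef>
#include <vector>

struct RowRef { /* one row handed down from the flushed Memtable */ };

class ColumnWriter {
public:
    void append(const RowRef&) { /* PageBuilder + IndexBuilders consume the cell */ }
    void flush() { /* write data pages, then index pages, to the file */ }
};

class SegmentWriter {
public:
    explicit SegmentWriter(size_t num_columns) : cols_(num_columns) {}
    void append_row(const RowRef& row) {
        for (auto& c : cols_) c.append(row);  // fan the row out per column
        // ... also feed the ShortKeyIndexBuilder every N rows
    }
    void finish() {
        for (auto& c : cols_) c.flush();
        // ... then write the SegmentFooter metadata
    }
private:
    std::vector<ColumnWriter> cols_;
};

class RowsetWriter {
public:
    void add_flushed_memtable(const std::vector<RowRef>& rows, size_t num_columns) {
        SegmentWriter seg(num_columns);  // one new segment per memtable flush
        for (const auto& r : rows) seg.append_row(r);
        seg.finish();
        ++num_segments_;                 // the rowset spans all its segments
    }
private:
    size_t num_segments_ = 0;
};
```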
### 3.4 Rowset publication

When the data import completes, the DeltaWriter publishes the newly generated Rowset. Publishing means setting this version of the Rowset to the visible state, indicating that the imported data has taken effect and can be queried. The version information indicates the order in which Rowsets take effect: one import generates one Rowset, and each successful import increases the version in order. The entire publication process is as follows:

1. The DeltaWriter gathers the current RowsetMeta metadata, including the number of rows, the number of bytes, the time, and the Segments.
2. The metadata is saved into RowsetMeta, and the import transaction is committed to the FE. The import transaction is opened by the FE to ensure that the data imported on every involved BE node takes effect at the same time.
3. After coordination, the FE issues a Publish task to make the imported Rowset version take effect; the effective version is specified in the task. Only then does the BE storage layer make this version of the Rowset visible.
4. The Rowset is added to the Tablet of the BE storage layer for management.

## 4 Delete process

There are currently two implementations of delete: the common delete type DELETE, and LOAD_DELETE.

### 4.1 DELETE execution flow

DELETE supports general delete operations, and its implementation is relatively simple: no data is actually deleted; instead, the delete conditions are recorded and stored in the Meta information. The delete conditions are merged into the Base version when Base Compaction runs. The Base version is the Tablet's first Rowset data version, from [0-x]. The specific process is as follows:

1. For a delete, the FE directly issues the delete command and the delete conditions.
2. The BE starts a local EngineBatchLoadTask, which generates a Rowset of a new version and records the delete condition information. This deletion-record Rowset differs slightly from a write Rowset: it records only the delete condition information and no actual data.
3. The FE likewise publishes the effective version. The Rowset is added to the Tablet, and the TabletMeta information is saved.

### 4.2 LOAD_DELETE execution flow

LOAD_DELETE supports deleting data under the UNIQUE KEY model by importing the keys to be deleted in batches, which makes large-scale data deletion possible. The overall idea is to add a delete-state mark to each data record; the marked keys are then compacted away during the Compaction process. Compaction is mainly responsible for merging multiple Rowset versions and is described in detail in the other articles of this series.

## 5 Summary

This article introduced the write process and the delete process of the Doris storage layer in detail: it first described Doris's overall write process, then analyzed Doris's LSM-like storage design, the in-memory data distribution and physical write processes, and the Rowset version publication process, and finally introduced the two data deletion methods supported by Doris.
--------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org