[Proposal] Vectorization Execution Engine optimization for Doris

Lee Happen Fri, 23 Jul 2021 00:16:46 -0700

Motivation

At present, the underlying storage in Doris is column storage. But query 
execution needs to be transferred to the query layer for execution by 
row-to-column first. Such an implementation maybe cause the performance problem。


  *
     *   Column-to-row loss.
  *
     *   Can not get better CPU performance without vectorized execution.

Currently, vectorized execution has been commonly adopted in mainstream MPP 
databases, which has a significant effect on improving CPU utilization. In this 
paper, we investigate the knowledge of vectorization and give a detailed design 
for implementing vectorization on columnar storage.

What is vectorization execution

Organizing data in this batched, columnar fashion is the primary prerequisite 
for using SIMD CPU instructions, which operate on a vector of data at a time. 
Using SIMD instructions is an eventual goal.

Now suppose there is a table People, the data content is as follows:

Id      Name    Age
101     Ivan    22
115     Peggy   37
114     Victor  45
113     Eve     25
112     Walter  19
109     Trudy   31
108     Bob     27
105     Zoe     29
104     Charlie 42
102     Alice   35

Execute the following query on the table:SELECT Id, Name, Age, (Age - 30) * 50 
AS Bonus FROM People WHERE Age > 30

Volcano model

[image.png]<https://camo.githubusercontent.com/d958b2f88b643b0ea00fc59b591b712e7ceb091591fbe07c1588ddbe861fb447/68747470733a2f2f75706c6f61642d696d616765732e6a69616e7368752e696f2f75706c6f61645f696d616765732f383535323230312d363735653739333964343334616161622e706e673f696d6167654d6f6772322f6175746f2d6f7269656e742f7374726970253743696d61676556696577322f322f772f31323430>

For traditional relational databases, the way it returns data is on a per-row 
basis. The problems are: one function call per line, which interrupts CPU flow 
and is not conducive to branch prediction; high instruction and data cache 
misses; unfriendly compiler, not conducive to loop expansion, and not conducive 
to using CPU acceleration instructions such as SIMD.

Vectorization Execution Engine

[image.png]<https://camo.githubusercontent.com/a26e4e3439257099951845b7c7992af564051a252f4dae9fe580e819cebb5962/68747470733a2f2f75706c6f61642d696d616765732e6a69616e7368752e696f2f75706c6f61645f696d616765732f383535323230312d373433323964303434346236643266392e706e673f696d6167654d6f6772322f6175746f2d6f7269656e742f7374726970253743696d61676556696577322f322f772f31323430>

But for columnar databases, you can actually do it in the form of columns, 
which can reduce the overall cache failure rate. This is equivalent to a 
vectorized execution to achieve such an effect, which is a unique optimization 
technique for columnar databases to reduce CPU consumption and improve the 
overall CPU utilization.

It can bring us the following benefits:

  *
     *   Reduced dummy function calls for branch prediction.
  *
     *   Reduces interpretation overhead by getting a batch of results at a 
time.
  *
     *   Easy for compiler to do loop pipelining and SIMD optimization.
  *
     *   Friendly to CPU L1 and L2 Cache.

How Vectorization Execution Engine For Doris
The key idea and challenge

  *   Design a new memory structure to replace the original RowBatch and Tuple 
structure
  *   Rewrite all operators to support vectorization/columnar computation

1. Data Structure

We have implemented all types natively supported by Doris in Column MemLayout

  *   tinyint
  *   smallint
  *   int
  *   bigint
  *   largeint
  *   boolean
  *   float
  *   double
  *   decimal
  *   date
  *   datetime
  *   HLL
  *   bitmap

2. Function and AggregateFunction

We have implemented more than 80% of Doris's native support functions 
implementation of Vectorized Function and AggregateFunction interfaces, and 
have completed a large number of SIMD support.

And for a more efficient SIMD optimization, the type of NULLABLE for each 
function is identified as follows.

[image.png]<https://camo.githubusercontent.com/69d48e92d7b88c3271084c8716e86031a43169789adf4d7a6c683cd2d45eb922/68747470733a2f2f75706c6f61642d696d616765732e6a69616e7368752e696f2f75706c6f61645f696d616765732f383535323230312d313533356338373536643134623434372e706e673f696d6167654d6f6772322f6175746f2d6f7269656e742f7374726970253743696d61676556696577322f322f772f31323430>

3. Operator
Operator

After the data is read from the storage layer, it is transformed from RowCursor 
to Vectorized Column structures, i.e., from row storage to column storage. 
Starting from VOlapScanNode, the data is organized in the form of columns.

[image.png]<https://camo.githubusercontent.com/5ae60da39b57f312f2595458f039172fba5766a2a9f5aa4d57d9379b52ab2ba3/68747470733a2f2f75706c6f61642d696d616765732e6a69616e7368752e696f2f75706c6f61645f696d616765732f383535323230312d643930636662666361616533323432362e706e673f696d6167654d6f6772322f6175746f2d6f7269656e742f7374726970253743696d61676556696577322f322f772f31323430>

4. Interface

Later we will gradually implement some basic vectorized interfaces and submit 
them to the community.


Thanks,
Li HaoPeng

[Proposal] Vectorization Execution Engine optimization for Doris

Reply via email to