wangbo opened a new issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788


   **Why Do So**
   We observe that Doris's query performance is significantly slower than Kylin's when the query contains bitmap computation.
   The reason is that even when a query hits a rollup, Doris still needs to do additional data scanning and computation.
   A Doris rollup uses the base table's distribution key, so the same rollup aggregation key can appear in more than one bucket and the rollup's bucket data may still overlap.
   Bitmap/HLL columns are greatly affected by this situation.
   
   **Solution**
   Use the rollup's aggregation key as the rollup's bucket key so that the data is truly pre-aggregated.
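
To illustrate the difference, here is a minimal Python sketch; it is not Doris code. It assumes a hypothetical table with columns `user_id` (the base table's distribution key) and `city` (the rollup's aggregation key), uses `crc32` as a stand-in for the real bucketing hash, and uses Python sets as a stand-in for bitmap columns. It counts how many pre-aggregated rows the rollup ends up storing under each bucketing choice.

```python
from collections import defaultdict
from zlib import crc32

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Deterministic stand-in for the real distribution hash (an assumption).
    return crc32(repr(key).encode()) % num_buckets


def build_rollup(rows, bucket_key):
    """Aggregate rows inside each bucket: {bucket: {city: set_of_user_ids}}."""
    buckets = defaultdict(lambda: defaultdict(set))
    for user_id, city in rows:
        b = bucket_of(bucket_key(user_id, city))
        buckets[b][city].add(user_id)  # set union stands in for a bitmap union
    return buckets


def stored_rows(buckets):
    # One rollup row per (bucket, city) pair.
    return sum(len(per_city) for per_city in buckets.values())


# 1000 hypothetical users spread over 5 cities.
rows = [(user_id, f"city_{user_id % 5}") for user_id in range(1000)]

by_base_key = build_rollup(rows, lambda user_id, city: user_id)
by_agg_key = build_rollup(rows, lambda user_id, city: city)

print("rows stored, bucketed by base key:", stored_rows(by_base_key))
print("rows stored, bucketed by agg-key: ", stored_rows(by_agg_key))
```

With agg-key bucketing each `city` lives in exactly one bucket, so the rollup stores one row per city. With base-key bucketing the same `city` can appear in up to `NUM_BUCKETS` buckets, and each copy carries its own partial bitmap.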
   
   **POC**
   env:
   * 1 FE, 3 BE
   * data:
        * v2 storage format, one replica
        * six bitmap columns, each with a cardinality of about 50,000,000
   
   Test result
   * test SQL: the query is fully served by a rollup that contains six bitmap columns
   * case 1: data has just been loaded into the Doris BEs and compaction has not completed
        * rollup uses base key as bucket key:
                * query time: 14s
        * rollup uses agg-key as bucket key:
                * query time: 6s
   * case 2: data compaction has completed
        * rollup uses base key as bucket key:
                * first query time (without OS cache and BE's page cache): 1.2s
                * second query time (hits BE's page cache): 1.0s
                * scan bytes: 241M
                * scan rows: 1104
                * return rows: 1104
        * rollup uses agg-key as bucket key:
                * first query time (without OS cache and BE's page cache): 1.2s
                * second query time (hits BE's page cache): 1.0s
                * scan bytes: 662M
                * scan rows: 10079
                * return rows: 1104
   * case 3: data consistency
        * the query result is the same whether the rollup uses its own agg-key or the base table's distribution key as the bucket key
   
   So we can see that, before compaction completes, using the rollup's agg-key as the bucket key is roughly two to three times faster than using the base table's distribution key (6s vs. 14s in case 1).
   And because the rollup is truly pre-aggregated, the amount of data scanned and the computation are reduced.
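
The query-side effect can be sketched as follows. This is a simplified Python model, not Doris's execution engine: the bucket contents are hand-made, the keys `bj` and `sh` are hypothetical, and Python sets stand in for bitmaps. It merges the per-bucket partial results into a distinct count per key and reports how many stored rows had to be read and merged.

```python
def count_distinct_per_key(buckets):
    """Merge per-bucket partial sets; return (distinct counts, rows scanned)."""
    merged = {}
    scanned_rows = 0
    for per_key in buckets:
        for key, users in per_key.items():
            scanned_rows += 1  # every stored (bucket, key) row must be read
            merged.setdefault(key, set()).update(users)  # stand-in for bitmap union
    result = {key: len(users) for key, users in merged.items()}
    return result, scanned_rows


# Same logical data in two layouts (hand-made for illustration).
overlapping = [  # bucketed by the base key: each key shows up in every bucket
    {"bj": {1, 2}, "sh": {3}},
    {"bj": {4}, "sh": {5, 6}},
    {"bj": {7}, "sh": {8}},
]
disjoint = [  # bucketed by the agg-key: each key lives in exactly one bucket
    {"bj": {1, 2, 4, 7}},
    {"sh": {3, 5, 6, 8}},
    {},
]

result_overlapping, rows_overlapping = count_distinct_per_key(overlapping)
result_disjoint, rows_disjoint = count_distinct_per_key(disjoint)

print(result_overlapping == result_disjoint)  # True
print(rows_overlapping, rows_disjoint)        # 6 2
```

Both layouts return the same answer, which matches case 3, but the disjoint layout reads one stored row per key and needs no cross-bucket merge for any key.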
   
   **Future Work**
   * Stage 1: Make this feature available in production env quickly
   Make it a configurable property of an OLAP table, for users who want to use the rollup's agg-key as the bucket key in stream load/spark load.
   They could even set the rollup's bucket num.
   
   * Stage 2: Support Schema Change
   I prefer to support schema change for this feature in a Spark job.
   The reasons are as follows:
   ```Read Write Separation``` is a necessary feature for Doris.
   This feature needs to shuffle data when doing a schema change, which would have a great impact on the stability of Doris, especially for a big table.
   I don't think Doris is good at, or needs to be good at, long-running shuffles.
   So Spark is the best choice.
   
   * Stage 3: Support Colocate Join
       Rollups with different agg-keys need a shuffle join at query time.
   
   * Stage 4: Support Materialized view
       Needs further research.
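
The Stage 3 concern about colocate join can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the tables `facts` and `users`, their columns, the bucket count, and `crc32` as a stand-in for the real bucketing hash. It checks whether rows that join on `user_id` always land in the same bucket index on both sides, which is the precondition for joining without a shuffle.

```python
from zlib import crc32

NUM_BUCKETS = 4  # hypothetical; both sides must use the same bucket count


def bucket_of(key):
    # Stand-in for the real bucketing hash (an assumption of this sketch).
    return crc32(repr(key).encode()) % NUM_BUCKETS


def is_colocated(left_rows, right_rows, left_bucket_key, right_bucket_key, join_col):
    """True if rows sharing a join value sit in the same bucket on both sides."""
    right_buckets = {}
    for row in right_rows:
        bucket = bucket_of(right_bucket_key(row))
        right_buckets.setdefault(row[join_col], set()).add(bucket)
    for row in left_rows:
        left_bucket = bucket_of(left_bucket_key(row))
        if any(b != left_bucket for b in right_buckets.get(row[join_col], ())):
            return False
    return True


facts = [{"user_id": u, "city": f"city_{u % 5}"} for u in range(100)]
users = [{"user_id": u, "name": f"user_{u}"} for u in range(100)]


def by_user(row):
    return row["user_id"]


def by_agg_key(row):
    # A rollup whose agg-key is (user_id, city), bucketed by that agg-key.
    return (row["user_id"], row["city"])


print(is_colocated(facts, users, by_user, by_user, "user_id"))
print(is_colocated(facts, users, by_agg_key, by_user, "user_id"))
```

When both sides are bucketed by the join column the check passes by construction. Once the rollup is bucketed by its own agg-key, matching rows generally fall into different buckets, so the join would need a shuffle.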




