This is an automated email from the ASF dual-hosted git repository. luzhijing pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 2b343263034 [doc] update runtime-filter.md (#848) 2b343263034 is described below commit 2b3432630345d7e54c3b0cb743f36724e4dde46b Author: Pxl <pxl...@qq.com> AuthorDate: Wed Jul 17 09:54:25 2024 +0800 [doc] update runtime-filter.md (#848) update runtime-filter.md --- docs/query/join-optimization/runtime-filter.md | 204 +++++++++++---------- .../query/join-optimization/runtime-filter.md | 204 ++++++++++----------- .../query/join-optimization/runtime-filter.md | 204 ++++++++++----------- .../query/join-optimization/runtime-filter.md | 204 +++++++++++---------- 4 files changed, 394 insertions(+), 422 deletions(-) diff --git a/docs/query/join-optimization/runtime-filter.md b/docs/query/join-optimization/runtime-filter.md index 8cf514fbe68..2d46d03af2f 100644 --- a/docs/query/join-optimization/runtime-filter.md +++ b/docs/query/join-optimization/runtime-filter.md @@ -26,15 +26,14 @@ under the License. # Runtime Filter -Runtime Filter is designed to dynamically generate filter conditions for certain Join queries at runtime to reduce the amount of scanned data, avoid unnecessary I/O and network transmission, and speed up the query. - +Runtime Filter is designed to dynamically generate filter conditions for certain Join queries at runtime to reduce the amount of data scanned, avoid unnecessary I/O and calculations, and thereby speed up queries. ## Noun Interpretation * Left table: the table on the left during Join query. Perform Probe operation. The order can be adjusted by Join Reorder. * Right table: the table on the right during Join query. Perform the Build operation. The order can be adjusted by Join Reorder. * Fragment: FE will convert the execution of specific SQL statements into corresponding fragments and send them to BE for execution. The corresponding Fragment is executed on the BE, and the results are aggregated and returned to the FE. * Join on clause: `Aa=Bb` in `A join B on Aa=Bb`, based on this to generate join conjuncts during query planning, including expr used by join Build and Probe, where Build expr is called in Runtime Filter src expr, Probe expr are called target expr in Runtime Filter. - +- rf: Abbreviation of Runtime Filter. ## Principle Runtime Filter is generated during query planning, constructed in HashJoinNode, and applied in ScanNode. @@ -75,35 +74,35 @@ If the filter condition (Runtime Filter) can be pushed down to the storage engin | T1 T2 | ``` -It can be seen that, unlike predicate push-down and partition cutting, Runtime Filter is a filter condition dynamically generated at runtime, that is, when the query is run, the join on clause is parsed to determine the filter expression, and the expression is broadcast to ScanNode that is reading the left table , Thereby reducing the amount of scanned data, thereby reducing the number of probe hash table, avoiding unnecessary I/O and network transmission. +It can be seen that, unlike predicate pushdown and partition pruning, Runtime Filter is a filter condition dynamically generated at runtime, that is the join on clause is parsed to determine the filter expression when the query is running, and the expression is broadcast to the ScanNode that is reading the left table , thereby reducing the amount of data scanned, thereby reducing the number of probe hash tables and avoiding unnecessary I/O and calculations. -Runtime Filter is mainly used to optimize joins for large tables. If the amount of data in the left table is too small, or the amount of data in the right table is too large, the Runtime Filter may not achieve the expected effect. +Runtime Filter is mainly used to optimize the join of large tables and small tables. If the amount of data in the left table is too small, the effect of rf's early filtering may not be great. If the amount of data in the right table is too large, there will be a relatively large cost when building and transmitting rf. ## Usage ### Runtime Filter query options -For query options related to Runtime Filter, please refer to the following sections: +The default configuration has been adapted to most scenarios as much as possible. Only in some specific scenarios, further adjustments are required to achieve the best results. Usually, optimization is only performed for resource-intensive queries that take a long enough time to run and are frequent enough after performance testing. -- The first query option is to adjust the type of Runtime Filter used. In most cases, you only need to adjust this option, and keep the other options as default. +For configuration options related to Runtime Filter, please refer to the following section: - - `runtime_filter_type`: Including Bloom Filter, MinMax Filter, IN predicate, IN_OR_BLOOM Filter and Bitmap_Filter. By default, only IN_OR_BLOOM Filter will be used. In some cases, the performance will be higher when both Bloom Filter, MinMax Filter and IN predicate are used at the same time. +- `enable_sync_runtime_filter_size`: When the optimizer cannot accurately estimate the cardinality, the executor is required to synchronize and obtain the global Build end size before generating rf, and determine the final type of IN Or Bloom Filter and the size of Bloom Filter based on this actual size. If set to false, no synchronization operation is performed to obtain the global size. The default value of this variable is true. -- Other query options usually only need to be further adjusted in certain specific scenarios to achieve the best results. Usually only after performance testing, optimize for resource-intensive, long enough running time and high enough frequency queries. +- `runtime_filter_max_in_num`: If the Build-side size is larger than this value, we will not generate IN predicate. The default value of this variable is 1024. - - `runtime_filter_mode`: Used to adjust the push-down strategy of Runtime Filter, including three strategies of OFF, LOCAL, and GLOBAL. The default setting is the GLOBAL strategy +- `runtime_filter_mode`: Used to adjust the generation strategy of rf, including OFF, LOCAL, and GLOBAL. If set to OFF, rf will not be generated. The default value of this variable is GLOBAL. - - `runtime_filter_wait_time_ms`: the time that ScanNode in the left table waits for each Runtime Filter, the default is 1000ms +- `runtime_filter_type`: The types of rf allowed to be generated, including Bloom Filter, MinMax Filter, IN predicate, IN Or Bloom Filter, and Bitmap Filter. The default value of this variable is IN_OR_BLOOM_FILTER,MIN_MAX. - - `runtime_filters_max_num`: The maximum number of Bloom Filters in the Runtime Filter that can be applied to each query, the default is 10 +- `runtime_filter_wait_infinitely`: If set to true, the scan node of the left table will wait until rf is received or the query times out, which is equivalent to runtime_filter_wait_time_ms being set to infinity. The default value of this variable is false. - - `runtime_bloom_filter_min_size`: the minimum length of Bloom Filter in Runtime Filter, default 1048576 (1M) +- `runtime_filter_wait_time_ms`: The time the ScanNode of the left table waits for rf. If the waiting time has passed and no rf is received, the ScanNode will start scanning the data first, and the rf received later will take effect on the data that the ScanNode has not returned at this moment. The default value of this variable is 1000. - - `runtime_bloom_filter_max_size`: the maximum length of Bloom Filter in Runtime Filter, the default is 16777216 (16M) +- `runtime_bloom_filter_min_size`: The minimum length of the Bloom Filter in the rf estimated by the optimizer. The default value of this variable is 1048576 (1M). - - `runtime_bloom_filter_size`: The default length of Bloom Filter in Runtime Filter, the default is 2097152 (2M) +- `runtime_bloom_filter_max_size`: The maximum length of the Bloom Filter in the rf estimated by the optimizer. The default value of this variable is 16777216 (16M). - - `runtime_filter_max_in_num`: If the number of rows in the right table of the join is greater than this value, we will not generate an IN predicate, the default is 1024 +- `runtime_bloom_filter_size`: The default length of the Bloom Filter in the rf estimated by the optimizer. The default value of this variable is 2097152 (2M). The query options are further explained below. @@ -121,23 +120,28 @@ set runtime_filter_type=7; **Precautions for use** -- **IN or Bloom Filter**: According to the actual number of rows in the right table during execution, the system automatically determines whether to use IN predicate or Bloom Filter. - - By default, IN Predicate will be used when the number of data rows in the right table is less than 102400 (which can be adjusted by ` runtime_filter_max_in_num 'in the session variable). Otherwise, use bloom filter. -- **Bloom Filter**: There is a certain misjudgment rate, which results in the filtered data being a little less than expected, but it will not cause the final result to be inaccurate. In most cases, Bloom Filter can improve performance or has no significant impact on performance, but in some cases Under circumstances will cause performance degradation. - - Bloom Filter construction and application overhead is high, so when the filtering rate is low, or the amount of data in the left table is small, Bloom Filter may cause performance degradation. - - At present, only the Key column of the left table can be pushed down to the storage engine if the Bloom Filter is applied, and the test results show that the performance of the Bloom Filter is often reduced when the Bloom Filter is not pushed down to the storage engine. - - Currently Bloom Filter only has short-circuit logic when using expression filtering on ScanNode, that is, when the false positive rate is too high, the Bloom Filter will not continue to be used, but there is no short-circuit logic when the Bloom Filter is pushed down to the storage engine , So when the filtration rate is low, it may cause performance degradation. +- **IN or Bloom Filter**: Based on the actual number of rows in the right table during execution, the system automatically determines whether to use IN predicate or Bloom Filter. + + - By default, IN predicate will be used when the number of data rows in the right table is less than runtime_filter_max_in_num, otherwise Bloom filter will be used. + +- **Bloom Filter**: There is a certain misjudgment rate, resulting in a little less filtered data than expected, but it will not cause the final result to be inaccurate. In most cases, Bloom Filter can improve performance or have no significant impact on performance. Impact, but may result in reduced performance in some cases. + + - Bloom Filter construction and application overhead is high, so when the filtering rate is low, or when the amount of data in the left table is small, Bloom Filter may cause performance degradation. + - If the Bloom Filter is too large, it may take longer to build/transmit/filter. + + +- **MinMax Filter**: Contains the maximum value and the minimum value, thereby filtering data smaller than the minimum value and larger than the maximum value. The filtering effect of MinMax Filter is related to the type of the Key column in the join on clause and the data distribution of the left and right tables. + + - When the type of the Key column in the join on clause is int/bigint/double, etc., in extreme cases, if the maximum and minimum values of the left and right tables are the same, there will be no effect. Otherwise, the maximum value of the right table is smaller than the minimum value of the left table, or the right table is the smallest. If the value is greater than the maximum value in the left table, the effect will be best. + + - When the type of the Key column in the join on clause is varchar, etc., applying MinMax Filter will often lead to performance degradation. -- **MinMax Filter**: Contains the maximum value and the minimum value, thereby filtering data smaller than the minimum value and greater than the maximum value. The filtering effect of the MinMax Filter is related to the type of the Key column in the join on clause and the data distribution of the left and right tables. - - When the type of the Key column in the join on clause is int/bigint/double, etc., in extreme cases, if the maximum and minimum values of the left and right tables are the same, there is no effect, otherwise the maximum value of the right table is less than the minimum value of the left table, or the minimum of the right table The value is greater than the maximum value in the left table, the effect is best. - - When the type of the Key column in the join on clause is varchar, etc., applying the MinMax Filter will often cause performance degradation. +- **IN predicate**: Construct an IN predicate based on all the values of the Key column in the join on clause on the right table, and use the constructed IN predicate to filter on the left table. Compared with Bloom Filter, the construction and application overhead is lower. Performance is often higher when the amount of data in the right table is smaller. -- **IN predicate**: Construct IN predicate based on all the values of Key listed in the join on clause on the right table, and use the constructed IN predicate to filter on the left table. Compared with Bloom Filter, the cost of construction and application is lower. The amount of data in the right table is lower. When it is less, it tends to perform better. - - Currently IN predicate already implement a merge method. - - When IN predicate and other filters are specified at the same time, and the filtering value of IN predicate does not reach runtime_filter_max_in_num will try to remove other filters. The reason is that IN predicate is an accurate filtering condition. Even if there is no other filter, it can filter efficiently. If it is used at the same time, other filters will do useless work. Currently, only when the producer and consumer of the runtime filter are in the same fragment can there be [...] + - When In predicate and other filters are specified at the same time, and the filter value of in does not reach runtime_filter_max_in_num, other filters will be tried to be removed. The reason is that In predicate is a precise filtering condition, which can filter efficiently even without other filters. If used at the same time, other filters will do useless work. - **Bitmap Filter**: - - Currently, the bitmap filter is used only when the subquery in the [in subquery](../../sql-manual/sql-statements/Operators/in) operation returns a bitmap column. + - Bitmap filter is currently only used when the subquery in the [in subquery](../../sql-manual/sql-statements/Operators/in) operation returns a bitmap column. #### 2.runtime_filter_mode Used to control the transmission range of Runtime Filter between instances. @@ -163,30 +167,15 @@ Waiting for Runtime Filter is time consuming. **Precautions for use** -After the Runtime Filter is turned on, the ScanNode in the table on the left will wait for a period of time for each Runtime Filter assigned to itself before scanning the data, that is, if the ScanNode is assigned 3 Runtime Filters, it will wait at most 3000ms. +After the Runtime Filter is turned on, the ScanNode of the left table will wait for a while for the Runtime Filter assigned to it before scanning the data. Because it takes time to build and merge the Runtime Filter, ScanNode will try to push down the Runtime Filter that arrives within the waiting time to the storage engine. If the waiting time is exceeded, ScanNode will directly start scanning data using the Runtime Filter that has arrived. -If the Runtime Filter arrives after ScanNode starts scanning, ScanNode will not push the Runtime Filter down to the storage engine. Instead, it will use expression filtering on ScanNode based on the Runtime Filter for the data that has been scanned from the storage engine. The scanned data will not apply the Runtime Filter, so the intermediate data size obtained will be larger than the optimal solution, but serious cracking can be avoided. +If the Runtime Filter arrives after the ScanNode starts scanning, the ScanNode will not push the Runtime Filter down to the storage engine. Instead, the ScanNode will use an expression to filter the data that has been scanned from the storage engine based on the Runtime Filter. The Runtime Filter will not be applied to the data that has been scanned before. The size of the intermediate data obtained in this way will be larger than the optimal solution, but serious degradation can be avoided. If the cluster is busy and there are many resource-intensive or long-time-consuming queries on the cluster, consider increasing the waiting time to avoid missing optimization opportunities for complex queries. If the cluster load is light, and there are many small queries on the cluster that only take a few seconds, you can consider reducing the waiting time to avoid an increase of 1s for each query. -#### 4.runtime_filters_max_num -The upper limit of the number of Bloom Filters in the Runtime Filter generated by each query. - -**Type**: integer, default 10 - -**Precautions for use** -Currently, only the number of Bloom Filters is limited, because the construction and application of Bloom Filters are more expensive than MinMax Filter and IN predicate. - -If the number of Bloom Filters generated exceeds the maximum allowable number, then the Bloom Filter with a large selectivity is retained. A large selectivity means that more rows are expected to be filtered. This setting can prevent Bloom Filter from consuming too much memory overhead and causing potential problems. -``` -Selectivity = (HashJoinNode Cardinality / HashJoinNode left child Cardinality) -- Because the cardinality of FE is currently inaccurate, the selectivity of Bloom Filter calculation here is inaccurate, so in the end, it may only randomly reserve part of Bloom Filter. -``` -This query option needs to be adjusted only when tuning some long-consuming queries involving joins between large tables. - -#### 5. Bloom Filter length related parameters +#### 4. Bloom Filter length related parameters Including `runtime_bloom_filter_min_size`, `runtime_bloom_filter_max_size`, `runtime_bloom_filter_size`, used to determine the size (in bytes) of the Bloom Filter data structure used by the Runtime Filter. **Type**: Integer @@ -194,7 +183,8 @@ Including `runtime_bloom_filter_min_size`, `runtime_bloom_filter_max_size`, `run **Precautions for use** Because it is necessary to ensure that the length of the Bloom Filter constructed by each HashJoinNode is the same to be merged, the length of the Bloom Filter is currently calculated in the FE query planning. -If you can get the number of data rows (Cardinality) in the statistical information of the join right table, it will try to estimate the optimal size of the Bloom Filter based on Cardinality, and round to the nearest power of 2 (log value with the base 2). If the Cardinality of the table on the right cannot be obtained, the default Bloom Filter length `runtime_bloom_filter_size` will be used. `runtime_bloom_filter_min_size` and `runtime_bloom_filter_max_size` are used to limit the minimu [...] +If the number of data rows (Cardinality) in the statistics of the right table of the join can be obtained, the optimal size of the Bloom Filter will be estimated based on the Cardinality and rounded to the nearest power of 2 (log value with base 2). If there is no accurate statistics, but enable_sync_runtime_filter_size is turned on, the optimal size of the Bloom Filter will be estimated based on the actual number of data rows at runtime, but there will be some performance overhead cause [...] +Finally, if the Cardinality of the right table is still not available, the default Bloom Filter length `runtime_bloom_filter_size` will be used. `runtime_bloom_filter_min_size` and `runtime_bloom_filter_max_size` are used to limit the minimum and maximum lengths of the Bloom Filter that are ultimately used. Larger Bloom Filters are more effective when processing high-cardinality input sets, but require more memory. If the query needs to filter high cardinality columns (for example, containing millions of different values), you can consider increasing the value of `runtime_bloom_filter_size` for some benchmark tests, which will help make the Bloom Filter filter more accurate, so as to obtain the expected Performance improvement. @@ -215,39 +205,64 @@ CREATE TABLE test2 (t2 INT) DISTRIBUTED BY HASH (t2) BUCKETS 2 PROPERTIES("repli INSERT INTO test2 VALUES (3), (4), (5); EXPLAIN SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2; -+-------------------------------------------------------------------+ -| Explain String | -+-------------------------------------------------------------------+ -| PLAN FRAGMENT 0 | -| OUTPUT EXPRS:`t1` | -| | -| 4:EXCHANGE | -| | -| PLAN FRAGMENT 1 | -| OUTPUT EXPRS: | -| PARTITION: HASH_PARTITIONED: `default_cluster:ssb`.`test`.`t1` | -| | -| 2:HASH JOIN | -| | join op: INNER JOIN (BUCKET_SHUFFLE) | -| | equal join conjunct: `test`.`t1` = `test2`.`t2` | -| | runtime filters: RF000[in] <- `test2`.`t2` | -| | | -| |----3:EXCHANGE | -| | | -| 0:OlapScanNode | -| TABLE: test | -| runtime filters: RF000[in] -> `test`.`t1` | -| | -| PLAN FRAGMENT 2 | -| OUTPUT EXPRS: | -| PARTITION: HASH_PARTITIONED: `default_cluster:ssb`.`test2`.`t2` | -| | -| 1:OlapScanNode | -| TABLE: test2 | -+-------------------------------------------------------------------+ --- The line of `runtime filters` above shows that `2:HASH JOIN` of `PLAN FRAGMENT 1` generates IN predicate with ID RF000, --- Among them, the key values of `test2`.`t2` are only known at runtime, --- This IN predicate is used in `0:OlapScanNode` to filter unnecessary data when reading `test`.`t1`. ++--------------------------------------------------------------------------------------------------+ +| Explain String(Nereids Planner) | ++--------------------------------------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| OUTPUT EXPRS: | +| t1[#4] | +| PARTITION: HASH_PARTITIONED: t1[#1] | +| | +| HAS_COLO_PLAN_NODE: false | +| | +| VRESULT SINK | +| MYSQL_PROTOCAL | +| | +| 3:VHASH JOIN(157) | +| | join op: INNER JOIN(BUCKET_SHUFFLE)[] | +| | equal join conjunct: (t1[#1] = t2[#0]) | +| | runtime filters: RF000[min_max] <- t2[#0](3/4/2048), RF001[in_or_bloom] <- t2[#0](3/4/2048) | +| | cardinality=3 | +| | vec output tuple id: 3 | +| | output tuple id: 3 | +| | vIntermediate tuple ids: 2 | +| | hash output slot ids: 1 | +| | final projections: t1[#2] | +| | final project output tuple id: 3 | +| | distribute expr lists: t1[#1] | +| | distribute expr lists: t2[#0] | +| | | +| |----1:VEXCHANGE | +| | offset: 0 | +| | distribute expr lists: t2[#0] | +| | | +| 2:VOlapScanNode(150) | +| TABLE: test.test(test), PREAGGREGATION: ON | +| runtime filters: RF000[min_max] -> t1[#1], RF001[in_or_bloom] -> t1[#1] | +| partitions=1/1 (test) | +| tablets=2/2, tabletList=61032,61034 | +| cardinality=4, avgRowSize=0.0, numNodes=1 | +| pushAggOp=NONE | +| | +| PLAN FRAGMENT 1 | +| | +| PARTITION: HASH_PARTITIONED: t2[#0] | +| | +| HAS_COLO_PLAN_NODE: false | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 01 | +| BUCKET_SHFFULE_HASH_PARTITIONED: t2[#0] | +| | +| 0:VOlapScanNode(151) | +| TABLE: test.test2(test2), PREAGGREGATION: ON | +| partitions=1/1 (test2) | +| tablets=2/2, tabletList=61041,61043 | +| cardinality=3, avgRowSize=0.0, numNodes=1 | +| pushAggOp=NONE | ++--------------------------------------------------------------------------------------------------+ +-- The line of `runtime filters` above shows that `2:HASH JOIN` of `PLAN FRAGMENT 1` generates min_max with ID RF000 and in_or_bloom with ID RF001, +-- RF000/RF001 are used in `2:VOlapScanNode(150)` to filter unnecessary data when reading `test`.`t1`. SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2; -- Return 2 rows of results [3, 4]; @@ -255,28 +270,17 @@ SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2; -- Through the query profile (set enable_profile=true;) you can view the detailed information of the internal work of the query, -- Including whether each Runtime Filter is pushed down, waiting time, -- and the total time from prepare to receiving Runtime Filter for OLAP_SCAN_NODE. -RuntimeFilter:in: - - HasPushDownToEngine: true - - AWaitTimeCost: 0ns - - EffectTimeCost: 2.76ms +RuntimeFilter: (id = 1, type = in_or_bloomfilter): + - Info: [IsPushDown = true, RuntimeFilterState = READY, HasRemoteTarget = false, HasLocalTarget = true, Ignored = false] + - RealRuntimeFilterType: in + - InFilterSize: 3 + - always_true: 0 + - expr_filtered_rows: 0 + - expr_input_rows: 0 +-- expr_input_rows and expr_filtered_rows are both 0 because in filter directly filters the data in advance according to the key range without calculating it row by row. -- In addition, in the OLAP_SCAN_NODE of the profile, you can also view the filtering effect -- and time consumption after the Runtime Filter is pushed down. - RowsVectorPredFiltered: 9.320008M (9320008) - VectorPredEvalTime: 364.39ms ``` - -## Runtime Filter planning rules -1. Only support the generation of Runtime Filter for the equivalent conditions in the join on clause, excluding the Null-safe condition, because it may filter out the null value of the join left table. -2. Does not support pushing down Runtime Filter to the left table of left outer, full outer, and anti join; -3. Does not support src expr or target expr is constant; -4. The equality of src expr and target expr is not supported; -5. The type of src expr is not supported to be equal to `HLL` or `BITMAP`; -6. Currently only supports pushing down Runtime Filter to OlapScanNode; -7. Target expr does not support NULL-checking expressions, such as `COALESCE/IFNULL/CASE`, because when the join on clause of other joins at the upper level of the outer join contains NULL-checking expressions and a Runtime Filter is generated, this Runtime Filter is downloaded Pushing to the left table of outer join may cause incorrect results; -8. The column (slot) in target expr is not supported, and an equivalent column cannot be found in the original table; -9. Column conduction is not supported. This includes two cases: - - First, when the join on clause contains A.k = B.k and B.k = C.k, currently C.k can only be pushed down to B.k, but not to A.k; - - Second, for example, the join on clause contains Aa + Bb = Cc. If Aa can be transmitted to Ba, that is, Aa and Ba are equivalent columns, then you can replace Aa with Ba, and then you can try to push the Runtime Filter down to B ( If Aa and Ba are not equivalent columns, they cannot be pushed down to B, because target expr must be bound to the only join left table); -10. The types of Target expr and src expr must be equal, because Bloom Filter is based on hash, if the types are not equal, it will try to convert the type of target expr to the type of src expr; -11. The Runtime Filter generated by `PlanNode.Conjuncts` is not supported. Unlike HashJoinNode's `eqJoinConjuncts` and `otherJoinConjuncts`, the Runtime Filter generated by `PlanNode.Conjuncts` found in the test that it may cause incorrect results, such as ` When an IN` subquery is converted to a join, the automatically generated join on clause will be stored in `PlanNode.Conjuncts`. At this time, applying Runtime Filter may result in missing some rows in the result. diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query/join-optimization/runtime-filter.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query/join-optimization/runtime-filter.md index 113be7f26b4..6c0b6c184a7 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query/join-optimization/runtime-filter.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query/join-optimization/runtime-filter.md @@ -26,7 +26,7 @@ under the License. # Runtime Filter -Runtime Filter 旨在为某些 Join 查询在运行时动态生成过滤条件,来减少扫描的数据量,避免不必要的 I/O 和网络传输,从而加速查询。 +Runtime Filter 旨在为某些 Join 查询在运行时动态生成过滤条件,来减少扫描的数据量,避免不必要的 I/O 和计算,从而加速查询。 ## 名词解释 @@ -38,6 +38,8 @@ Runtime Filter 旨在为某些 Join 查询在运行时动态生成过滤条件 - Join on clause: `A join B on A.a=B.b`中的`A.a=B.b`,在查询规划时基于此生成 join conjuncts,包含 join Build 和 Probe 使用的 expr,其中 Build expr 在 Runtime Filter 中称为 src expr,Probe expr 在 Runtime Filter 中称为 target expr。 +- rf: Runtime Filter的缩写。 + ## 原理 Runtime Filter 在查询规划时生成,在 HashJoinNode 中构建,在 ScanNode 中应用。 @@ -84,37 +86,35 @@ Runtime Filter 在查询规划时生成,在 HashJoinNode 中构建,在 ScanN | ``` -可见,和谓词下推、分区裁剪不同,Runtime Filter 是在运行时动态生成的过滤条件,即在查询运行时解析 join on clause 确定过滤表达式,并将表达式广播给正在读取左表的 ScanNode,从而减少扫描的数据量,进而减少 probe hash table 的次数,避免不必要的 I/O 和网络传输。 +可见,和谓词下推、分区裁剪不同,Runtime Filter 是在运行时动态生成的过滤条件,即在查询运行时解析 join on clause 确定过滤表达式,并将表达式广播给正在读取左表的 ScanNode,从而减少扫描的数据量,进而减少 probe hash table 的次数,避免不必要的 I/O 和计算。 -Runtime Filter 主要用于大表 join 小表的优化,如果左表的数据量太小,或者右表的数据量太大,则 Runtime Filter 可能不会取得预期效果。 +Runtime Filter 主要用于大表 join 小表的优化。如果左表的数据量太小,rf的提前过滤效果可能不大。如果右表的数据量太大,则在构建和传输rf时会有比较大的成本。 ## 使用方式 -### Runtime Filter 查询选项 +### Runtime Filter 配置项 -与 Runtime Filter 相关的查询选项信息,请参阅以下部分: +默认的配置已经尽可能的适配了大多数场景。仅在某些特定场景下,才需进一步调整以达到最优效果。通常只在性能测试后,针对资源密集型、运行耗时足够长且频率足够高的查询进行优化。 +与 Runtime Filter 相关的配置选项,请参阅以下部分: + - `enable_sync_runtime_filter_size`: 在优化器无法准确估计基数时,令执行器在生成rf之前同步并获取全局的Build端大小总和,根据这个实际大小来决定 IN Or Bloom Filter 的最终类型和 Bloom Filter 的大小。如果设置为 false 则不做同步操作获取全局大小,该变量默认值为 true 。 -- 第一个查询选项是调整使用的 Runtime Filter 类型,大多数情况下,您只需要调整这一个选项,其他选项保持默认即可。 - - - `runtime_filter_type`: 包括 Bloom Filter、MinMax Filter、IN predicate、IN Or Bloom Filter、Bitmap Filter,默认会使用 IN Or Bloom Filter,部分情况下同时使用 Bloom Filter、MinMax Filter、IN predicate 时性能更高。 + - `runtime_filter_max_in_num`: 如果Build端大小大于这个值,我们将不生成 IN predicate。该变量默认值为 1024 。 -- 其他查询选项通常仅在某些特定场景下,才需进一步调整以达到最优效果。通常只在性能测试后,针对资源密集型、运行耗时足够长且频率足够高的查询进行优化。 + - `runtime_filter_mode`: 用于调整 rf 的生成策略,包括 OFF、LOCAL、GLOBAL 三种策略。如果设置为 OFF 则不会生成rf。该变量默认值为 GLOBAL 。 - - `runtime_filter_mode`: 用于调整 Runtime Filter 的下推策略,包括 OFF、LOCAL、GLOBAL 三种策略,默认设置为 GLOBAL 策略 + - `runtime_filter_type`: 允许生成的rf类型,包括 Bloom Filter、MinMax Filter、IN predicate、IN Or Bloom Filter、Bitmap Filter。该变量默认值为 IN_OR_BLOOM_FILTER,MIN_MAX 。 - - `runtime_filter_wait_time_ms`: 左表的 ScanNode 等待每个 Runtime Filter 的时间,默认 1000ms + - `runtime_filter_wait_infinitely`: 如果设置为 true,那么左表的 scan 节点将会一直等待直到接收到 rf 或者查询超超时,相当于 runtime_filter_wait_time_ms 被设置为无限大。该变量默认值为 false 。 - - `runtime_filters_max_num`: 每个查询可应用的 Runtime Filter 中 Bloom Filter 的最大数量,默认 10 + - `runtime_filter_wait_time_ms`: 左表的 ScanNode 等待 rf 的时间。如果超过了等待时间仍然没有收到 rf,则 ScanNode 会先开始扫描数据,后续收到的rf会对此时刻该 ScanNode 还没有返回的数据生效。该变量默认值为 1000 。 - - `runtime_bloom_filter_min_size`: Runtime Filter 中 Bloom Filter 的最小长度,默认 1048576(1M) + - `runtime_bloom_filter_min_size`: 优化器预估的 rf 中 Bloom Filter 的最小长度,该变量默认值为 1048576(1M)。 - - `runtime_bloom_filter_max_size`: Runtime Filter 中 Bloom Filter 的最大长度,默认 16777216(16M) + - `runtime_bloom_filter_max_size`: 优化器预估的 rf 中 Bloom Filter 的最大长度,该变量默认值为 16777216(16M)。 - - `runtime_bloom_filter_size`: Runtime Filter 中 Bloom Filter 的默认长度,默认 2097152(2M) + - `runtime_bloom_filter_size`: 优化器预估的 rf 中 Bloom Filter 的默认长度,该变量默认值为 2097152(2M)。 - - `runtime_filter_max_in_num`: 如果 join 右表数据行数大于这个值,我们将不生成 IN predicate,默认 1024 - - `runtime_filter_wait_infinitely`: 如果参数为 true,那么左表的 scan 节点将会一直等待直到接收到 runtime filer 或者查询超超时,默认为 false 下面对查询选项做进一步说明。 @@ -138,15 +138,13 @@ set runtime_filter_type=7; - **IN or Bloom Filter**: 根据右表在执行过程中的真实行数,由系统自动判断使用 IN predicate 还是 Bloom Filter - - 默认在右表数据行数少于 102400 时会使用 IN predicate(可通过 session 变量中的`runtime_filter_max_in_num`调整),否则使用 Bloom filter。 + - 默认在右表数据行数少于 runtime_filter_max_in_num 时会使用 IN predicate,否则使用 Bloom filter。 - **Bloom Filter**: 有一定的误判率,导致过滤的数据比预期少一点,但不会导致最终结果不准确,在大部分情况下 Bloom Filter 都可以提升性能或对性能没有显著影响,但在部分情况下会导致性能降低。 - Bloom Filter 构建和应用的开销较高,所以当过滤率较低时,或者左表数据量较少时,Bloom Filter 可能会导致性能降低。 - - - 目前只有左表的 Key 列应用 Bloom Filter 才能下推到存储引擎,而测试结果显示 Bloom Filter 不下推到存储引擎时往往会导致性能降低。 - - - 目前 Bloom Filter 仅在 ScanNode 上使用表达式过滤时有短路 (short-circuit) 逻辑,即当假阳性率过高时,不继续使用 Bloom Filter,但当 Bloom Filter 下推到存储引擎后没有短路逻辑,所以当过滤率较低时可能导致性能降低。 + - Bloom Filter 过大,可能会导致构建/传输/过滤耗时较大。 + - **MinMax Filter**: 包含最大值和最小值,从而过滤小于最小值和大于最大值的数据,MinMax Filter 的过滤效果与 join on clause 中 Key 列的类型和左右表数据分布有关。 @@ -156,9 +154,7 @@ set runtime_filter_type=7; - **IN predicate**: 根据 join on clause 中 Key 列在右表上的所有值构建 IN predicate,使用构建的 IN predicate 在左表上过滤,相比 Bloom Filter 构建和应用的开销更低,在右表数据量较少时往往性能更高。 - - 目前 IN predicate 已实现合并方法。 - - - 当同时指定 In predicate 和其他 filter,并且 in 的过滤数值没达到 runtime_filter_max_in_num 时,会尝试把其他 filter 去除掉。原因是 In predicate 是精确的过滤条件,即使没有其他 filter 也可以高效过滤,如果同时使用则其他 filter 会做无用功。目前仅在 Runtime filter 的生产者和消费者处于同一个 fragment 时才会有去除非 in filter 的逻辑。 + - 当同时指定 In predicate 和其他 filter,并且 in 的过滤数值没达到 runtime_filter_max_in_num 时,会尝试把其他 filter 去除掉。原因是 In predicate 是精确的过滤条件,即使没有其他 filter 也可以高效过滤,如果同时使用则其他 filter 会做无用功。 - **Bitmap Filter**: @@ -190,32 +186,15 @@ Runtime Filter 的等待耗时。 **使用注意事项** -在开启 Runtime Filter 后,左表的 ScanNode 会为每一个分配给自己的 Runtime Filter 等待一段时间再扫描数据,即如果 ScanNode 被分配了 3 个 Runtime Filter,那么它最多会等待 3000ms。 +在开启 Runtime Filter 后,左表的 ScanNode 会为分配给自己的 Runtime Filter 等待一段时间再扫描数据。 因为 Runtime Filter 的构建和合并均需要时间,ScanNode 会尝试将等待时间内到达的 Runtime Filter 下推到存储引擎,如果超过等待时间后,ScanNode 会使用已经到达的 Runtime Filter 直接开始扫描数据。 -如果 Runtime Filter 在 ScanNode 开始扫描之后到达,则 ScanNode 不会将该 Runtime Filter 下推到存储引擎,而是对已经从存储引擎扫描上来的数据,在 ScanNode 上基于该 Runtime Filter 使用表达式过滤,之前已经扫描的数据则不会应用该 Runtime Filter,这样得到的中间数据规模会大于最优解,但可以避免严重的裂化。 +如果 Runtime Filter 在 ScanNode 开始扫描之后到达,则 ScanNode 不会将该 Runtime Filter 下推到存储引擎,而是对已经从存储引擎扫描上来的数据,在 ScanNode 上基于该 Runtime Filter 使用表达式过滤,之前已经扫描的数据则不会应用该 Runtime Filter,这样得到的中间数据规模会大于最优解,但可以避免严重的劣化。 如果集群比较繁忙,并且集群上有许多资源密集型或长耗时的查询,可以考虑增加等待时间,以避免复杂查询错过优化机会。如果集群负载较轻,并且集群上有许多只需要几秒的小查询,可以考虑减少等待时间,以避免每个查询增加 1s 的延迟。 -**4. runtime_filters_max_num** - -每个查询生成的 Runtime Filter 中 Bloom Filter 数量的上限。 - -**类型**: 整数,默认 10 - -**使用注意事项** 目前仅对 Bloom Filter 的数量进行限制,因为相比 MinMax Filter 和 IN predicate,Bloom Filter 构建和应用的代价更高。 - -如果生成的 Bloom Filter 超过允许的最大数量,则保留选择性大的 Bloom Filter,选择性大意味着预期可以过滤更多的行。这个设置可以防止 Bloom Filter 耗费过多的内存开销而导致潜在的问题。 - -```text -选择性=(HashJoinNode Cardinality / HashJoinNode left child Cardinality) --- 因为目前 FE 拿到 Cardinality 不准,所以这里 Bloom Filter 计算的选择性与实际不准,因此最终可能只是随机保留了部分 Bloom Filter。 -``` - -仅在对涉及大表间 join 的某些长耗时查询进行调优时,才需要调整此查询选项。 - -**5. Bloom Filter 长度相关参数** +**4. Bloom Filter 长度相关参数** 包括`runtime_bloom_filter_min_size`、`runtime_bloom_filter_max_size`、`runtime_bloom_filter_size`,用于确定 Runtime Filter 使用的 Bloom Filter 数据结构的大小(以字节为单位)。 @@ -223,7 +202,8 @@ Runtime Filter 的等待耗时。 **使用注意事项** 因为需要保证每个 HashJoinNode 构建的 Bloom Filter 长度相同才能合并,所以目前在 FE 查询规划时计算 Bloom Filter 的长度。 -如果能拿到 join 右表统计信息中的数据行数 (Cardinality),会尝试根据 Cardinality 估计 Bloom Filter 的最佳大小,并四舍五入到最接近的 2 的幂 (以 2 为底的 log 值)。如果无法拿到右表的 Cardinality,则会使用默认的 Bloom Filter 长度`runtime_bloom_filter_size`。`runtime_bloom_filter_min_size`和`runtime_bloom_filter_max_size`用于限制最终使用的 Bloom Filter 长度最小和最大值。 +如果能拿到 join 右表统计信息中的数据行数 (Cardinality),则会尝试根据 Cardinality 估计 Bloom Filter 的最佳大小,并四舍五入到最接近的 2 的幂 (以 2 为底的 log 值)。如果没有准确的统计信息,但是打开了 enable_sync_runtime_filter_size ,会根据实际运行时的数据行数来估计 Bloom Filter 的最佳大小,但是会有一些运行时统计带来的性能开销。 +最后如果仍无法拿到右表的 Cardinality,则会使用默认的 Bloom Filter 长度`runtime_bloom_filter_size`。`runtime_bloom_filter_min_size`和`runtime_bloom_filter_max_size`用于限制最终使用的 Bloom Filter 长度最小和最大值。 更大的 Bloom Filter 在处理高基数的输入集时更有效,但需要消耗更多的内存。假如查询中需要过滤高基数列(比如含有数百万个不同的取值),可以考虑增加`runtime_bloom_filter_size`的值进行一些基准测试,这有助于使 Bloom Filter 过滤的更加精准,从而获得预期的性能提升。 @@ -247,79 +227,81 @@ CREATE TABLE test2 (t2 INT) DISTRIBUTED BY HASH (t2) BUCKETS 2 PROPERTIES("repli INSERT INTO test2 VALUES (3), (4), (5); EXPLAIN SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2; -+-------------------------------------------------------------------+ -| Explain String | -+-------------------------------------------------------------------+ -| PLAN FRAGMENT 0 | -| OUTPUT EXPRS:`t1` | -| | -| 4:EXCHANGE | -| | -| PLAN FRAGMENT 1 | -| OUTPUT EXPRS: | -| PARTITION: HASH_PARTITIONED: `default_cluster:ssb`.`test`.`t1` | -| | -| 2:HASH JOIN | -| | join op: INNER JOIN (BUCKET_SHUFFLE) | -| | equal join conjunct: `test`.`t1` = `test2`.`t2` | -| | runtime filters: RF000[in] <- `test2`.`t2` | -| | | -| |----3:EXCHANGE | -| | | -| 0:OlapScanNode | -| TABLE: test | -| runtime filters: RF000[in] -> `test`.`t1` | -| | -| PLAN FRAGMENT 2 | -| OUTPUT EXPRS: | -| PARTITION: HASH_PARTITIONED: `default_cluster:ssb`.`test2`.`t2` | -| | -| 1:OlapScanNode | -| TABLE: test2 | -+-------------------------------------------------------------------+ --- 上面`runtime filters`的行显示了`PLAN FRAGMENT 1`的`2:HASH JOIN`生成了 ID 为 RF000 的 IN predicate, --- 其中`test2`.`t2`的 key values 仅在运行时可知, --- 在`0:OlapScanNode`使用了该 IN predicate 用于在读取`test`.`t1`时过滤不必要的数据。 ++--------------------------------------------------------------------------------------------------+ +| Explain String(Nereids Planner) | ++--------------------------------------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| OUTPUT EXPRS: | +| t1[#4] | +| PARTITION: HASH_PARTITIONED: t1[#1] | +| | +| HAS_COLO_PLAN_NODE: false | +| | +| VRESULT SINK | +| MYSQL_PROTOCAL | +| | +| 3:VHASH JOIN(157) | +| | join op: INNER JOIN(BUCKET_SHUFFLE)[] | +| | equal join conjunct: (t1[#1] = t2[#0]) | +| | runtime filters: RF000[min_max] <- t2[#0](3/4/2048), RF001[in_or_bloom] <- t2[#0](3/4/2048) | +| | cardinality=3 | +| | vec output tuple id: 3 | +| | output tuple id: 3 | +| | vIntermediate tuple ids: 2 | +| | hash output slot ids: 1 | +| | final projections: t1[#2] | +| | final project output tuple id: 3 | +| | distribute expr lists: t1[#1] | +| | distribute expr lists: t2[#0] | +| | | +| |----1:VEXCHANGE | +| | offset: 0 | +| | distribute expr lists: t2[#0] | +| | | +| 2:VOlapScanNode(150) | +| TABLE: test.test(test), PREAGGREGATION: ON | +| runtime filters: RF000[min_max] -> t1[#1], RF001[in_or_bloom] -> t1[#1] | +| partitions=1/1 (test) | +| tablets=2/2, tabletList=61032,61034 | +| cardinality=4, avgRowSize=0.0, numNodes=1 | +| pushAggOp=NONE | +| | +| PLAN FRAGMENT 1 | +| | +| PARTITION: HASH_PARTITIONED: t2[#0] | +| | +| HAS_COLO_PLAN_NODE: false | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 01 | +| BUCKET_SHFFULE_HASH_PARTITIONED: t2[#0] | +| | +| 0:VOlapScanNode(151) | +| TABLE: test.test2(test2), PREAGGREGATION: ON | +| partitions=1/1 (test2) | +| tablets=2/2, tabletList=61041,61043 | +| cardinality=3, avgRowSize=0.0, numNodes=1 | +| pushAggOp=NONE | ++--------------------------------------------------------------------------------------------------+ +-- 上面`runtime filters`的行显示了`PLAN FRAGMENT 1`的`2:HASH JOIN`生成了 ID 为 RF000 的 min_max 和 RF001 的 in_or_bloom, +-- 在`2:VOlapScanNode(150)`使用了 RF000/RF001 用于在读取`test`.`t1`时过滤不必要的数据。 SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2; -- 返回 2 行结果 [3, 4]; -- 通过 query 的 profile(set enable_profile=true;)可以查看查询内部工作的详细信息, -- 包括每个 Runtime Filter 是否下推、等待耗时、以及 OLAP_SCAN_NODE 从 prepare 到接收到 Runtime Filter 的总时长。 -RuntimeFilter:in: - - HasPushDownToEngine: true - - AWaitTimeCost: 0ns - - EffectTimeCost: 2.76ms +RuntimeFilter: (id = 1, type = in_or_bloomfilter): + - Info: [IsPushDown = true, RuntimeFilterState = READY, HasRemoteTarget = false, HasLocalTarget = true, Ignored = false] + - RealRuntimeFilterType: in + - InFilterSize: 3 + - always_true: 0 + - expr_filtered_rows: 0 + - expr_input_rows: 0 +-- 这里的 expr_input_rows 和 expr_filtered_rows 均为 0 是因为 in filter 根据 key range 直接提前过滤了数据,没有经过逐行计算。 + -- 此外,在 profile 的 OLAP_SCAN_NODE 中还可以查看 Runtime Filter 下推后的过滤效果和耗时。 - RowsVectorPredFiltered: 9.320008M (9320008) - VectorPredEvalTime: 364.39ms ``` - -## Runtime Filter 的规划规则 - -1. 只支持对 join on clause 中的等值条件生成 Runtime Filter,不包括 Null-safe 条件,因为其可能会过滤掉 join 左表的 null 值。 - -2. 不支持将 Runtime Filter 下推到 left outer、full outer、anti join 的左表; - -3. 不支持 src expr 或 target expr 是常量; - -4. 不支持 src expr 和 target expr 相等; - -5. 不支持 src expr 的类型等于`HLL`或者`BITMAP`; - -6. 目前仅支持将 Runtime Filter 下推给 OlapScanNode; - -7. 不支持 target expr 包含 NULL-checking 表达式,比如`COALESCE/IFNULL/CASE`,因为当 outer join 上层其他 join 的 join on clause 包含 NULL-checking 表达式并生成 Runtime Filter 时,将这个 Runtime Filter 下推到 outer join 的左表时可能导致结果不正确; - -8. 不支持 target expr 中的列(slot)无法在原始表中找到某个等价列; - -9. 不支持列传导,这包含两种情况: - - - 一是例如 join on clause 包含 A.k = B.k and B.k = C.k 时,目前 C.k 只可以下推给 B.k,而不可以下推给 A.k; - - - 二是例如 join on clause 包含 A.a + B.b = C.c,如果 A.a 可以列传导到 B.a,即 A.a 和 B.a 是等价的列,那么可以用 B.a 替换 A.a,然后可以尝试将 Runtime Filter 下推给 B(如果 A.a 和 B.a 不是等价列,则不能下推给 B,因为 target expr 必须与唯一一个 join 左表绑定); - -10. Target expr 和 src expr 的类型必须相等,因为 Bloom Filter 基于 hash,若类型不等则会尝试将 target expr 的类型转换为 src expr 的类型; - -11. 不支持`PlanNode.Conjuncts`生成的 Runtime Filter 下推,与 HashJoinNode 的`eqJoinConjuncts`和`otherJoinConjuncts`不同,`PlanNode.Conjuncts`生成的 Runtime Filter 在测试中发现可能会导致错误的结果,例如`IN`子查询转换为 join 时,自动生成的 join on clause 将保存在`PlanNode.Conjuncts`中,此时应用 Runtime Filter 可能会导致结果缺少一些行。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/query/join-optimization/runtime-filter.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/query/join-optimization/runtime-filter.md index 113be7f26b4..6c0b6c184a7 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/query/join-optimization/runtime-filter.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/query/join-optimization/runtime-filter.md @@ -26,7 +26,7 @@ under the License. # Runtime Filter -Runtime Filter 旨在为某些 Join 查询在运行时动态生成过滤条件,来减少扫描的数据量,避免不必要的 I/O 和网络传输,从而加速查询。 +Runtime Filter 旨在为某些 Join 查询在运行时动态生成过滤条件,来减少扫描的数据量,避免不必要的 I/O 和计算,从而加速查询。 ## 名词解释 @@ -38,6 +38,8 @@ Runtime Filter 旨在为某些 Join 查询在运行时动态生成过滤条件 - Join on clause: `A join B on A.a=B.b`中的`A.a=B.b`,在查询规划时基于此生成 join conjuncts,包含 join Build 和 Probe 使用的 expr,其中 Build expr 在 Runtime Filter 中称为 src expr,Probe expr 在 Runtime Filter 中称为 target expr。 +- rf: Runtime Filter的缩写。 + ## 原理 Runtime Filter 在查询规划时生成,在 HashJoinNode 中构建,在 ScanNode 中应用。 @@ -84,37 +86,35 @@ Runtime Filter 在查询规划时生成,在 HashJoinNode 中构建,在 ScanN | ``` -可见,和谓词下推、分区裁剪不同,Runtime Filter 是在运行时动态生成的过滤条件,即在查询运行时解析 join on clause 确定过滤表达式,并将表达式广播给正在读取左表的 ScanNode,从而减少扫描的数据量,进而减少 probe hash table 的次数,避免不必要的 I/O 和网络传输。 +可见,和谓词下推、分区裁剪不同,Runtime Filter 是在运行时动态生成的过滤条件,即在查询运行时解析 join on clause 确定过滤表达式,并将表达式广播给正在读取左表的 ScanNode,从而减少扫描的数据量,进而减少 probe hash table 的次数,避免不必要的 I/O 和计算。 -Runtime Filter 主要用于大表 join 小表的优化,如果左表的数据量太小,或者右表的数据量太大,则 Runtime Filter 可能不会取得预期效果。 +Runtime Filter 主要用于大表 join 小表的优化。如果左表的数据量太小,rf的提前过滤效果可能不大。如果右表的数据量太大,则在构建和传输rf时会有比较大的成本。 ## 使用方式 -### Runtime Filter 查询选项 +### Runtime Filter 配置项 -与 Runtime Filter 相关的查询选项信息,请参阅以下部分: +默认的配置已经尽可能的适配了大多数场景。仅在某些特定场景下,才需进一步调整以达到最优效果。通常只在性能测试后,针对资源密集型、运行耗时足够长且频率足够高的查询进行优化。 +与 Runtime Filter 相关的配置选项,请参阅以下部分: + - `enable_sync_runtime_filter_size`: 在优化器无法准确估计基数时,令执行器在生成rf之前同步并获取全局的Build端大小总和,根据这个实际大小来决定 IN Or Bloom Filter 的最终类型和 Bloom Filter 的大小。如果设置为 false 则不做同步操作获取全局大小,该变量默认值为 true 。 -- 第一个查询选项是调整使用的 Runtime Filter 类型,大多数情况下,您只需要调整这一个选项,其他选项保持默认即可。 - - - `runtime_filter_type`: 包括 Bloom Filter、MinMax Filter、IN predicate、IN Or Bloom Filter、Bitmap Filter,默认会使用 IN Or Bloom Filter,部分情况下同时使用 Bloom Filter、MinMax Filter、IN predicate 时性能更高。 + - `runtime_filter_max_in_num`: 如果Build端大小大于这个值,我们将不生成 IN predicate。该变量默认值为 1024 。 -- 其他查询选项通常仅在某些特定场景下,才需进一步调整以达到最优效果。通常只在性能测试后,针对资源密集型、运行耗时足够长且频率足够高的查询进行优化。 + - `runtime_filter_mode`: 用于调整 rf 的生成策略,包括 OFF、LOCAL、GLOBAL 三种策略。如果设置为 OFF 则不会生成rf。该变量默认值为 GLOBAL 。 - - `runtime_filter_mode`: 用于调整 Runtime Filter 的下推策略,包括 OFF、LOCAL、GLOBAL 三种策略,默认设置为 GLOBAL 策略 + - `runtime_filter_type`: 允许生成的rf类型,包括 Bloom Filter、MinMax Filter、IN predicate、IN Or Bloom Filter、Bitmap Filter。该变量默认值为 IN_OR_BLOOM_FILTER,MIN_MAX 。 - - `runtime_filter_wait_time_ms`: 左表的 ScanNode 等待每个 Runtime Filter 的时间,默认 1000ms + - `runtime_filter_wait_infinitely`: 如果设置为 true,那么左表的 scan 节点将会一直等待直到接收到 rf 或者查询超超时,相当于 runtime_filter_wait_time_ms 被设置为无限大。该变量默认值为 false 。 - - `runtime_filters_max_num`: 每个查询可应用的 Runtime Filter 中 Bloom Filter 的最大数量,默认 10 + - `runtime_filter_wait_time_ms`: 左表的 ScanNode 等待 rf 的时间。如果超过了等待时间仍然没有收到 rf,则 ScanNode 会先开始扫描数据,后续收到的rf会对此时刻该 ScanNode 还没有返回的数据生效。该变量默认值为 1000 。 - - `runtime_bloom_filter_min_size`: Runtime Filter 中 Bloom Filter 的最小长度,默认 1048576(1M) + - `runtime_bloom_filter_min_size`: 优化器预估的 rf 中 Bloom Filter 的最小长度,该变量默认值为 1048576(1M)。 - - `runtime_bloom_filter_max_size`: Runtime Filter 中 Bloom Filter 的最大长度,默认 16777216(16M) + - `runtime_bloom_filter_max_size`: 优化器预估的 rf 中 Bloom Filter 的最大长度,该变量默认值为 16777216(16M)。 - - `runtime_bloom_filter_size`: Runtime Filter 中 Bloom Filter 的默认长度,默认 2097152(2M) + - `runtime_bloom_filter_size`: 优化器预估的 rf 中 Bloom Filter 的默认长度,该变量默认值为 2097152(2M)。 - - `runtime_filter_max_in_num`: 如果 join 右表数据行数大于这个值,我们将不生成 IN predicate,默认 1024 - - `runtime_filter_wait_infinitely`: 如果参数为 true,那么左表的 scan 节点将会一直等待直到接收到 runtime filer 或者查询超超时,默认为 false 下面对查询选项做进一步说明。 @@ -138,15 +138,13 @@ set runtime_filter_type=7; - **IN or Bloom Filter**: 根据右表在执行过程中的真实行数,由系统自动判断使用 IN predicate 还是 Bloom Filter - - 默认在右表数据行数少于 102400 时会使用 IN predicate(可通过 session 变量中的`runtime_filter_max_in_num`调整),否则使用 Bloom filter。 + - 默认在右表数据行数少于 runtime_filter_max_in_num 时会使用 IN predicate,否则使用 Bloom filter。 - **Bloom Filter**: 有一定的误判率,导致过滤的数据比预期少一点,但不会导致最终结果不准确,在大部分情况下 Bloom Filter 都可以提升性能或对性能没有显著影响,但在部分情况下会导致性能降低。 - Bloom Filter 构建和应用的开销较高,所以当过滤率较低时,或者左表数据量较少时,Bloom Filter 可能会导致性能降低。 - - - 目前只有左表的 Key 列应用 Bloom Filter 才能下推到存储引擎,而测试结果显示 Bloom Filter 不下推到存储引擎时往往会导致性能降低。 - - - 目前 Bloom Filter 仅在 ScanNode 上使用表达式过滤时有短路 (short-circuit) 逻辑,即当假阳性率过高时,不继续使用 Bloom Filter,但当 Bloom Filter 下推到存储引擎后没有短路逻辑,所以当过滤率较低时可能导致性能降低。 + - Bloom Filter 过大,可能会导致构建/传输/过滤耗时较大。 + - **MinMax Filter**: 包含最大值和最小值,从而过滤小于最小值和大于最大值的数据,MinMax Filter 的过滤效果与 join on clause 中 Key 列的类型和左右表数据分布有关。 @@ -156,9 +154,7 @@ set runtime_filter_type=7; - **IN predicate**: 根据 join on clause 中 Key 列在右表上的所有值构建 IN predicate,使用构建的 IN predicate 在左表上过滤,相比 Bloom Filter 构建和应用的开销更低,在右表数据量较少时往往性能更高。 - - 目前 IN predicate 已实现合并方法。 - - - 当同时指定 In predicate 和其他 filter,并且 in 的过滤数值没达到 runtime_filter_max_in_num 时,会尝试把其他 filter 去除掉。原因是 In predicate 是精确的过滤条件,即使没有其他 filter 也可以高效过滤,如果同时使用则其他 filter 会做无用功。目前仅在 Runtime filter 的生产者和消费者处于同一个 fragment 时才会有去除非 in filter 的逻辑。 + - 当同时指定 In predicate 和其他 filter,并且 in 的过滤数值没达到 runtime_filter_max_in_num 时,会尝试把其他 filter 去除掉。原因是 In predicate 是精确的过滤条件,即使没有其他 filter 也可以高效过滤,如果同时使用则其他 filter 会做无用功。 - **Bitmap Filter**: @@ -190,32 +186,15 @@ Runtime Filter 的等待耗时。 **使用注意事项** -在开启 Runtime Filter 后,左表的 ScanNode 会为每一个分配给自己的 Runtime Filter 等待一段时间再扫描数据,即如果 ScanNode 被分配了 3 个 Runtime Filter,那么它最多会等待 3000ms。 +在开启 Runtime Filter 后,左表的 ScanNode 会为分配给自己的 Runtime Filter 等待一段时间再扫描数据。 因为 Runtime Filter 的构建和合并均需要时间,ScanNode 会尝试将等待时间内到达的 Runtime Filter 下推到存储引擎,如果超过等待时间后,ScanNode 会使用已经到达的 Runtime Filter 直接开始扫描数据。 -如果 Runtime Filter 在 ScanNode 开始扫描之后到达,则 ScanNode 不会将该 Runtime Filter 下推到存储引擎,而是对已经从存储引擎扫描上来的数据,在 ScanNode 上基于该 Runtime Filter 使用表达式过滤,之前已经扫描的数据则不会应用该 Runtime Filter,这样得到的中间数据规模会大于最优解,但可以避免严重的裂化。 +如果 Runtime Filter 在 ScanNode 开始扫描之后到达,则 ScanNode 不会将该 Runtime Filter 下推到存储引擎,而是对已经从存储引擎扫描上来的数据,在 ScanNode 上基于该 Runtime Filter 使用表达式过滤,之前已经扫描的数据则不会应用该 Runtime Filter,这样得到的中间数据规模会大于最优解,但可以避免严重的劣化。 如果集群比较繁忙,并且集群上有许多资源密集型或长耗时的查询,可以考虑增加等待时间,以避免复杂查询错过优化机会。如果集群负载较轻,并且集群上有许多只需要几秒的小查询,可以考虑减少等待时间,以避免每个查询增加 1s 的延迟。 -**4. runtime_filters_max_num** - -每个查询生成的 Runtime Filter 中 Bloom Filter 数量的上限。 - -**类型**: 整数,默认 10 - -**使用注意事项** 目前仅对 Bloom Filter 的数量进行限制,因为相比 MinMax Filter 和 IN predicate,Bloom Filter 构建和应用的代价更高。 - -如果生成的 Bloom Filter 超过允许的最大数量,则保留选择性大的 Bloom Filter,选择性大意味着预期可以过滤更多的行。这个设置可以防止 Bloom Filter 耗费过多的内存开销而导致潜在的问题。 - -```text -选择性=(HashJoinNode Cardinality / HashJoinNode left child Cardinality) --- 因为目前 FE 拿到 Cardinality 不准,所以这里 Bloom Filter 计算的选择性与实际不准,因此最终可能只是随机保留了部分 Bloom Filter。 -``` - -仅在对涉及大表间 join 的某些长耗时查询进行调优时,才需要调整此查询选项。 - -**5. Bloom Filter 长度相关参数** +**4. Bloom Filter 长度相关参数** 包括`runtime_bloom_filter_min_size`、`runtime_bloom_filter_max_size`、`runtime_bloom_filter_size`,用于确定 Runtime Filter 使用的 Bloom Filter 数据结构的大小(以字节为单位)。 @@ -223,7 +202,8 @@ Runtime Filter 的等待耗时。 **使用注意事项** 因为需要保证每个 HashJoinNode 构建的 Bloom Filter 长度相同才能合并,所以目前在 FE 查询规划时计算 Bloom Filter 的长度。 -如果能拿到 join 右表统计信息中的数据行数 (Cardinality),会尝试根据 Cardinality 估计 Bloom Filter 的最佳大小,并四舍五入到最接近的 2 的幂 (以 2 为底的 log 值)。如果无法拿到右表的 Cardinality,则会使用默认的 Bloom Filter 长度`runtime_bloom_filter_size`。`runtime_bloom_filter_min_size`和`runtime_bloom_filter_max_size`用于限制最终使用的 Bloom Filter 长度最小和最大值。 +如果能拿到 join 右表统计信息中的数据行数 (Cardinality),则会尝试根据 Cardinality 估计 Bloom Filter 的最佳大小,并四舍五入到最接近的 2 的幂 (以 2 为底的 log 值)。如果没有准确的统计信息,但是打开了 enable_sync_runtime_filter_size ,会根据实际运行时的数据行数来估计 Bloom Filter 的最佳大小,但是会有一些运行时统计带来的性能开销。 +最后如果仍无法拿到右表的 Cardinality,则会使用默认的 Bloom Filter 长度`runtime_bloom_filter_size`。`runtime_bloom_filter_min_size`和`runtime_bloom_filter_max_size`用于限制最终使用的 Bloom Filter 长度最小和最大值。 更大的 Bloom Filter 在处理高基数的输入集时更有效,但需要消耗更多的内存。假如查询中需要过滤高基数列(比如含有数百万个不同的取值),可以考虑增加`runtime_bloom_filter_size`的值进行一些基准测试,这有助于使 Bloom Filter 过滤的更加精准,从而获得预期的性能提升。 @@ -247,79 +227,81 @@ CREATE TABLE test2 (t2 INT) DISTRIBUTED BY HASH (t2) BUCKETS 2 PROPERTIES("repli INSERT INTO test2 VALUES (3), (4), (5); EXPLAIN SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2; -+-------------------------------------------------------------------+ -| Explain String | -+-------------------------------------------------------------------+ -| PLAN FRAGMENT 0 | -| OUTPUT EXPRS:`t1` | -| | -| 4:EXCHANGE | -| | -| PLAN FRAGMENT 1 | -| OUTPUT EXPRS: | -| PARTITION: HASH_PARTITIONED: `default_cluster:ssb`.`test`.`t1` | -| | -| 2:HASH JOIN | -| | join op: INNER JOIN (BUCKET_SHUFFLE) | -| | equal join conjunct: `test`.`t1` = `test2`.`t2` | -| | runtime filters: RF000[in] <- `test2`.`t2` | -| | | -| |----3:EXCHANGE | -| | | -| 0:OlapScanNode | -| TABLE: test | -| runtime filters: RF000[in] -> `test`.`t1` | -| | -| PLAN FRAGMENT 2 | -| OUTPUT EXPRS: | -| PARTITION: HASH_PARTITIONED: `default_cluster:ssb`.`test2`.`t2` | -| | -| 1:OlapScanNode | -| TABLE: test2 | -+-------------------------------------------------------------------+ --- 上面`runtime filters`的行显示了`PLAN FRAGMENT 1`的`2:HASH JOIN`生成了 ID 为 RF000 的 IN predicate, --- 其中`test2`.`t2`的 key values 仅在运行时可知, --- 在`0:OlapScanNode`使用了该 IN predicate 用于在读取`test`.`t1`时过滤不必要的数据。 ++--------------------------------------------------------------------------------------------------+ +| Explain String(Nereids Planner) | ++--------------------------------------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| OUTPUT EXPRS: | +| t1[#4] | +| PARTITION: HASH_PARTITIONED: t1[#1] | +| | +| HAS_COLO_PLAN_NODE: false | +| | +| VRESULT SINK | +| MYSQL_PROTOCAL | +| | +| 3:VHASH JOIN(157) | +| | join op: INNER JOIN(BUCKET_SHUFFLE)[] | +| | equal join conjunct: (t1[#1] = t2[#0]) | +| | runtime filters: RF000[min_max] <- t2[#0](3/4/2048), RF001[in_or_bloom] <- t2[#0](3/4/2048) | +| | cardinality=3 | +| | vec output tuple id: 3 | +| | output tuple id: 3 | +| | vIntermediate tuple ids: 2 | +| | hash output slot ids: 1 | +| | final projections: t1[#2] | +| | final project output tuple id: 3 | +| | distribute expr lists: t1[#1] | +| | distribute expr lists: t2[#0] | +| | | +| |----1:VEXCHANGE | +| | offset: 0 | +| | distribute expr lists: t2[#0] | +| | | +| 2:VOlapScanNode(150) | +| TABLE: test.test(test), PREAGGREGATION: ON | +| runtime filters: RF000[min_max] -> t1[#1], RF001[in_or_bloom] -> t1[#1] | +| partitions=1/1 (test) | +| tablets=2/2, tabletList=61032,61034 | +| cardinality=4, avgRowSize=0.0, numNodes=1 | +| pushAggOp=NONE | +| | +| PLAN FRAGMENT 1 | +| | +| PARTITION: HASH_PARTITIONED: t2[#0] | +| | +| HAS_COLO_PLAN_NODE: false | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 01 | +| BUCKET_SHFFULE_HASH_PARTITIONED: t2[#0] | +| | +| 0:VOlapScanNode(151) | +| TABLE: test.test2(test2), PREAGGREGATION: ON | +| partitions=1/1 (test2) | +| tablets=2/2, tabletList=61041,61043 | +| cardinality=3, avgRowSize=0.0, numNodes=1 | +| pushAggOp=NONE | ++--------------------------------------------------------------------------------------------------+ +-- 上面`runtime filters`的行显示了`PLAN FRAGMENT 1`的`2:HASH JOIN`生成了 ID 为 RF000 的 min_max 和 RF001 的 in_or_bloom, +-- 在`2:VOlapScanNode(150)`使用了 RF000/RF001 用于在读取`test`.`t1`时过滤不必要的数据。 SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2; -- 返回 2 行结果 [3, 4]; -- 通过 query 的 profile(set enable_profile=true;)可以查看查询内部工作的详细信息, -- 包括每个 Runtime Filter 是否下推、等待耗时、以及 OLAP_SCAN_NODE 从 prepare 到接收到 Runtime Filter 的总时长。 -RuntimeFilter:in: - - HasPushDownToEngine: true - - AWaitTimeCost: 0ns - - EffectTimeCost: 2.76ms +RuntimeFilter: (id = 1, type = in_or_bloomfilter): + - Info: [IsPushDown = true, RuntimeFilterState = READY, HasRemoteTarget = false, HasLocalTarget = true, Ignored = false] + - RealRuntimeFilterType: in + - InFilterSize: 3 + - always_true: 0 + - expr_filtered_rows: 0 + - expr_input_rows: 0 +-- 这里的 expr_input_rows 和 expr_filtered_rows 均为 0 是因为 in filter 根据 key range 直接提前过滤了数据,没有经过逐行计算。 + -- 此外,在 profile 的 OLAP_SCAN_NODE 中还可以查看 Runtime Filter 下推后的过滤效果和耗时。 - RowsVectorPredFiltered: 9.320008M (9320008) - VectorPredEvalTime: 364.39ms ``` - -## Runtime Filter 的规划规则 - -1. 只支持对 join on clause 中的等值条件生成 Runtime Filter,不包括 Null-safe 条件,因为其可能会过滤掉 join 左表的 null 值。 - -2. 不支持将 Runtime Filter 下推到 left outer、full outer、anti join 的左表; - -3. 不支持 src expr 或 target expr 是常量; - -4. 不支持 src expr 和 target expr 相等; - -5. 不支持 src expr 的类型等于`HLL`或者`BITMAP`; - -6. 目前仅支持将 Runtime Filter 下推给 OlapScanNode; - -7. 不支持 target expr 包含 NULL-checking 表达式,比如`COALESCE/IFNULL/CASE`,因为当 outer join 上层其他 join 的 join on clause 包含 NULL-checking 表达式并生成 Runtime Filter 时,将这个 Runtime Filter 下推到 outer join 的左表时可能导致结果不正确; - -8. 不支持 target expr 中的列(slot)无法在原始表中找到某个等价列; - -9. 不支持列传导,这包含两种情况: - - - 一是例如 join on clause 包含 A.k = B.k and B.k = C.k 时,目前 C.k 只可以下推给 B.k,而不可以下推给 A.k; - - - 二是例如 join on clause 包含 A.a + B.b = C.c,如果 A.a 可以列传导到 B.a,即 A.a 和 B.a 是等价的列,那么可以用 B.a 替换 A.a,然后可以尝试将 Runtime Filter 下推给 B(如果 A.a 和 B.a 不是等价列,则不能下推给 B,因为 target expr 必须与唯一一个 join 左表绑定); - -10. Target expr 和 src expr 的类型必须相等,因为 Bloom Filter 基于 hash,若类型不等则会尝试将 target expr 的类型转换为 src expr 的类型; - -11. 不支持`PlanNode.Conjuncts`生成的 Runtime Filter 下推,与 HashJoinNode 的`eqJoinConjuncts`和`otherJoinConjuncts`不同,`PlanNode.Conjuncts`生成的 Runtime Filter 在测试中发现可能会导致错误的结果,例如`IN`子查询转换为 join 时,自动生成的 join on clause 将保存在`PlanNode.Conjuncts`中,此时应用 Runtime Filter 可能会导致结果缺少一些行。 diff --git a/versioned_docs/version-2.1/query/join-optimization/runtime-filter.md b/versioned_docs/version-2.1/query/join-optimization/runtime-filter.md index 8cf514fbe68..2d46d03af2f 100644 --- a/versioned_docs/version-2.1/query/join-optimization/runtime-filter.md +++ b/versioned_docs/version-2.1/query/join-optimization/runtime-filter.md @@ -26,15 +26,14 @@ under the License. # Runtime Filter -Runtime Filter is designed to dynamically generate filter conditions for certain Join queries at runtime to reduce the amount of scanned data, avoid unnecessary I/O and network transmission, and speed up the query. - +Runtime Filter is designed to dynamically generate filter conditions for certain Join queries at runtime to reduce the amount of data scanned, avoid unnecessary I/O and calculations, and thereby speed up queries. ## Noun Interpretation * Left table: the table on the left during Join query. Perform Probe operation. The order can be adjusted by Join Reorder. * Right table: the table on the right during Join query. Perform the Build operation. The order can be adjusted by Join Reorder. * Fragment: FE will convert the execution of specific SQL statements into corresponding fragments and send them to BE for execution. The corresponding Fragment is executed on the BE, and the results are aggregated and returned to the FE. * Join on clause: `Aa=Bb` in `A join B on Aa=Bb`, based on this to generate join conjuncts during query planning, including expr used by join Build and Probe, where Build expr is called in Runtime Filter src expr, Probe expr are called target expr in Runtime Filter. - +- rf: Abbreviation of Runtime Filter. ## Principle Runtime Filter is generated during query planning, constructed in HashJoinNode, and applied in ScanNode. @@ -75,35 +74,35 @@ If the filter condition (Runtime Filter) can be pushed down to the storage engin | T1 T2 | ``` -It can be seen that, unlike predicate push-down and partition cutting, Runtime Filter is a filter condition dynamically generated at runtime, that is, when the query is run, the join on clause is parsed to determine the filter expression, and the expression is broadcast to ScanNode that is reading the left table , Thereby reducing the amount of scanned data, thereby reducing the number of probe hash table, avoiding unnecessary I/O and network transmission. +It can be seen that, unlike predicate pushdown and partition pruning, Runtime Filter is a filter condition dynamically generated at runtime, that is the join on clause is parsed to determine the filter expression when the query is running, and the expression is broadcast to the ScanNode that is reading the left table , thereby reducing the amount of data scanned, thereby reducing the number of probe hash tables and avoiding unnecessary I/O and calculations. -Runtime Filter is mainly used to optimize joins for large tables. If the amount of data in the left table is too small, or the amount of data in the right table is too large, the Runtime Filter may not achieve the expected effect. +Runtime Filter is mainly used to optimize the join of large tables and small tables. If the amount of data in the left table is too small, the effect of rf's early filtering may not be great. If the amount of data in the right table is too large, there will be a relatively large cost when building and transmitting rf. ## Usage ### Runtime Filter query options -For query options related to Runtime Filter, please refer to the following sections: +The default configuration has been adapted to most scenarios as much as possible. Only in some specific scenarios, further adjustments are required to achieve the best results. Usually, optimization is only performed for resource-intensive queries that take a long enough time to run and are frequent enough after performance testing. -- The first query option is to adjust the type of Runtime Filter used. In most cases, you only need to adjust this option, and keep the other options as default. +For configuration options related to Runtime Filter, please refer to the following section: - - `runtime_filter_type`: Including Bloom Filter, MinMax Filter, IN predicate, IN_OR_BLOOM Filter and Bitmap_Filter. By default, only IN_OR_BLOOM Filter will be used. In some cases, the performance will be higher when both Bloom Filter, MinMax Filter and IN predicate are used at the same time. +- `enable_sync_runtime_filter_size`: When the optimizer cannot accurately estimate the cardinality, the executor is required to synchronize and obtain the global Build end size before generating rf, and determine the final type of IN Or Bloom Filter and the size of Bloom Filter based on this actual size. If set to false, no synchronization operation is performed to obtain the global size. The default value of this variable is true. -- Other query options usually only need to be further adjusted in certain specific scenarios to achieve the best results. Usually only after performance testing, optimize for resource-intensive, long enough running time and high enough frequency queries. +- `runtime_filter_max_in_num`: If the Build-side size is larger than this value, we will not generate IN predicate. The default value of this variable is 1024. - - `runtime_filter_mode`: Used to adjust the push-down strategy of Runtime Filter, including three strategies of OFF, LOCAL, and GLOBAL. The default setting is the GLOBAL strategy +- `runtime_filter_mode`: Used to adjust the generation strategy of rf, including OFF, LOCAL, and GLOBAL. If set to OFF, rf will not be generated. The default value of this variable is GLOBAL. - - `runtime_filter_wait_time_ms`: the time that ScanNode in the left table waits for each Runtime Filter, the default is 1000ms +- `runtime_filter_type`: The types of rf allowed to be generated, including Bloom Filter, MinMax Filter, IN predicate, IN Or Bloom Filter, and Bitmap Filter. The default value of this variable is IN_OR_BLOOM_FILTER,MIN_MAX. - - `runtime_filters_max_num`: The maximum number of Bloom Filters in the Runtime Filter that can be applied to each query, the default is 10 +- `runtime_filter_wait_infinitely`: If set to true, the scan node of the left table will wait until rf is received or the query times out, which is equivalent to runtime_filter_wait_time_ms being set to infinity. The default value of this variable is false. - - `runtime_bloom_filter_min_size`: the minimum length of Bloom Filter in Runtime Filter, default 1048576 (1M) +- `runtime_filter_wait_time_ms`: The time the ScanNode of the left table waits for rf. If the waiting time has passed and no rf is received, the ScanNode will start scanning the data first, and the rf received later will take effect on the data that the ScanNode has not returned at this moment. The default value of this variable is 1000. - - `runtime_bloom_filter_max_size`: the maximum length of Bloom Filter in Runtime Filter, the default is 16777216 (16M) +- `runtime_bloom_filter_min_size`: The minimum length of the Bloom Filter in the rf estimated by the optimizer. The default value of this variable is 1048576 (1M). - - `runtime_bloom_filter_size`: The default length of Bloom Filter in Runtime Filter, the default is 2097152 (2M) +- `runtime_bloom_filter_max_size`: The maximum length of the Bloom Filter in the rf estimated by the optimizer. The default value of this variable is 16777216 (16M). - - `runtime_filter_max_in_num`: If the number of rows in the right table of the join is greater than this value, we will not generate an IN predicate, the default is 1024 +- `runtime_bloom_filter_size`: The default length of the Bloom Filter in the rf estimated by the optimizer. The default value of this variable is 2097152 (2M). The query options are further explained below. @@ -121,23 +120,28 @@ set runtime_filter_type=7; **Precautions for use** -- **IN or Bloom Filter**: According to the actual number of rows in the right table during execution, the system automatically determines whether to use IN predicate or Bloom Filter. - - By default, IN Predicate will be used when the number of data rows in the right table is less than 102400 (which can be adjusted by ` runtime_filter_max_in_num 'in the session variable). Otherwise, use bloom filter. -- **Bloom Filter**: There is a certain misjudgment rate, which results in the filtered data being a little less than expected, but it will not cause the final result to be inaccurate. In most cases, Bloom Filter can improve performance or has no significant impact on performance, but in some cases Under circumstances will cause performance degradation. - - Bloom Filter construction and application overhead is high, so when the filtering rate is low, or the amount of data in the left table is small, Bloom Filter may cause performance degradation. - - At present, only the Key column of the left table can be pushed down to the storage engine if the Bloom Filter is applied, and the test results show that the performance of the Bloom Filter is often reduced when the Bloom Filter is not pushed down to the storage engine. - - Currently Bloom Filter only has short-circuit logic when using expression filtering on ScanNode, that is, when the false positive rate is too high, the Bloom Filter will not continue to be used, but there is no short-circuit logic when the Bloom Filter is pushed down to the storage engine , So when the filtration rate is low, it may cause performance degradation. +- **IN or Bloom Filter**: Based on the actual number of rows in the right table during execution, the system automatically determines whether to use IN predicate or Bloom Filter. + + - By default, IN predicate will be used when the number of data rows in the right table is less than runtime_filter_max_in_num, otherwise Bloom filter will be used. + +- **Bloom Filter**: There is a certain misjudgment rate, resulting in a little less filtered data than expected, but it will not cause the final result to be inaccurate. In most cases, Bloom Filter can improve performance or have no significant impact on performance. Impact, but may result in reduced performance in some cases. + + - Bloom Filter construction and application overhead is high, so when the filtering rate is low, or when the amount of data in the left table is small, Bloom Filter may cause performance degradation. + - If the Bloom Filter is too large, it may take longer to build/transmit/filter. + + +- **MinMax Filter**: Contains the maximum value and the minimum value, thereby filtering data smaller than the minimum value and larger than the maximum value. The filtering effect of MinMax Filter is related to the type of the Key column in the join on clause and the data distribution of the left and right tables. + + - When the type of the Key column in the join on clause is int/bigint/double, etc., in extreme cases, if the maximum and minimum values of the left and right tables are the same, there will be no effect. Otherwise, the maximum value of the right table is smaller than the minimum value of the left table, or the right table is the smallest. If the value is greater than the maximum value in the left table, the effect will be best. + + - When the type of the Key column in the join on clause is varchar, etc., applying MinMax Filter will often lead to performance degradation. -- **MinMax Filter**: Contains the maximum value and the minimum value, thereby filtering data smaller than the minimum value and greater than the maximum value. The filtering effect of the MinMax Filter is related to the type of the Key column in the join on clause and the data distribution of the left and right tables. - - When the type of the Key column in the join on clause is int/bigint/double, etc., in extreme cases, if the maximum and minimum values of the left and right tables are the same, there is no effect, otherwise the maximum value of the right table is less than the minimum value of the left table, or the minimum of the right table The value is greater than the maximum value in the left table, the effect is best. - - When the type of the Key column in the join on clause is varchar, etc., applying the MinMax Filter will often cause performance degradation. +- **IN predicate**: Construct an IN predicate based on all the values of the Key column in the join on clause on the right table, and use the constructed IN predicate to filter on the left table. Compared with Bloom Filter, the construction and application overhead is lower. Performance is often higher when the amount of data in the right table is smaller. -- **IN predicate**: Construct IN predicate based on all the values of Key listed in the join on clause on the right table, and use the constructed IN predicate to filter on the left table. Compared with Bloom Filter, the cost of construction and application is lower. The amount of data in the right table is lower. When it is less, it tends to perform better. - - Currently IN predicate already implement a merge method. - - When IN predicate and other filters are specified at the same time, and the filtering value of IN predicate does not reach runtime_filter_max_in_num will try to remove other filters. The reason is that IN predicate is an accurate filtering condition. Even if there is no other filter, it can filter efficiently. If it is used at the same time, other filters will do useless work. Currently, only when the producer and consumer of the runtime filter are in the same fragment can there be [...] + - When In predicate and other filters are specified at the same time, and the filter value of in does not reach runtime_filter_max_in_num, other filters will be tried to be removed. The reason is that In predicate is a precise filtering condition, which can filter efficiently even without other filters. If used at the same time, other filters will do useless work. - **Bitmap Filter**: - - Currently, the bitmap filter is used only when the subquery in the [in subquery](../../sql-manual/sql-statements/Operators/in) operation returns a bitmap column. + - Bitmap filter is currently only used when the subquery in the [in subquery](../../sql-manual/sql-statements/Operators/in) operation returns a bitmap column. #### 2.runtime_filter_mode Used to control the transmission range of Runtime Filter between instances. @@ -163,30 +167,15 @@ Waiting for Runtime Filter is time consuming. **Precautions for use** -After the Runtime Filter is turned on, the ScanNode in the table on the left will wait for a period of time for each Runtime Filter assigned to itself before scanning the data, that is, if the ScanNode is assigned 3 Runtime Filters, it will wait at most 3000ms. +After the Runtime Filter is turned on, the ScanNode of the left table will wait for a while for the Runtime Filter assigned to it before scanning the data. Because it takes time to build and merge the Runtime Filter, ScanNode will try to push down the Runtime Filter that arrives within the waiting time to the storage engine. If the waiting time is exceeded, ScanNode will directly start scanning data using the Runtime Filter that has arrived. -If the Runtime Filter arrives after ScanNode starts scanning, ScanNode will not push the Runtime Filter down to the storage engine. Instead, it will use expression filtering on ScanNode based on the Runtime Filter for the data that has been scanned from the storage engine. The scanned data will not apply the Runtime Filter, so the intermediate data size obtained will be larger than the optimal solution, but serious cracking can be avoided. +If the Runtime Filter arrives after the ScanNode starts scanning, the ScanNode will not push the Runtime Filter down to the storage engine. Instead, the ScanNode will use an expression to filter the data that has been scanned from the storage engine based on the Runtime Filter. The Runtime Filter will not be applied to the data that has been scanned before. The size of the intermediate data obtained in this way will be larger than the optimal solution, but serious degradation can be avoided. If the cluster is busy and there are many resource-intensive or long-time-consuming queries on the cluster, consider increasing the waiting time to avoid missing optimization opportunities for complex queries. If the cluster load is light, and there are many small queries on the cluster that only take a few seconds, you can consider reducing the waiting time to avoid an increase of 1s for each query. -#### 4.runtime_filters_max_num -The upper limit of the number of Bloom Filters in the Runtime Filter generated by each query. - -**Type**: integer, default 10 - -**Precautions for use** -Currently, only the number of Bloom Filters is limited, because the construction and application of Bloom Filters are more expensive than MinMax Filter and IN predicate. - -If the number of Bloom Filters generated exceeds the maximum allowable number, then the Bloom Filter with a large selectivity is retained. A large selectivity means that more rows are expected to be filtered. This setting can prevent Bloom Filter from consuming too much memory overhead and causing potential problems. -``` -Selectivity = (HashJoinNode Cardinality / HashJoinNode left child Cardinality) -- Because the cardinality of FE is currently inaccurate, the selectivity of Bloom Filter calculation here is inaccurate, so in the end, it may only randomly reserve part of Bloom Filter. -``` -This query option needs to be adjusted only when tuning some long-consuming queries involving joins between large tables. - -#### 5. Bloom Filter length related parameters +#### 4. Bloom Filter length related parameters Including `runtime_bloom_filter_min_size`, `runtime_bloom_filter_max_size`, `runtime_bloom_filter_size`, used to determine the size (in bytes) of the Bloom Filter data structure used by the Runtime Filter. **Type**: Integer @@ -194,7 +183,8 @@ Including `runtime_bloom_filter_min_size`, `runtime_bloom_filter_max_size`, `run **Precautions for use** Because it is necessary to ensure that the length of the Bloom Filter constructed by each HashJoinNode is the same to be merged, the length of the Bloom Filter is currently calculated in the FE query planning. -If you can get the number of data rows (Cardinality) in the statistical information of the join right table, it will try to estimate the optimal size of the Bloom Filter based on Cardinality, and round to the nearest power of 2 (log value with the base 2). If the Cardinality of the table on the right cannot be obtained, the default Bloom Filter length `runtime_bloom_filter_size` will be used. `runtime_bloom_filter_min_size` and `runtime_bloom_filter_max_size` are used to limit the minimu [...] +If the number of data rows (Cardinality) in the statistics of the right table of the join can be obtained, the optimal size of the Bloom Filter will be estimated based on the Cardinality and rounded to the nearest power of 2 (log value with base 2). If there is no accurate statistics, but enable_sync_runtime_filter_size is turned on, the optimal size of the Bloom Filter will be estimated based on the actual number of data rows at runtime, but there will be some performance overhead cause [...] +Finally, if the Cardinality of the right table is still not available, the default Bloom Filter length `runtime_bloom_filter_size` will be used. `runtime_bloom_filter_min_size` and `runtime_bloom_filter_max_size` are used to limit the minimum and maximum lengths of the Bloom Filter that are ultimately used. Larger Bloom Filters are more effective when processing high-cardinality input sets, but require more memory. If the query needs to filter high cardinality columns (for example, containing millions of different values), you can consider increasing the value of `runtime_bloom_filter_size` for some benchmark tests, which will help make the Bloom Filter filter more accurate, so as to obtain the expected Performance improvement. @@ -215,39 +205,64 @@ CREATE TABLE test2 (t2 INT) DISTRIBUTED BY HASH (t2) BUCKETS 2 PROPERTIES("repli INSERT INTO test2 VALUES (3), (4), (5); EXPLAIN SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2; -+-------------------------------------------------------------------+ -| Explain String | -+-------------------------------------------------------------------+ -| PLAN FRAGMENT 0 | -| OUTPUT EXPRS:`t1` | -| | -| 4:EXCHANGE | -| | -| PLAN FRAGMENT 1 | -| OUTPUT EXPRS: | -| PARTITION: HASH_PARTITIONED: `default_cluster:ssb`.`test`.`t1` | -| | -| 2:HASH JOIN | -| | join op: INNER JOIN (BUCKET_SHUFFLE) | -| | equal join conjunct: `test`.`t1` = `test2`.`t2` | -| | runtime filters: RF000[in] <- `test2`.`t2` | -| | | -| |----3:EXCHANGE | -| | | -| 0:OlapScanNode | -| TABLE: test | -| runtime filters: RF000[in] -> `test`.`t1` | -| | -| PLAN FRAGMENT 2 | -| OUTPUT EXPRS: | -| PARTITION: HASH_PARTITIONED: `default_cluster:ssb`.`test2`.`t2` | -| | -| 1:OlapScanNode | -| TABLE: test2 | -+-------------------------------------------------------------------+ --- The line of `runtime filters` above shows that `2:HASH JOIN` of `PLAN FRAGMENT 1` generates IN predicate with ID RF000, --- Among them, the key values of `test2`.`t2` are only known at runtime, --- This IN predicate is used in `0:OlapScanNode` to filter unnecessary data when reading `test`.`t1`. ++--------------------------------------------------------------------------------------------------+ +| Explain String(Nereids Planner) | ++--------------------------------------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| OUTPUT EXPRS: | +| t1[#4] | +| PARTITION: HASH_PARTITIONED: t1[#1] | +| | +| HAS_COLO_PLAN_NODE: false | +| | +| VRESULT SINK | +| MYSQL_PROTOCAL | +| | +| 3:VHASH JOIN(157) | +| | join op: INNER JOIN(BUCKET_SHUFFLE)[] | +| | equal join conjunct: (t1[#1] = t2[#0]) | +| | runtime filters: RF000[min_max] <- t2[#0](3/4/2048), RF001[in_or_bloom] <- t2[#0](3/4/2048) | +| | cardinality=3 | +| | vec output tuple id: 3 | +| | output tuple id: 3 | +| | vIntermediate tuple ids: 2 | +| | hash output slot ids: 1 | +| | final projections: t1[#2] | +| | final project output tuple id: 3 | +| | distribute expr lists: t1[#1] | +| | distribute expr lists: t2[#0] | +| | | +| |----1:VEXCHANGE | +| | offset: 0 | +| | distribute expr lists: t2[#0] | +| | | +| 2:VOlapScanNode(150) | +| TABLE: test.test(test), PREAGGREGATION: ON | +| runtime filters: RF000[min_max] -> t1[#1], RF001[in_or_bloom] -> t1[#1] | +| partitions=1/1 (test) | +| tablets=2/2, tabletList=61032,61034 | +| cardinality=4, avgRowSize=0.0, numNodes=1 | +| pushAggOp=NONE | +| | +| PLAN FRAGMENT 1 | +| | +| PARTITION: HASH_PARTITIONED: t2[#0] | +| | +| HAS_COLO_PLAN_NODE: false | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 01 | +| BUCKET_SHFFULE_HASH_PARTITIONED: t2[#0] | +| | +| 0:VOlapScanNode(151) | +| TABLE: test.test2(test2), PREAGGREGATION: ON | +| partitions=1/1 (test2) | +| tablets=2/2, tabletList=61041,61043 | +| cardinality=3, avgRowSize=0.0, numNodes=1 | +| pushAggOp=NONE | ++--------------------------------------------------------------------------------------------------+ +-- The line of `runtime filters` above shows that `2:HASH JOIN` of `PLAN FRAGMENT 1` generates min_max with ID RF000 and in_or_bloom with ID RF001, +-- RF000/RF001 are used in `2:VOlapScanNode(150)` to filter unnecessary data when reading `test`.`t1`. SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2; -- Return 2 rows of results [3, 4]; @@ -255,28 +270,17 @@ SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2; -- Through the query profile (set enable_profile=true;) you can view the detailed information of the internal work of the query, -- Including whether each Runtime Filter is pushed down, waiting time, -- and the total time from prepare to receiving Runtime Filter for OLAP_SCAN_NODE. -RuntimeFilter:in: - - HasPushDownToEngine: true - - AWaitTimeCost: 0ns - - EffectTimeCost: 2.76ms +RuntimeFilter: (id = 1, type = in_or_bloomfilter): + - Info: [IsPushDown = true, RuntimeFilterState = READY, HasRemoteTarget = false, HasLocalTarget = true, Ignored = false] + - RealRuntimeFilterType: in + - InFilterSize: 3 + - always_true: 0 + - expr_filtered_rows: 0 + - expr_input_rows: 0 +-- expr_input_rows and expr_filtered_rows are both 0 because in filter directly filters the data in advance according to the key range without calculating it row by row. -- In addition, in the OLAP_SCAN_NODE of the profile, you can also view the filtering effect -- and time consumption after the Runtime Filter is pushed down. - RowsVectorPredFiltered: 9.320008M (9320008) - VectorPredEvalTime: 364.39ms ``` - -## Runtime Filter planning rules -1. Only support the generation of Runtime Filter for the equivalent conditions in the join on clause, excluding the Null-safe condition, because it may filter out the null value of the join left table. -2. Does not support pushing down Runtime Filter to the left table of left outer, full outer, and anti join; -3. Does not support src expr or target expr is constant; -4. The equality of src expr and target expr is not supported; -5. The type of src expr is not supported to be equal to `HLL` or `BITMAP`; -6. Currently only supports pushing down Runtime Filter to OlapScanNode; -7. Target expr does not support NULL-checking expressions, such as `COALESCE/IFNULL/CASE`, because when the join on clause of other joins at the upper level of the outer join contains NULL-checking expressions and a Runtime Filter is generated, this Runtime Filter is downloaded Pushing to the left table of outer join may cause incorrect results; -8. The column (slot) in target expr is not supported, and an equivalent column cannot be found in the original table; -9. Column conduction is not supported. This includes two cases: - - First, when the join on clause contains A.k = B.k and B.k = C.k, currently C.k can only be pushed down to B.k, but not to A.k; - - Second, for example, the join on clause contains Aa + Bb = Cc. If Aa can be transmitted to Ba, that is, Aa and Ba are equivalent columns, then you can replace Aa with Ba, and then you can try to push the Runtime Filter down to B ( If Aa and Ba are not equivalent columns, they cannot be pushed down to B, because target expr must be bound to the only join left table); -10. The types of Target expr and src expr must be equal, because Bloom Filter is based on hash, if the types are not equal, it will try to convert the type of target expr to the type of src expr; -11. The Runtime Filter generated by `PlanNode.Conjuncts` is not supported. Unlike HashJoinNode's `eqJoinConjuncts` and `otherJoinConjuncts`, the Runtime Filter generated by `PlanNode.Conjuncts` found in the test that it may cause incorrect results, such as ` When an IN` subquery is converted to a join, the automatically generated join on clause will be stored in `PlanNode.Conjuncts`. At this time, applying Runtime Filter may result in missing some rows in the result. --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org