On Fri, Sep 30, 2022 at 9:20 PM Zhihong Yu <z...@yugabyte.com> wrote:
> > > On Fri, Sep 30, 2022 at 8:40 PM Zhihong Yu <z...@yugabyte.com> wrote: > >> >> >> On Fri, Sep 30, 2022 at 3:44 PM Zheng Li <zhengl...@gmail.com> wrote: >> >>> Hello, >>> >>> A bloom filter provides early filtering of rows that cannot be joined >>> before they would reach the join operator, the optimization is also >>> called a semi join filter (SJF) pushdown. Such a filter can be created >>> when one child of the join operator must materialize its derived table >>> before the other child is evaluated. >>> >>> For example, a bloom filter can be created using the the join keys for >>> the build side/inner side of a hash join or the outer side of a merge >>> join, the bloom filter can then be used to pre-filter rows on the >>> other side of the join operator during the scan of the base relation. >>> The thread about “Hash Joins vs. Bloom Filters / take 2” [1] is good >>> discussion on using such optimization for hash join without going into >>> the pushdown of the filter where its performance gain could be further >>> increased. >>> >>> We worked on prototyping bloom filter pushdown for both hash join and >>> merge join. Attached is a patch set for bloom filter pushdown for >>> merge join. We also plan to send the patch for hash join once we have >>> it rebased. >>> >>> Here is a summary of the patch set: >>> 1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early >>> during the table scan instead of later on. >>> -The bloom filter is pushed down along the execution tree to >>> the target SeqScan nodes. >>> -Experiments show that this optimization can speed up Merge >>> Join by up to 36%. >>> >>> 2. The planner makes the decision to use the bloom filter based on the >>> estimated filtering rate and the expected performance gain. >>> -The planner accomplishes this by estimating four numbers per >>> variable - the total number of rows of the relation, the number of >>> distinct values for a given variable, and the minimum and maximum >>> value of the variable (when applicable). Using these numbers, the >>> planner estimates a filtering rate of a potential filter. >>> -Because actually creating and implementing the filter adds >>> more operations, there is a minimum threshold of filtering where the >>> filter would actually be useful. Based on testing, we query to see if >>> the estimated filtering rate is higher than 35%, and that informs our >>> decision to use a filter or not. >>> >>> 3. If using a bloom filter, the planner also adjusts the expected cost >>> of Merge Join based on expected performance gain. >>> >>> 4. Capability to build the bloom filter in parallel in case of >>> parallel SeqScan. This is done efficiently by populating a local bloom >>> filter for each parallel worker and then taking a bitwise OR over all >>> the local bloom filters to form a shared bloom filter at the end of >>> the parallel SeqScan. >>> >>> 5. The optimization is GUC controlled, with settings of >>> enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter. >>> >>> We found in experiments that there is a significant improvement >>> when using the bloom filter during Merge Join. One experiment involved >>> joining two large tables while varying the theoretical filtering rate >>> (TFR) between the two tables, the TFR is defined as the percentage >>> that the two datasets are disjoint. Both tables in the merge join were >>> the same size. We tested changing the TFR to see the change in >>> filtering optimization. >>> >>> For example, let’s imagine t0 has 10 million rows, which contain the >>> numbers 1 through 10 million randomly shuffled. Also, t1 has the >>> numbers 4 million through 14 million randomly shuffled. Then the TFR >>> for a join of these two tables is 40%, since 40% of the tables are >>> disjoint from the other table (1 through 4 million for t0, 10 million >>> through 14 million for t4). >>> >>> Here is the performance test result joining two tables: >>> TFR: theoretical filtering rate >>> EFR: estimated filtering rate >>> AFR: actual filtering rate >>> HJ: hash join >>> MJ Default: default merge join >>> MJ Filter: merge join with bloom filter optimization enabled >>> MJ Filter Forced: merge join with bloom filter optimization forced >>> >>> TFR EFR AFR HJ MJ Default MJ Filter MJ Filter Forced >>> >>> ------------------------------------------------------------------------------------- >>> 10 33.46 7.41 6529 22638 21949 23160 >>> 20 37.27 14.85 6483 22290 21928 21930 >>> 30 41.32 22.25 6395 22374 20718 20794 >>> 40 45.67 29.7 6272 21969 19449 19410 >>> 50 50.41 37.1 6210 21412 18222 18224 >>> 60 55.64 44.51 6052 21108 17060 17018 >>> 70 61.59 51.98 5947 21020 15682 15737 >>> 80 68.64 59.36 5761 20812 14411 14437 >>> 90 77.83 66.86 5701 20585 13171 13200 >>> Table. Execution Time (ms) vs Filtering Rate (%) for Joining Two >>> Tables of 10M Rows. >>> >>> Attached you can find figures of the same performance test and a SQL >>> script >>> to reproduce the performance test. >>> >>> The first thing to notice is that Hash Join generally is the most >>> efficient join strategy. This is because Hash Join is better at >>> dealing with small tables, and our size of 10 million is still small >>> enough where Hash Join outperforms the other join strategies. Future >>> experiments can investigate using much larger tables. >>> >>> However, comparing just within the different Merge Join variants, we >>> see that using the bloom filter greatly improves performance. >>> Intuitively, all of these execution times follow linear paths. >>> Comparing forced filtering versus default, we can see that the default >>> Merge Join outperforms Merge Join with filtering at low filter rates, >>> but after about 20% TFR, the Merge Join with filtering outperforms >>> default Merge Join. This makes intuitive sense, as there are some >>> fixed costs associated with building and checking with the bloom >>> filter. In the worst case, at only 10% TFR, the bloom filter makes >>> Merge Join less than 5% slower. However, in the best case, at 90% TFR, >>> the bloom filter improves Merge Join by 36%. >>> >>> Based on the results of the above experiments, we came up with a >>> linear equation for the performance ratio for using the filter >>> pushdown from the actual filtering rate. Based on the numbers >>> presented in the figure, this is the equation: >>> >>> T_filter / T_no_filter = 1 / (0.83 * estimated filtering rate + 0.863) >>> >>> For example, this means that with an estimated filtering rate of 0.4, >>> the execution time of merge join is estimated to be improved by 16.3%. >>> Note that the estimated filtering rate is used in the equation, not >>> the theoretical filtering rate or the actual filtering rate because it >>> is what we have during planning. In practice the estimated filtering >>> rate isn’t usually accurate. In fact, the estimated filtering rate can >>> differ from the theoretical filtering rate by as much as 17% in our >>> experiments. One way to mitigate the power loss of bloom filter caused >>> by inaccurate estimated filtering rate is to adaptively turn it off at >>> execution time, this is yet to be implemented. >>> >>> Here is a list of tasks we plan to work on in order to improve this >>> patch: >>> 1. More regression testing to guarantee correctness. >>> 2. More performance testing involving larger tables and complicated >>> query plans. >>> 3. Improve the cost model. >>> 4. Explore runtime tuning such as making the bloom filter checking >>> adaptive. >>> 5. Currently, only the best single join key is used for building the >>> Bloom filter. However, if there are several keys and we know that >>> their distributions are somewhat disjoint, we could leverage this fact >>> and use multiple keys for the bloom filter. >>> 6. Currently, Bloom filter pushdown is only implemented for SeqScan >>> nodes. However, it would be possible to allow push down to other types >>> of scan nodes. >>> 7. Explore if the Bloom filter could be pushed down through a foreign >>> scan when the foreign server is capable of handling it – which could >>> be made true for postgres_fdw. >>> 8. Better explain command on the usage of bloom filters. >>> >>> This patch set is prepared by Marcus Ma, Lyu Pan and myself. Feedback >>> is appreciated. >>> >>> With Regards, >>> Zheng Li >>> Amazon RDS/Aurora for PostgreSQL >>> >>> [1] >>> https://www.postgresql.org/message-id/flat/c902844d-837f-5f63-ced3-9f7fd222f175%402ndquadrant.com >> >> >> Hi, >> In the header of patch 1: >> >> In this prototype, the cost model is based on an assumption that there is >> a linear relationship between the performance gain from using a semijoin >> filter and the estimated filtering rate: >> % improvement to Merge Join cost = 0.83 * estimated filtering rate - >> 0.137. >> >> How were the coefficients (0.83 and 0.137) determined ? >> I guess they were based on the results of running certain workload. >> >> Cheers >> > Hi, > For patch 1: > > +bool enable_mergejoin_semijoin_filter; > +bool force_mergejoin_semijoin_filter; > > How would (enable_mergejoin_semijoin_filter = off, > force_mergejoin_semijoin_filter = on) be interpreted ? > Have you considered using one GUC which has three values: off, enabled, > forced ? > > + mergeclauses_for_sjf = get_actual_clauses(path->path_mergeclauses); > + mergeclauses_for_sjf = > get_switched_clauses(path->path_mergeclauses, > + > path->jpath.outerjoinpath->parent->relids); > > mergeclauses_for_sjf is assigned twice and I don't > see mergeclauses_for_sjf being reference in the call > to get_switched_clauses(). > Is this intentional ? > > + /* want at least 1000 rows_filtered to avoid any nasty edge > cases */ > + if (force_mergejoin_semijoin_filter || (filteringRate >= 0.35 > && rows_filtered > 1000)) > > The above condition is narrower compared to the enclosing condition. > Since there is no else block for the second if block, please merge the two > if statements. > > + int best_filter_clause; > > Normally I would think `clause` is represented by List*. But > best_filter_clause is an int. Please use another variable name so that > there is less chance of confusion. > > For evaluate_semijoin_filtering_rate(): > > + double best_sj_selectivity = 1.01; > > How was 1.01 determined ? > > + debug_sj1("SJPD: start evaluate_semijoin_filtering_rate"); > > There are debug statements in the methods. > It would be better to remove them in the next patch set. > > Cheers > Hi, Still patch 1. + if (!outer_arg_md->is_or_maps_to_base_column + && !inner_arg_md->is_or_maps_to_constant) + { + debug_sj2("SJPD: outer equijoin arg does not map %s", + "to a base column nor a constant; semijoin is not valid"); Looks like there is a typo: inner_arg_md->is_or_maps_to_constant should be outer_arg_md->is_or_maps_to_constant + if (outer_arg_md->est_col_width > MAX_SEMIJOIN_SINGLE_KEY_WIDTH) + { + debug_sj2("SJPD: outer equijoin column's width %s", + "was excessive; condition rejected"); How is the value of MAX_SEMIJOIN_SINGLE_KEY_WIDTH determined ? For verify_valid_pushdown(): + Assert(path); + Assert(target_var_no > 0); + + if (path == NULL) + { + return false; I don't understand the first assertion. Does it mean path would always be non-NULL ? Then the if statement should be dropped. + if (path->parent->relid == target_var_no) + { + /* + * Found source of target var! We know that the pushdown + * is valid now. + */ + return true; + } + return false; The above can be simplified as: return path->parent->relid == target_var_no; + * True if the given con_exprs, ref_exprs and operators will exactlty Typo: exactlty -> exactly + if (!bms_equal(all_vars, matched_vars)) + return false; + return true; The above can be simplified as: return bms_equal(all_vars, matched_vars); Cheers