Hello, I met an assertion failure, and identified the root of the problem, but no idea how to fix it.
The location of the problematic Assert() is at cost_memoize_rescan() to check 'hit_ratio' is between 0.0 and 1.0. The 'calls' is provided by the caller, and 'ndistinct' is the result of estimate_num_groups(). #4 0x000000000084d583 in cost_memoize_rescan (root=0x2e95748, mpath=0x30aece8, rescan_startup_cost=0x7ffd72141260, rescan_total_cost=0x7ffd72141258) at costsize.c:2564 /home/kaigai/source/pgsql-16/src/backend/optimizer/path/costsize.c:2564:83932:beg:0x84d583 (gdb) l 2559 * how many of those scans we expect to get a cache hit. 2560 */ 2561 hit_ratio = ((calls - ndistinct) / calls) * 2562 (est_cache_entries / Max(ndistinct, est_cache_entries)); 2563 2564 Assert(hit_ratio >= 0 && hit_ratio <= 1.0); 2565 2566 /* 2567 * Set the total_cost accounting for the expected cache hit ratio. We 2568 * also add on a cpu_operator_cost to account for a cache lookup. This (gdb) bt #0 0x00007f3a39aa154c in __pthread_kill_implementation () from /lib64/libc.so.6 #1 0x00007f3a39a54d06 in raise () from /lib64/libc.so.6 #2 0x00007f3a39a287f3 in abort () from /lib64/libc.so.6 #3 0x0000000000b6ff2c in ExceptionalCondition (conditionName=0xd28c28 "hit_ratio >= 0 && hit_ratio <= 1.0", fileName=0xd289a4 "costsize.c", lineNumber=2564) at assert.c:66 #4 0x000000000084d583 in cost_memoize_rescan (root=0x2e95748, mpath=0x30aece8, rescan_startup_cost=0x7ffd72141260, rescan_total_cost=0x7ffd72141258) at costsize.c:2564 #5 0x0000000000850831 in cost_rescan (root=0x2e95748, path=0x30aece8, rescan_startup_cost=0x7ffd72141260, rescan_total_cost=0x7ffd72141258) at costsize.c:4350 #6 0x000000000084e333 in initial_cost_nestloop (root=0x2e95748, workspace=0x7ffd721412d0, jointype=JOIN_INNER, outer_path=0x3090058, inner_path=0x30aece8, extra=0x7ffd72141500) at costsize.c:2978 #7 0x0000000000860f58 in try_partial_nestloop_path (root=0x2e95748, joinrel=0x30ae158, outer_path=0x3090058, inner_path=0x30aece8, pathkeys=0x0, jointype=JOIN_INNER, extra=0x7ffd72141500) at joinpath.c:887 #8 0x0000000000862a64 in consider_parallel_nestloop (root=0x2e95748, joinrel=0x30ae158, outerrel=0x308f428, innerrel=0x2eac390, jointype=JOIN_INNER, extra=0x7ffd72141500) at joinpath.c:2083 #9 0x000000000086273d in match_unsorted_outer (root=0x2e95748, joinrel=0x30ae158, outerrel=0x308f428, innerrel=0x2eac390, jointype=JOIN_INNER, extra=0x7ffd72141500) at joinpath.c:1940 #10 0x00000000008600f0 in add_paths_to_joinrel (root=0x2e95748, joinrel=0x30ae158, outerrel=0x308f428, innerrel=0x2eac390, jointype=JOIN_INNER, sjinfo=0x7ffd721415f0, restrictlist=0x30ae5a8) at joinpath.c:296 #11 0x0000000000864d10 in populate_joinrel_with_paths (root=0x2e95748, rel1=0x308f428, rel2=0x2eac390, joinrel=0x30ae158, sjinfo=0x7ffd721415f0, restrictlist=0x30ae5a8) at joinrels.c:925 #12 0x00000000008649e1 in make_join_rel (root=0x2e95748, rel1=0x308f428, rel2=0x2eac390) at joinrels.c:776 #13 0x0000000000863ec1 in make_rels_by_clause_joins (root=0x2e95748, old_rel=0x308f428, other_rels_list=0x3088ed0, other_rels=0x3088ee8) at joinrels.c:312 #14 0x000000000086399a in join_search_one_level (root=0x2e95748, level=3) at joinrels.c:123 #15 0x00000000008463f8 in standard_join_search (root=0x2e95748, levels_needed=4, initial_rels=0x3088ed0) at allpaths.c:3454 #16 0x000000000084636d in make_rel_from_joinlist (root=0x2e95748, joinlist=0x306b4f8) at allpaths.c:3385 #17 0x0000000000841548 in make_one_rel (root=0x2e95748, joinlist=0x306b4f8) at allpaths.c:229 #18 0x00000000008806a9 in query_planner (root=0x2e95748, qp_callback=0x886bcb <standard_qp_callback>, qp_extra=0x7ffd72141960) at planmain.c:278 #19 0x0000000000882f5f in grouping_planner (root=0x2e95748, tuple_fraction=0) at planner.c:1495 #20 0x000000000088268c in subquery_planner (glob=0x2e95348, parse=0x2e90e98, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:1064 #21 0x0000000000880cdb in standard_planner (parse=0x2e90e98, query_string=0x2e3a0e8 "explain\nselect sum(lo_revenue), d_year, p_brand1\n from lineorder, date1, part, supplier\n where lo_orderdate = d_datekey\n and lo_partkey = p_partkey\n and lo_suppkey = s_suppkey\n and p_brand1"..., cursorOptions=2048, boundParams=0x0) at planner.c:413 I tracked the behavior of estimate_num_groups() using gdb line-by-line to observe how 'input_rows' is changed and how it affects the result value. According to the call trace, the problematic estimate_num_groups() invocation is called with "input_rows=3251872.916666667", then it was rounded up to 3251873 by the clamp_row_est(). Eventually, its result value was calculated larger than the upper limit, so the return value was suppressed by 3251873, but it is a tiny bit larger than the input value! Back to the cost_memoize_rescan(). The hit_ratio is calculated as follows: hit_ratio = ((calls - ndistinct) / calls) * (est_cache_entries / Max(ndistinct, est_cache_entries)); The "calls" is the "input_rows" above, and "ndistinct" is the return value of the estimate_num_groups(). What happen if "ndistinct" is a tiny bit larger than "calls"? In the results, the "hit_ratio" is calculated as a very small negative value, then it was terminated by Assert(). How do we fix the logic? Please some ideas. Best regards, -- HeteroDB, Inc / The PG-Strom Project KaiGai Kohei <kai...@heterodb.com>