Hello,

I met an assertion failure, and identified the root of the problem, but no
idea how to fix it.

The location of the problematic Assert() is at cost_memoize_rescan() to
check 'hit_ratio' is between 0.0 and 1.0.
The 'calls' is provided by the caller, and 'ndistinct' is the result
of estimate_num_groups().

#4  0x000000000084d583 in cost_memoize_rescan (root=0x2e95748,
mpath=0x30aece8, rescan_startup_cost=0x7ffd72141260,
rescan_total_cost=0x7ffd72141258) at costsize.c:2564
/home/kaigai/source/pgsql-16/src/backend/optimizer/path/costsize.c:2564:83932:beg:0x84d583
(gdb) l
2559             * how many of those scans we expect to get a cache hit.
2560             */
2561            hit_ratio = ((calls - ndistinct) / calls) *
2562                    (est_cache_entries / Max(ndistinct,
est_cache_entries));
2563
2564            Assert(hit_ratio >= 0 && hit_ratio <= 1.0);
2565
2566            /*
2567             * Set the total_cost accounting for the expected cache hit
ratio.  We
2568             * also add on a cpu_operator_cost to account for a cache
lookup. This

(gdb) bt
#0  0x00007f3a39aa154c in __pthread_kill_implementation () from
/lib64/libc.so.6
#1  0x00007f3a39a54d06 in raise () from /lib64/libc.so.6
#2  0x00007f3a39a287f3 in abort () from /lib64/libc.so.6
#3  0x0000000000b6ff2c in ExceptionalCondition (conditionName=0xd28c28
"hit_ratio >= 0 && hit_ratio <= 1.0", fileName=0xd289a4 "costsize.c",
lineNumber=2564) at assert.c:66
#4  0x000000000084d583 in cost_memoize_rescan (root=0x2e95748,
mpath=0x30aece8, rescan_startup_cost=0x7ffd72141260,
rescan_total_cost=0x7ffd72141258) at costsize.c:2564
#5  0x0000000000850831 in cost_rescan (root=0x2e95748, path=0x30aece8,
rescan_startup_cost=0x7ffd72141260, rescan_total_cost=0x7ffd72141258) at
costsize.c:4350
#6  0x000000000084e333 in initial_cost_nestloop (root=0x2e95748,
workspace=0x7ffd721412d0, jointype=JOIN_INNER, outer_path=0x3090058,
inner_path=0x30aece8, extra=0x7ffd72141500) at costsize.c:2978
#7  0x0000000000860f58 in try_partial_nestloop_path (root=0x2e95748,
joinrel=0x30ae158, outer_path=0x3090058, inner_path=0x30aece8,
pathkeys=0x0, jointype=JOIN_INNER, extra=0x7ffd72141500) at joinpath.c:887
#8  0x0000000000862a64 in consider_parallel_nestloop (root=0x2e95748,
joinrel=0x30ae158, outerrel=0x308f428, innerrel=0x2eac390,
jointype=JOIN_INNER, extra=0x7ffd72141500) at joinpath.c:2083
#9  0x000000000086273d in match_unsorted_outer (root=0x2e95748,
joinrel=0x30ae158, outerrel=0x308f428, innerrel=0x2eac390,
jointype=JOIN_INNER, extra=0x7ffd72141500) at joinpath.c:1940
#10 0x00000000008600f0 in add_paths_to_joinrel (root=0x2e95748,
joinrel=0x30ae158, outerrel=0x308f428, innerrel=0x2eac390,
jointype=JOIN_INNER, sjinfo=0x7ffd721415f0, restrictlist=0x30ae5a8) at
joinpath.c:296
#11 0x0000000000864d10 in populate_joinrel_with_paths (root=0x2e95748,
rel1=0x308f428, rel2=0x2eac390, joinrel=0x30ae158, sjinfo=0x7ffd721415f0,
restrictlist=0x30ae5a8) at joinrels.c:925
#12 0x00000000008649e1 in make_join_rel (root=0x2e95748, rel1=0x308f428,
rel2=0x2eac390) at joinrels.c:776
#13 0x0000000000863ec1 in make_rels_by_clause_joins (root=0x2e95748,
old_rel=0x308f428, other_rels_list=0x3088ed0, other_rels=0x3088ee8) at
joinrels.c:312
#14 0x000000000086399a in join_search_one_level (root=0x2e95748, level=3)
at joinrels.c:123
#15 0x00000000008463f8 in standard_join_search (root=0x2e95748,
levels_needed=4, initial_rels=0x3088ed0) at allpaths.c:3454
#16 0x000000000084636d in make_rel_from_joinlist (root=0x2e95748,
joinlist=0x306b4f8) at allpaths.c:3385
#17 0x0000000000841548 in make_one_rel (root=0x2e95748, joinlist=0x306b4f8)
at allpaths.c:229
#18 0x00000000008806a9 in query_planner (root=0x2e95748,
qp_callback=0x886bcb <standard_qp_callback>, qp_extra=0x7ffd72141960) at
planmain.c:278
#19 0x0000000000882f5f in grouping_planner (root=0x2e95748,
tuple_fraction=0) at planner.c:1495
#20 0x000000000088268c in subquery_planner (glob=0x2e95348,
parse=0x2e90e98, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at
planner.c:1064
#21 0x0000000000880cdb in standard_planner (parse=0x2e90e98,
    query_string=0x2e3a0e8 "explain\nselect sum(lo_revenue), d_year,
p_brand1\n  from lineorder, date1, part, supplier\n  where lo_orderdate =
d_datekey\n    and lo_partkey = p_partkey\n    and lo_suppkey = s_suppkey\n
   and p_brand1"..., cursorOptions=2048,
    boundParams=0x0) at planner.c:413

I tracked the behavior of estimate_num_groups() using gdb line-by-line to
observe how 'input_rows' is changed
and how it affects the result value.
According to the call trace, the problematic estimate_num_groups()
invocation is called with "input_rows=3251872.916666667",
then it was rounded up to 3251873 by the clamp_row_est(). Eventually, its
result value was calculated larger than the upper
limit, so the return value was suppressed by 3251873, but it is a tiny bit
larger than the input value!

Back to the cost_memoize_rescan().
The hit_ratio is calculated as follows:

    hit_ratio = ((calls - ndistinct) / calls) *
        (est_cache_entries / Max(ndistinct, est_cache_entries));

The "calls" is the "input_rows" above, and "ndistinct"  is the return value
of the estimate_num_groups().
What happen if "ndistinct" is a tiny bit larger than "calls"?
In the results, the "hit_ratio" is calculated as a very small negative
value, then it was terminated by Assert().

How do we fix the logic? Please some ideas.

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kai...@heterodb.com>

Reply via email to