I deployed bgpd on one of my core routers and triggered the fatal "bad dmetric in decision process" from time to time.
After a longer debugging session I realized that one way this happens is when nexthops become valid. The state change affects all prefixes at once, but they are then reevaluated one by one (see prefix_evaluate_all(), which is called by nexthop_runner()), so while that pass runs some prefixes already carry the new state but have not been reevaluated yet.

I currently have no good solution for this issue. I think the problem is that invalid prefixes are not sorted when added. There may be a similar issue when flipping a rib from no-evaluate to evaluate in the reload code.

For now neuter the fatalx and convert it to a log_debug() until I have figured out a proper fix.
-- 
:wq Claudio

Index: rde_decide.c
===================================================================
RCS file: /cvs/src/usr.sbin/bgpd/rde_decide.c,v
retrieving revision 1.95
diff -u -p -r1.95 rde_decide.c
--- rde_decide.c	11 Jul 2022 16:46:41 -0000	1.95
+++ rde_decide.c	16 Jul 2022 10:28:19 -0000
@@ -331,8 +331,12 @@ prefix_set_dmetric(struct prefix *pp, st
 			    PREFIX_DMETRIC_BEST : PREFIX_DMETRIC_INVALID;
 		else
 			np->dmetric = prefix_cmp(pp, np, &testall);
-		if (np->dmetric < 0)
-			fatalx("bad dmetric in decision process");
+		if (np->dmetric < 0) {
+			struct bgpd_addr addr;
+			pt_getaddr(np->pt, &addr);
+			log_debug("bad dmetric in decision process: %s/%u",
+			    log_addr(&addr), np->pt->prefixlen);
+		}
 	}
 }