I deployed bgpd on one of my core routers and it triggered the fatal
"bad dmetric in decision process" error from time to time.

After a longer debugging session I realized that one cause is nexthops
becoming valid. The state change affects all prefixes using that nexthop
at once, but they are then reevaluated one by one (see
prefix_evaluate_all(), which is called by nexthop_runner()).

I currently have no good solution for this issue. I think the problem is
that invalid prefixes are not sorted when added. There may be a similar
issue when flipping a rib from no-evaluate to evaluate in the reload code.

For now, neuter the fatalx() and convert it to a log_debug() until I have
figured out a proper fix.
-- 
:wq Claudio

Index: rde_decide.c
===================================================================
RCS file: /cvs/src/usr.sbin/bgpd/rde_decide.c,v
retrieving revision 1.95
diff -u -p -r1.95 rde_decide.c
--- rde_decide.c        11 Jul 2022 16:46:41 -0000      1.95
+++ rde_decide.c        16 Jul 2022 10:28:19 -0000
@@ -331,8 +331,12 @@ prefix_set_dmetric(struct prefix *pp, st
                            PREFIX_DMETRIC_BEST : PREFIX_DMETRIC_INVALID;
                else
                        np->dmetric = prefix_cmp(pp, np, &testall);
-               if (np->dmetric < 0)
-                       fatalx("bad dmetric in decision process");
+               if (np->dmetric < 0) {
+                       struct bgpd_addr addr;
+                       pt_getaddr(np->pt, &addr);
+                       log_debug("bad dmetric in decision process: %s/%u",
+                           log_addr(&addr), np->pt->prefixlen);
+               }
        }
 }
 
