Hi, At our network we use bird in all our Linux routers so they talk OSPF. Recently we added a new router appliance(not Linux) to the backbone and most of the bird routers showed wrong next hops for some routes. At first we tought that the problem might be caused by a bug in the appliance making it send bad LSAs, but debugging the ospf packets and looking at other non-bird router appliances made us certain that the fault was in bird.
After looking through the code we found that the bug is in function 'ospf_ext_spf' in file 'proto/ospf/rt.c'. The variable 'nfa' of type 'struct orta' is not totally cleaned before each start of the 'WALK_SLIST' loop, and can have a wrong value of 'nfa.nhs_reuse = 1' instead of 'nfa.nhs_reuse = 0', which is what triggers the bug. The fix is to put the declaration of 'orta nfa = {};' inside the loop, so the variable is clean at the start of each loop. Alternatively at the start of each loop you could set 'nfa.nhs_reuse = 0'. The bug only occurs when the following options are in bird.conf: ecmp yes; merge external yes; At least from what I saw the bug affects versions '2.0', '1.6.2' and '1.5'. If some of the explanation is not clear please tell me and I'll try to write it better. I'll provide a more in-depth explanation now: the relevant code is: the code is 'tagged' with 'x->' which are the relevant points for trigerring the bug static void ospf_ext_spf(struct ospf_proto *p) { ort *nf1, *nf2; 0->orta nfa = {}; ip_addr rtid; OSPF_TRACE(D_EVENTS, "Starting routing table calculation for ext routes"); WALK_SLIST(en, p->lsal) { /*not showing LSA validation*/ rtid = ipa_from_rid(en->lsa.rt); 1-> nf1 = fib_find(&atmp->rtr, &rtid, MAX_PREFIX_LENGTH); /*not showing other validations*/ if (!rt.fbit) { nf2 = nf1; 2-> nfa.nhs = nf1->n.nhs; br_metric = nf1->n.metric1; } else { nf2 = ospf_fib_route(&p->rtf, rt.fwaddr, MAX_PREFIX_LENGTH); if (!nf2) { continue; } /*not showing some validations*/ nfa.nhs = nf2->n.nhs; br_metric = nf2->n.metric1; if (has_device_nexthops(nfa.nhs)) { nfa.nhs = fix_device_nexthops(p, nfa.nhs, rt.fwaddr); 3-> nfa.nhs_reuse = 1; } } /*not showing setting other attributes in 'nfa'*/ 4-> ri_install_ext(p, rt.ip, rt.pxlen, &nfa); } } static inline void ri_install_ext(struct ospf_proto *p, ip_addr prefix, int pxlen, const orta *new) { 5->ort *old = (ort *) fib_get(&p->rtf, &prefix, pxlen); int cmp = orta_compare_ext(p, new, &old->n); if (cmp > 0) { ort_replace(old, new); } else if (cmp == 0) { 6-> ort_merge_ext(p, old, new); } } static void ort_merge_ext(struct ospf_proto *p, ort *o, const orta *new) { orta *old = &o->n; if (old->nhs != new->nhs) { 7-> old->nhs = mpnh_merge(old->nhs, new->nhs, old->nhs_reuse, new- >nhs_reuse, p->ecmp, p->nhpool); old->nhs_reuse = 1; } /* not showing code that sets other attributes */ } struct mpnh * mpnh_merge(struct mpnh *x, struct mpnh *y, int rx, int ry, int max, linpool *lp) { struct mpnh *root = NULL; struct mpnh **n = &root; while ((x || y) && max--) { int cmp = mpnh_compare_node(x, y); if (cmp < 0) { *n = rx ? x : mpnh_copy_node(x, lp); x = x->next; } else if (cmp > 0) { 8-> *n = ry ? y : mpnh_copy_node(y, lp); y = y->next; } else { 9-> *n = rx ? x : (ry ? y : mpnh_copy_node(x, lp)); x = x->next; y = y->next; } n = &((*n)->next); } *n = NULL; return root; } conditions to trigger the bug: * while in the loop there must be a LSA that has a forwading address which routes satisfy the conditions to trigger 'has_device_nexthops' so 'nfa.nhs_reuse' is set to 1, this corresponds to '3->'. Afterwards the bug could be trigger in any LSA that has the following properties: * LSA forwarding address equal to 0.0.0.0; so we set 'nfa' to the route to the router that sent the LSA. This last route is the one might be corrupted. This corresponds to '1->' and '2->'; * the route passes the validations to be installed in the OSPF routing table. This corresponds to '4->' * the old route that we had for the destination on the LSA is equally good to the new route that we are installing so we decide to merge them. This corresponds to '5->' and '6->' * The routes aren't the same instance so we decide to merge them. '7->' * While merging the next hop list we must compare at least one next hop from the 'new' route higher than a next hop from the 'old' route so we enter '8->' and wrongly not copy the next hop because 'ry = 1' due to 'nfa.nhs_reuse = 1'. This might also happen in '9->' if the next hops compare equal, but I haven't checked.