Hi, Elisa and I were looking at the production-pilot logs last night and noticed the following:

Mar 10 04:41:45 radix-new bgpd[25100]: neighbor 2001:7f8:1::a501:6265:2 (LEASEWEB-v6-02) AS16265: withdraw 2001:1af8::/32
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a504:8345:1 (XSNEWS-v6-01): received notification: error in UPDATE message, attribute list error
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a504:8345:1 (XSNEWS-v6-01): state change Established -> Idle, reason: NOTIFICATION received
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a500:1200:2 (AS1200-v6-02): received notification: error in UPDATE message, attribute list error
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a500:1200:2 (AS1200-v6-02): state change Established -> Idle, reason: NOTIFICATION received
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a500:1200:1 (AS1200-v6-01): received notification: error in UPDATE message, attribute list error
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a500:1200:1 (AS1200-v6-01): state change Established -> Idle, reason: NOTIFICATION received
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a504:8345:2 (XSNEWS-v6-02): received notification: error in UPDATE message, attribute list error
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a504:8345:2 (XSNEWS-v6-02): state change Established -> Idle, reason: NOTIFICATION received

So this happened at a time when nobody was working on the route server.
As you can see, LEASEWEB-v6-02 withdraws a prefix, which drops the
sessions with the Foundry-based routers (both XSNEWS and AS1200).

This led us to believe that the bug was somewhere in the withdraw code.
So we dug through the code and found the following function:

    up_generate_updates in rde_update.c

I commented out the following lines to disable the advertisement of
withdraws:

    /* withdraw prefix */
    up_generate(peer, NULL, &addr, old->prefix->prefixlen);

After this, I was unable to trigger the bug. So I dug deeper.

I re-enabled the withdraw (undoing the change above) and commented out
the recursive call in the following code:

    switch (up_test_update(peer, new)) {
    case 1:
            break;
    case 0:
            /* up_generate_updates(rules, peer, NULL, old); */
            return;
    case -1:
            return;
    }

This also fixed the problem, so I dug deeper into up_test_update, first
undoing the change above. There I commented out the following return:

    if (p == NULL)
            /* no prefix available */
            /* return (0); */

This also fixed the problem. So now I can't dig any deeper. I'm just
wondering why an update with an empty prefix would be generated.

So for now this is a quick and dirty fix for the problem. Once Claudio
has some more time to dig into this, I hope there will be a real fix.
I would look for the problem in rde_generate_updates, since it is the
only place besides the startup that calls up_generate_updates.

Kind regards,

Arnoud

On 3/9/09 8:18 PM, Elisa Jasinska wrote:
> Hi Henning and Claudio,
>
> Claudio Jeker wrote:
>
>> Btw. does this only happen with full IPv6 feeds or are a few
>> announcements already enough?
>>
>
> We have two test setups. One actually includes real peers, none sending
> a full table though. The other one is a setup in our lab, with various
> routers we could find, which only send a couple of routes to each other.
>
> We have seen this happening if the peer we 'clear' announces at least
> one prefix to the route server, so there is actually something to update.
>
> The behavior is different in the two setups though.
>
> With the real peers: multiple sessions go Idle upon 'clearing' one
> session and the broken UPDATE that gets sent out with that, but they all
> come up again after a while.
>
> In the lab: the Idle sessions never come up completely, because the
> broken UPDATE seems to be sent out repeatedly, causing the peer to go
> back to Idle immediately every time we reach an Established state.
>
> Henning Brauer wrote:
>
>> wait. removing tcpmd5 fixes the problem? you gotta be kidding?
>> this is on OpenBSD right?
>>
>>
>
> Sorry, this was a wrong assumption we made based on your previous post
> that there might be something wrong with it (and too many changes in our
> config at the same time ;)
>
> We are still busy with doing one change at a time now and trying to
> figure out what in the config actually causes this to happen. Once we
> get any conclusive results from this we will get back to you.
>
> Thanks a lot for your help!
>
> Regards
> Elisa
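
For reference, here is a toy model of the call path described at the top
of this mail. It is only a sketch and not the bgpd code: every toy_* name
is made up, the "verdict" argument merely stands in for the return value
of up_test_update (1, 0 or -1), and it assumes that a NULL "new" prefix
selects the withdraw branch of up_generate_updates. The third experiment
(the commented-out return in up_test_update) is left out because its
effect depends on code not shown here. What the sketch illustrates is
that the first two experiments cut the same path: a verdict of 0 falls
back to a withdraw of the old prefix, and disabling either the fallback
or the withdraw itself suppresses that message.

```c
#include <stddef.h>

/* Hypothetical stand-in for a bgpd prefix; not the real struct. */
struct toy_prefix {
	int prefixlen;
};

/* Which experiment is active (all zero = unmodified flow). */
struct toy_patch {
	int skip_withdraw;	/* up_generate() withdraw call commented out */
	int skip_fallback;	/* recursive call in "case 0" commented out */
};

enum toy_result { TOY_NOTHING, TOY_ANNOUNCE, TOY_WITHDRAW };

/*
 * Toy dispatch. "verdict" plays the role of the up_test_update()
 * return value (1, 0 or -1) for a non-NULL new prefix.
 */
static enum toy_result
toy_generate_updates(const struct toy_patch *patch, int verdict,
    const struct toy_prefix *newp, const struct toy_prefix *oldp)
{
	if (newp == NULL) {
		/* Assumption: a NULL new prefix selects the withdraw branch. */
		/* Assumption: without an old prefix there is nothing to withdraw. */
		if (oldp == NULL || patch->skip_withdraw)
			return (TOY_NOTHING);
		return (TOY_WITHDRAW);
	}

	switch (verdict) {
	case 1:
		return (TOY_ANNOUNCE);
	case 0:
		if (patch->skip_fallback)
			return (TOY_NOTHING);
		/* Same recursion as quoted in the mail: retry with new == NULL. */
		return (toy_generate_updates(patch, verdict, NULL, oldp));
	default:
		return (TOY_NOTHING);
	}
}
```

With an all-zero toy_patch and a verdict of 0 the function reports
TOY_WITHDRAW; setting either skip_withdraw or skip_fallback turns the
same call into TOY_NOTHING, which matches the observation that both
edits made the problem disappear without telling us which step builds
the bad UPDATE.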