Hi,
Elisa and I were looking at the production-pilot logs last night and
noticed the following:
Mar 10 04:41:45 radix-new bgpd[25100]: neighbor 2001:7f8:1::a501:6265:2
(LEASEWEB-v6-02) AS16265: withdraw 2001:1af8::/32
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a504:8345:1
(XSNEWS-v6-01): received notification: error in UPDATE message,
attribute list error
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a504:8345:1
(XSNEWS-v6-01): state change Established -> Idle, reason: NOTIFICATION
received
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a500:1200:2
(AS1200-v6-02): received notification: error in UPDATE message,
attribute list error
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a500:1200:2
(AS1200-v6-02): state change Established -> Idle, reason: NOTIFICATION
received
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a500:1200:1
(AS1200-v6-01): received notification: error in UPDATE message,
attribute list error
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a500:1200:1
(AS1200-v6-01): state change Established -> Idle, reason: NOTIFICATION
received
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a504:8345:2
(XSNEWS-v6-02): received notification: error in UPDATE message,
attribute list error
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a504:8345:2
(XSNEWS-v6-02): state change Established -> Idle, reason: NOTIFICATION
received
So this happened at a time when nobody was working on the route server.
As you can see, LEASEWEB-v6-02 withdraws a prefix, which takes down the
sessions with the Foundry-based routers (both XSNEWS and AS1200). This led
us to believe that the bug was somewhere in the withdraw code. So we dug
through the code and found the following function: up_generate_updates
in rde_update.c.
I commented out the following lines to disable the advertisement of withdraws:
/* withdraw prefix */
up_generate(peer, NULL, &addr, old->prefix->prefixlen);
After this, I was unable to trigger the bug. So I dug deeper. I
re-enabled the withdraw (undoing the change above) and commented out the
following code:
	switch (up_test_update(peer, new)) {
	case 1:
		break;
	case 0:
		/* up_generate_updates(rules, peer, NULL, old); */
		return;
	case -1:
		return;
	}
This also fixed the problem, so I dug deeper into up_test_update, first
undoing the above.
I commented out the following line:
	if (p == NULL)
		/* no prefix available */
		/* return (0); */
This also fixed the problem. So now I can't dig any deeper. I'm just
wondering why an update with an empty prefix would be generated. So for
now this is a quick and dirty fix for the problem. Once Claudio has some
more time to dig into this, I hope there will be a real fix. I would
look for the problem in rde_generate_updates, since it is the only place
besides the startup that calls up_generate_updates.
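To make the call chain described above easier to follow, here is a toy
model of the decision flow. It is only a sketch and not bgpd's code:
every toy_* name, the struct fields, and the meaning I attach to the
return values 1, 0 and -1 are assumptions made for illustration. The
check for an empty old prefix is likewise just a guess at what a real
fix might want to guard against, not a confirmed cause.

```c
#include <stddef.h>

/* Hypothetical stand-in for a prefix; NOT bgpd's real structure. */
struct toy_prefix {
    const char *text;       /* e.g. "2001:1af8::/32" */
    int         from_peer;  /* assumed: learned from the peer we send to */
    int         exportable; /* assumed: 0 if it must not be announced */
};

enum toy_action { TOY_NOTHING, TOY_ANNOUNCE, TOY_WITHDRAW };

/*
 * Toy version of the test step. Return values mirror the switch in the
 * mail: 1 = announce, 0 = not eligible (fall back to withdrawing the
 * old prefix), -1 = send nothing at all.
 */
static int
toy_test_update(const struct toy_prefix *p)
{
    if (p == NULL)
        return 0;           /* no prefix available */
    if (p->from_peer)
        return -1;          /* assumed: never echo a route back */
    if (!p->exportable)
        return 0;
    return 1;
}

/*
 * Toy version of the generate step. With new == NULL it takes the
 * withdraw path; with an ineligible new prefix it calls itself with
 * new == NULL, as the commented-out line in the mail did.
 */
static enum toy_action
toy_generate_updates(const struct toy_prefix *new,
    const struct toy_prefix *old)
{
    if (new == NULL) {
        /* Assumed guard: without an old prefix there is nothing to withdraw. */
        if (old == NULL)
            return TOY_NOTHING;
        return TOY_WITHDRAW;
    }

    switch (toy_test_update(new)) {
    case 1:
        return TOY_ANNOUNCE;
    case 0:
        return toy_generate_updates(NULL, old);
    default:
        return TOY_NOTHING;
    }
}
```

In this model, toy_generate_updates(NULL, NULL) yields TOY_NOTHING, so a
withdraw is never built without a prefix to put in it. Whether the real
code can reach the withdraw path in that state is exactly the open
question.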
Kind regards,
Arnoud
On 3/9/09 8:18 PM, Elisa Jasinska wrote:
> Hi Henning and Claudio,
>
> Claudio Jeker wrote:
>
>> Btw. does this only happen with full IPv6 feeds or are a few
>> announcements already enough?
>>
>
> We have two test setups. One actually includes real peers, none sending
> a full table though. The other one is a setup in our lab, with various
> routers we could find, which only send a couple of routes to each other.
>
> We have seen this happening if the peer we 'clear' announces at least
> one prefix to the route server, so there is actually something to update.
>
> The behavior is different in the two setups though.
>
> With the real peers: multiple sessions go Idle upon 'clearing' one
> session and the broken UPDATE that gets sent out with that, but they all
> come up again after a while.
>
> In the lab: the Idle sessions never come up completely, because the
> broken UPDATE seems to be sent out repeatedly, causing the peer to go
> back to Idle immediately every time we reach an Established state.
>
> Henning Brauer wrote:
>
>> wait. removing tcpmd5 fixes the problem? you gotta be kidding?
>> this is on OpenBSD right?
>>
>>
>
> Sorry, this was a wrong assumption we made based on your previous post
> that there might be something wrong with it (and too many changes in our
> config at the same time ;)
>
> We are still busy with doing one change at a time now and trying to
> figure out what in the config actually causes this to happen. Once we
> get any conclusive results from this we will get back to you.
>
> Thanks a lot for your help!
>
> Regards
> Elisa