Hi,

Elisa and I were looking at the production-pilot logs last night and 
noticed the following:

Mar 10 04:41:45 radix-new bgpd[25100]: neighbor 2001:7f8:1::a501:6265:2 
(LEASEWEB-v6-02) AS16265: withdraw 2001:1af8::/32
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a504:8345:1 
(XSNEWS-v6-01): received notification: error in UPDATE message, 
attribute list error
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a504:8345:1 
(XSNEWS-v6-01): state change Established -> Idle, reason: NOTIFICATION 
received
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a500:1200:2 
(AS1200-v6-02): received notification: error in UPDATE message, 
attribute list error
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a500:1200:2 
(AS1200-v6-02): state change Established -> Idle, reason: NOTIFICATION 
received
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a500:1200:1 
(AS1200-v6-01): received notification: error in UPDATE message, 
attribute list error
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a500:1200:1 
(AS1200-v6-01): state change Established -> Idle, reason: NOTIFICATION 
received
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a504:8345:2 
(XSNEWS-v6-02): received notification: error in UPDATE message, 
attribute list error
Mar 10 04:41:45 radix-new bgpd[12120]: neighbor 2001:7f8:1::a504:8345:2 
(XSNEWS-v6-02): state change Established -> Idle, reason: NOTIFICATION 
received

So this happened at a time when nobody was working on the route server. 
As you can see, LEASEWEB-v6-02 withdraws a prefix, which takes down the 
sessions to the Foundry-based routers (both XSNEWS and AS1200). This led 
us to believe that the bug was somewhere in the withdraw code, so we dug 
through the code and found the following function: up_generate_updates 
in rde_update.c

I commented out the following lines to disable the advertisement of withdraws:

/* withdraw prefix */
         up_generate(peer, NULL, &addr, old->prefix->prefixlen);

After this, I was unable to trigger the bug, so I dug deeper. I 
re-enabled the withdraw (undoing the change above) and commented out the 
following code:

switch (up_test_update(peer, new)) {
         case 1:
             break;
         case 0:
             /* up_generate_updates(rules, peer, NULL, old); */
             return;
         case -1:
             return;
         }

This also fixed the problem, so I dug deeper into up_test_update, first 
undoing the above.

I commented out the following line:

if (p == NULL)
         /* no prefix available */
         /* return (0); */
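
One C detail is worth keeping in mind with this last edit: a comment is 
not a statement, so once the return is commented out, the if no longer 
has its own body and instead guards whatever statement comes next. The 
sketch below only illustrates that language rule. The function names, 
the void pointer, and the same_peer check are hypothetical stand-ins; 
the statement that really follows in up_test_update is not shown in this 
mail.

```c
#include <stddef.h>

/*
 * Original shape: a NULL prefix returns 0 before anything else runs.
 * same_peer stands in for whatever check follows in the real function.
 */
int
test_original(const void *p, int same_peer)
{
	if (p == NULL)
		/* no prefix available */
		return (0);

	if (same_peer)
		return (0);
	return (1);
}

/*
 * Edited shape: with the return commented out, the first if has no
 * statement of its own, so it now guards the second if.
 */
int
test_edited(const void *p, int same_peer)
{
	if (p == NULL)
		/* no prefix available */
		/* return (0); */

	if (same_peer)		/* only evaluated when p == NULL */
		return (0);
	return (1);
}
```

In this toy, the edit does two things at once: a NULL p no longer 
returns 0 unconditionally, and the following check is skipped for every 
non-NULL p. Whether the same holds in bgpd depends on the statement that 
actually follows the commented-out return.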

This also fixed the problem. So now I can't dig any deeper. I'm just 
wondering why an update with an empty prefix would be generated. So for 
now this is a quick and dirty fix for the problem. Once Claudio has some 
more time to dig into this, I hope there will be a real fix. I would 
look for the problem in rde_generate_updates, since it is the only place 
besides the startup that calls up_generate_updates.
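
To make the control flow discussed above easier to follow, here is a toy 
model of it. It is a sketch under assumptions, not bgpd's code: the 
struct fields, the counters, the condition that yields -1, the check for 
a missing old prefix, and the split on a NULL new prefix are all 
hypothetical. Only the three-way switch on up_test_update and the 
"p == NULL returns 0" test come from the fragments quoted in this mail. 
The parameters newp and oldp correspond to new and old.

```c
#include <stddef.h>

/* Toy stand-ins; these do not match bgpd's real definitions. */
struct prefix { int eligible; };
struct peer { int up; int announces; int withdraws; };

/* 1: send the prefix, 0: do not send it, -1: do nothing at all. */
int
up_test_update(const struct peer *peer, const struct prefix *p)
{
	if (!peer->up)		/* assumed reason for -1 */
		return (-1);
	if (p == NULL)
		/* no prefix available */
		return (0);
	return (p->eligible ? 1 : 0);
}

void
up_generate_updates(struct peer *peer, const struct prefix *newp,
    const struct prefix *oldp)
{
	if (newp == NULL) {
		/* withdraw path; ignoring a missing old prefix is assumed */
		if (peer->up && oldp != NULL)
			peer->withdraws++;
		return;
	}

	switch (up_test_update(peer, newp)) {
	case 1:
		break;
	case 0:
		/* new prefix may not be sent: withdraw the old one instead */
		up_generate_updates(peer, NULL, oldp);
		return;
	case -1:
		return;
	}

	peer->announces++;	/* stands in for queueing an announcement */
}
```

In this model, commenting out the recursive call in case 0 means a 
rejected new prefix no longer triggers a withdraw of the old one, which 
matches the second experiment described above.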

Kind regards,

Arnoud

On 3/9/09 8:18 PM, Elisa Jasinska wrote:
> Hi Henning and Claudio,
>
> Claudio Jeker wrote:
>    
>> Btw. does this only happen with full IPv6 feeds or are a few
>> announcements already enough?
>>      
>
> We have two test setups. One actually includes real peers, none sending
> a full table though. The other one is a setup in our lab, with various
> routers we could find, which only send a couple of routes to each other.
>
> We have seen this happening if the peer we 'clear' announces at least
> one prefix to the route server, so there is actually something to update.
>
> The behavior is different in the two setups though.
>
> With the real peers: multiple sessions go Idle upon 'clearing' one
> session and the broken UPDATE that gets sent out with that, but they all
> come up again after a while.
>
> In the lab: the Idle sessions never come up completely, because the
> broken UPDATE seems to be sent out repeatedly, causing the peer to go
> back to Idle immediately every time we reach an Established state.
>
> Henning Brauer wrote:
>    
>> wait. removing tcpmd5 fixes the problem? you gotta be kidding?
>> this is on OpenBSD right?
>>
>>      
>
> Sorry, this was a wrong assumption we made based on your previous post
> that there might be something wrong with it (and too many changes in our
> config at the same time ;)
>
> We are still busy with doing one change at a time now and trying to
> figure out what in the config actually causes this to happen. Once we
> get any conclusive results from this we will get back to you.
>
> Thanks a lot for your help!
>
> Regards
> Elisa
