Hi Remi,

Thank you very much for taking the time to look into this issue.


I was afraid you wouldn't be able to reproduce it, since I can no longer
reproduce it at that location either.


Regarding your comment:

"

The result was that for a short period of time I had two fib entries for
the service IP with different next hops on the FW. After the inactivity
timeout expires the first route is removed.

"

That's what I would expect to happen, but the route didn't disappear from
the FIB at all; it was stuck there.

But even in that case, I shouldn't get the "File exists" error: since the
next hop of the new route is different from that of the previous route, it
should be installed, at least as a multipath route.
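
For reference, here is a small Python sketch that decodes the ospfd log line from the original report. It assumes the log format shown in that report and maps the action number to the routing message names from OpenBSD's <net/route.h> (action 1 being RTM_ADD, as Remi noted). The function name decode_send_rtmsg is made up for illustration.

```python
import errno
import os
import re

# Routing message types from OpenBSD's <net/route.h> (only the ones relevant here).
RTM_NAMES = {1: "RTM_ADD", 2: "RTM_DELETE", 3: "RTM_CHANGE"}

# Matches the tail of an ospfd log line such as:
#   send_rtmsg: action 1, prefix 10.250.250.153/32: File exists
LOG_RE = re.compile(
    r"send_rtmsg: action (?P<action>\d+), prefix (?P<prefix>[0-9./]+): (?P<error>.+)$"
)


def decode_send_rtmsg(line):
    """Return (action name, prefix, error text), or None if the line does not match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    action = RTM_NAMES.get(int(m.group("action")), "unknown")
    return action, m.group("prefix"), m.group("error").strip()


# Log line from the original report, joined back onto a single line.
line = ("Oct 10 07:41:53 fw1 ospfd[44713]: send_rtmsg: action 1, "
        "prefix 10.250.250.153/32: File exists")
decoded = decode_send_rtmsg(line)
print(decoded)

# Compare the error text with the local strerror() text for EEXIST.
print(decoded[2] == os.strerror(errno.EEXIST))
```

So the message means that an RTM_ADD for the service prefix was rejected because the kernel considered the route to be present already.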


Regarding:

"

Because of the log message
  send_rtmsg: action 1, prefix 10.250.250.153/32: File exists
I understand that in your case the two Ubuntu hosts advertised the service
IP both with the same next hop and ospfd could not add that route because
it was already present. It was present because the 1st box did not withdraw
it's routes. (action 1 means "add route" (RTM_ADD)).

"

Well, if both hosts advertised the service IP, I don't understand how or
why. The 2nd Ubuntu host was not announcing the route because the IP
wasn't configured on the interface; it is only configured when keepalived
sets it.

When we shut down the first Ubuntu box, it stops sending LSAs and keepalive
messages simultaneously. When the other Ubuntu box then starts to send the
LSAs for the new prefix, it sends them with its own IP address, so the
next hop is now a different IP.

This is reflected in the OSPF RIB, which made me believe that the problem
is related to the ospfd process not being able to install the route in the
FIB.
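
The RIB/FIB mismatch can be checked with a short script. The following is only a sketch in Python: it assumes the output formats of "ospfctl show rib" and "route -n get" as they appear in the original report (with the wrapped RIB line joined), and the helper names rib_nexthops and fib_gateway are hypothetical.

```python
def rib_nexthops(ospfctl_rib_output, prefix):
    """Next hops that 'ospfctl show rib' lists for the given prefix."""
    hops = set()
    for line in ospfctl_rib_output.splitlines():
        fields = line.split()
        # Expected columns: Destination, Nexthop, Path Type, Type, Cost, Uptime
        if len(fields) >= 2 and fields[0] == prefix:
            hops.add(fields[1])
    return hops


def fib_gateway(route_get_output):
    """Gateway reported by 'route -n get <destination>', or None."""
    for line in route_get_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "gateway":
            return value.strip()
    return None


# Sample data taken from the original report.
RIB = "10.250.250.153/32    10.10.53.29       Intra-Area   Network   110    00:03:10"
FIB = """\
   route to: 10.250.250.153
destination: 10.250.250.153
       mask: 255.255.255.255
    gateway: 10.10.53.28
  interface: vlan1353
 if address: 10.10.53.26
   priority: 32 (ospf)
      flags: <UP,GATEWAY,DONE,MPATH>
"""

hops = rib_nexthops(RIB, "10.250.250.153/32")
gateway = fib_gateway(FIB)
in_sync = gateway in hops
print("RIB next hops:", sorted(hops))
print("FIB gateway:  ", gateway)
print("in sync:      ", in_sync)
```

With the data from the report, the RIB next hop is 10.10.53.29 while the FIB gateway is still 10.10.53.28, so the check reports that the two are out of sync.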



Regarding the pcap, I'll try to do this at another location, running
captures on the FWs.

I've found another place where the route has the MPATH flag set for no
apparent reason, and since the issue occurred in that situation, I'll try
to reproduce it there.

Right now, at this new location, it looks like this:


root@fw1:~# route -n get 10.250.250.153/32
   route to: 10.250.250.153
destination: 10.250.250.153
       mask: 255.255.255.255
    gateway: 10.10.53.20
  interface: vlan1253
 if address: 10.10.53.18
   priority: 32 (ospf)
      flags: <UP,GATEWAY,DONE,MPATH>
     use       mtu    expire
57966509         0         0
root@fw1:~# route -n show | grep 10.250.250.153
10.250.250.153/32  10.10.53.20        UGP        0 57967353     -    32 vlan1253
10.250.250.153/32  10.2.20.75         UG         0        0     -    48 vlan360
root@dc1fw1:~#


There are 2 routes available, but with different priorities, as one is
from OSPF and the other from BGP, so the MPATH flag shouldn't be set.
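
To look for other places in this state, something like the following Python sketch could be run against the routing table. It assumes that "route -n show" prints one route per line with the columns Destination, Gateway, Flags, Refs, Use, Mtu, Prio, Iface, and that "P" in the flags column is the multipath flag. The rule it applies (P is only expected when another route to the same destination has the same priority) is my expectation as described above, not confirmed kernel behaviour. The function names are hypothetical.

```python
from collections import Counter


def parse_routes(route_show_output):
    """Parse unwrapped 'route -n show' lines into dictionaries."""
    routes = []
    for line in route_show_output.splitlines():
        f = line.split()
        # Skip headers and anything that is not an 8-column route line.
        if len(f) != 8 or not f[6].isdigit():
            continue
        routes.append({
            "dest": f[0],
            "gateway": f[1],
            "flags": f[2],
            "prio": int(f[6]),
            "iface": f[7],
        })
    return routes


def unexpected_mpath(routes):
    """Routes flagged P although no other route shares destination and priority."""
    per_prio = Counter((r["dest"], r["prio"]) for r in routes)
    return [
        r for r in routes
        if "P" in r["flags"] and per_prio[(r["dest"], r["prio"])] == 1
    ]


# The two routes from the output above, one per line.
SAMPLE = """\
Destination        Gateway            Flags   Refs      Use   Mtu  Prio Iface
10.250.250.153/32  10.10.53.20        UGP        0 57967353     -    32 vlan1253
10.250.250.153/32  10.2.20.75         UG         0        0     -    48 vlan360
"""

suspects = unexpected_mpath(parse_routes(SAMPLE))
for r in suspects:
    print(r["dest"], "via", r["gateway"], "prio", r["prio"], "flags", r["flags"])
```

With the two routes above, only the OSPF route via 10.10.53.20 is reported, since it carries the P flag while being the only route at priority 32.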


I'll follow up once I'm able to run this test again.


Once again, thank you very much for your help.


Best regards,


João Alves


On 15.10.19 20:17, Remi Locherer wrote:
> Hi João,
>
> On Thu, Oct 10, 2019 at 03:01:30PM +0200, Joao Alves wrote:
>> Hello OpenBSD team,
>>
>>
>> We are facing an issue with OSPF related routes and would like to
>> request your help as it seems to be a OSPF to FIB route replication issue.
>>
>> This happened already once in a different location, that one is running
>> OpenBSD 6.3 and the site of the current report is OpenBSD 6.5
>>
>>
>> *Describing:*
>>
>>
>> We have a setup with a FW cluster of 2 hosts talking OSPF to 2 Ubuntu
>> boxes running Quagga.
>>
>>
>> The 2 Ubuntu boxes run keepalived between them to install a secondary IP
>> address on the interface, the service IP address.
>>
>> OSPF is configured to advertise this floating service IP and it's
>> advertised only when it's available in the interface.
>>
>> OSPF is configured to not become DR/BDR in Ubuntu hosts
>>
>>
>> *Initial state:*
>>
>> Service is active in ubuntu host A, everything working.
>>
>> root@fw1:~# ospfctl show nei
>> ID              Pri State        DeadTime Address         Iface     Uptime
>> (...)
>> 10.10.53.28     1   FULL/OTHER   00:00:04 10.10.53.28     vlan1353  00:16:01
>> 172.16.50.3     1   FULL/DR      00:00:04 10.10.53.27     vlan1353  03w2d10h
>> 10.10.53.29     1   FULL/OTHER   00:00:04 10.10.53.29     vlan1353  00:04:38
>>
>>
>> *Facing the issue:*
>>
>> Ubuntu host A is shutdown, keepalived converges to host B and OSPF
>> advertises the network, but service IP is unreachable.
>>
>> FW receives the correct update and we see the new nexthop correct in
>> "ospfctl show rib",
>>
>>
>> root@fw1:~# ospfctl show rib |grep  10.250.250.153  
>> 10.250.250.153/32    10.10.53.29       Intra-Area   Network   110    
>> 00:03:10
>> root@fw1:~# 
>>
>>
>> however FIB still points to old nexthop, the 10.10.53.28. The new
>> nexthop should end in .29.
>>
>>
>> root@fw1:~# route -n get 10.250.250.153
>>    route to: 10.250.250.153
>> destination: 10.250.250.153
>>        mask: 255.255.255.255
>>     gateway: 10.10.53.28
>>   interface: vlan1353
>>  if address: 10.10.53.26
>>    priority: 32 (ospf)
>>       flags: <UP,GATEWAY,DONE,MPATH>
>>      use       mtu    expire
>>     8298         0         0
>> root@fw1:~#
>>
>> in logs we see this message:
>>
>> Oct 10 07:41:53 fw1 ospfd[44713]: send_rtmsg: action 1, prefix
>> 10.250.250.153/32: File exists
>> Oct 10 07:42:03 fw1 ospfd[44713]: send_rtmsg: action 1, prefix
>> 10.250.250.153/32: File exists
>>
>>
>> This prefix is the service IP.
>>
>>
>> *The FIX (manual):*
>>
>>
>> To fix this we need to delete the route manually and since after
>> deleting it doesn't get the new route automatically installed in FIB, we
>> then reload the FIB.
>>
>> Sequence of commands:
>>
>> root@fw1:~# route -n show | grep 10.250.250
>> 10.250.250.53/32   10.10.11.155       UG         0   510172     -    48
>> vlan1150
>> 10.250.250.153/32  10.10.53.28        UGP        1    18861     -    32
>> vlan1353
>> 10.250.250.153/32  10.10.11.155       UG         0        0     -    48
>> vlan1150
>> root@fw1:~# route del 10.250.250.153/32 10.10.53.28
>> del host 10.250.250.153/32: gateway 10.10.53.28
>> root@fw1:~# route -n show | grep 10.250.250
>> 10.250.250.53/32   10.10.11.155       UG         0   510185     -    48
>> vlan1150
>> 10.250.250.153/32  10.10.11.155       UG         0     1550     -    48
>> vlan1150
>> root@fw1:~# route -n show | grep 10.250.250
>> 10.250.250.53/32   10.10.11.155       UG         0   510187     -    48
>> vlan1150
>> 10.250.250.153/32  10.10.11.155       UG         0     3806     -    48
>> vlan1150
>> root@fw1:~# route -n show | grep 10.250.250
>> 10.250.250.53/32   10.10.11.155       UG         0   510187     -    48
>> vlan1150
>> 10.250.250.153/32  10.10.11.155       UG         0     4711     -    48
>> vlan1150
>> root@fw1:~#
>> root@fw1:~# route -n show | grep 10.250.250
>> 10.250.250.53/32   10.10.11.155       UG         0   510188     -    48
>> vlan1150
>> 10.250.250.153/32  10.10.11.155       UG         0     7373     -    48
>> vlan1150
>> root@fw1:~#
>> root@fw1:~#
>> root@fw1:~# route -n show | grep 10.250.250
>> 10.250.250.53/32   10.10.11.155       UG         0   510188     -    48
>> vlan1150
>> 10.250.250.153/32  10.10.11.155       UG         0     8505     -    48
>> vlan1150
>> root@fw1:~#
>>
>>
>> root@fw1:~# ospfctl fib reload
>> reload request sent.
>> root@fw1:~# 
>>
>>
>> root@fw1:~# route -n show | grep 10.250.250
>> 10.250.250.53/32   10.10.11.155       UG         0       20     -    48
>> vlan1150
>> 10.250.250.153/32  10.10.53.29        UG         0      106     -    32
>> vlan1353
>> 10.250.250.153/32  10.10.11.155       UG         0        0     -    48
>> vlan1150
>> root@fw1:~#
>>
>>
>> At this point service is restored, and the problem is not reproduce-able
>> anymore.
>>
>> The difference I see is that the new route doesn't have the Multipath
>> flag anymore.
>>
>>
>> *Details about OpenBSD:*
>>
>> *dmesg.boot:*
>>
>> OpenBSD 6.5 (GENERIC.MP) #5: Thu Aug 29 20:38:30 CEST 2019
>     
> [..]
>  
>> *OSPF configuration*
>>
>> *FW1*
>>
>> router-id 172.16.50.2
>> redistribute static
>>
>> auth-type crypt
>> auth-md 1 "***********"
>> auth-md-keyid 1
>>
>> hello-interval 1
>> router-dead-time 5
>>
>> area 0.0.0.0 {
>>         interface lo1 { passive }
>>         interface vlan1150 { metric 1 }
>>         interface vlan1353 { metric 100 }
>>         interface vlan463 { metric 100 }
>>         interface carp464 { passive }
>>         interface carp1650 { passive }
>>         interface vlan2004 { passive }
>>         interface vlan364 { passive }
>> }
>>
>> *FW2*
>>
>>
>> router-id 172.16.50.3
>> redistribute static
>>
>> auth-type crypt
>> auth-md 1 "**********"
>> auth-md-keyid 1
>>
>> hello-interval 1
>> router-dead-time 5
>>
>> area 0.0.0.0 {
>>         interface lo1 { passive }
>>         interface vlan1150 { metric 1 }
>>         interface vlan1353 { metric 101 }
>>         interface vlan463 { metric 102 }
>>         interface carp464 { passive }
>>         interface carp1650 { passive }
>>         interface vlan2004 {
>>                         passive
>>                         metric 1000
>>                         }
>>         interface vlan364 {
>>                         passive
>>                         metric 1000
>>                         }
>> }
>>
>> *Ubuntu Hosts:*
>>
>> *Host A*
>>
>>
>> Current configuration:
>> !
>> !
>> interface ens192
>>  ip ospf authentication message-digest
>>  ip ospf dead-interval 5
>>  ip ospf hello-interval 1
>>  ip ospf message-digest-key 1 md5 **********
>>  no link-detect
>> !
>> interface lo
>> !
>> router ospf
>>  ospf router-id 10.10.53.28
>>  passive-interface default
>>  no passive-interface ens192
>>  network 10.10.53.24/29 area 0.0.0.0
>>  network 10.250.250.153/32 area 0.0.0.0
>>  area 0.0.0.0 authentication message-digest
>> !
>> line vty
>> !
>> end
>>
>>
>> *Host B*
>>
>>
>> Current configuration:
>> !
>> !
>> interface ens192
>>  ip ospf authentication message-digest
>>  ip ospf dead-interval 5
>>  ip ospf hello-interval 1
>>  ip ospf message-digest-key 1 md5 *********
>>  no link-detect
>> !
>> interface lo
>> !
>> router ospf
>>  ospf router-id 10.10.53.29
>>  passive-interface default
>>  no passive-interface ens192
>>  network 10.10.53.24/29 area 0.0.0.0
>>  network 10.250.250.153/32 area 0.0.0.0
>>  area 0.0.0.0 authentication message-digest
>> !
>> line vty
>> !
>> end
>>
>> Can you please provide us help with this issue?
> I tried to reproduce your issue but was not successful. I tried
> with 6.5 and with -current on the "FW" side and also used
> OpenBSD-current on the "Host/Ubuntu" side. So it's not exactly the same.
>
> I assume in your scenario the first Ubuntu box just dies. Once the 2nd
> box detects this it starts announcing the service IP address. But the 1st
> box does not withdraw it's routes (send LSAs with max age).
>
> To simulate that I used "pkill -9 ospfd" on the first box and reloaded
> ospfd on the 2nd to make it start announcing the service IP which I
> configured on lo1.
>
> The result was that for a short period of time I had two fib entries for
> the service IP with different next hops on the FW. After the inactivity
> timeout expires the first route is removed.
>
> Because of the log message
>   send_rtmsg: action 1, prefix 10.250.250.153/32: File exists
> I understand that in your case the two Ubuntu hosts advertised the service
> IP both with the same next hop and ospfd could not add that route because
> it was already present. It was present because the 1st box did not withdraw
> it's routes. (action 1 means "add route" (RTM_ADD)).
>
> You get the same error message when adding a static route twice:
>   r1# route add 1.1.1.1 192.168.250.1
>   add host 1.1.1.1: gateway 192.168.250.1
>   r1# route add 1.1.1.1 192.168.250.1
>   add host 1.1.1.1: gateway 192.168.250.1: File exists
>
> Could you share a pcap file with the OSPF traffic during this failover?
> With that I could check if my theory is correct or if ospfd is doing something
> wrong.
>
> Remi
