Hi João,

On Thu, Oct 10, 2019 at 03:01:30PM +0200, Joao Alves wrote:
> Hello OpenBSD team,
> 
> 
> We are facing an issue with OSPF related routes and would like to
> request your help as it seems to be a OSPF to FIB route replication issue.
> 
> This happened already once in a different location, that one is running
> OpenBSD 6.3 and the site of the current report is OpenBSD 6.5
> 
> 
> *Describing:*
> 
> 
> We have a setup with a FW cluster of 2 hosts talking OSPF to 2 Ubuntu
> boxes running Quagga.
> 
> 
> The 2 Ubuntu boxes run keepalived between them to install a secondary IP
> address on the interface, the service IP address.
> 
> OSPF is configured to advertise this floating service IP and it's
> advertised only when it's available in the interface.
> 
> OSPF is configured to not become DR/BDR in Ubuntu hosts
> 
> 
> *Initial state:*
> 
> Service is active in ubuntu host A, everything working.
> 
> root@fw1:~# ospfctl show nei
> ID              Pri State        DeadTime Address         Iface     Uptime
> (...)
> 10.10.53.28     1   FULL/OTHER   00:00:04 10.10.53.28     vlan1353  00:16:01
> 172.16.50.3     1   FULL/DR      00:00:04 10.10.53.27     vlan1353  03w2d10h
> 10.10.53.29     1   FULL/OTHER   00:00:04 10.10.53.29     vlan1353  00:04:38
> 
> 
> *Facing the issue:*
> 
> Ubuntu host A is shutdown, keepalived converges to host B and OSPF
> advertises the network, but service IP is unreachable.
> 
> FW receives the correct update and we see the new nexthop correct in
> "ospfctl show rib",
> 
> 
> root@fw1:~# ospfctl show rib |grep  10.250.250.153  
> 10.250.250.153/32    10.10.53.29       Intra-Area   Network   110    
> 00:03:10
> root@fw1:~# 
> 
> 
> however FIB still points to old nexthop, the 10.10.53.28. The new
> nexthop should end in .29.
> 
> 
> root@fw1:~# route -n get 10.250.250.153
>    route to: 10.250.250.153
> destination: 10.250.250.153
>        mask: 255.255.255.255
>     gateway: 10.10.53.28
>   interface: vlan1353
>  if address: 10.10.53.26
>    priority: 32 (ospf)
>       flags: <UP,GATEWAY,DONE,MPATH>
>      use       mtu    expire
>     8298         0         0
> root@fw1:~#
> 
> in logs we see this message:
> 
> Oct 10 07:41:53 fw1 ospfd[44713]: send_rtmsg: action 1, prefix
> 10.250.250.153/32: File exists
> Oct 10 07:42:03 fw1 ospfd[44713]: send_rtmsg: action 1, prefix
> 10.250.250.153/32: File exists
> 
> 
> This prefix is the service IP.
> 
> 
> *The FIX (manual):*
> 
> 
> To fix this we need to delete the route manually and since after
> deleting it doesn't get the new route automatically installed in FIB, we
> then reload the FIB.
> 
> Sequence of commands:
> 
> root@fw1:~# route -n show | grep 10.250.250
> 10.250.250.53/32   10.10.11.155       UG         0   510172     -    48
> vlan1150
> 10.250.250.153/32  10.10.53.28        UGP        1    18861     -    32
> vlan1353
> 10.250.250.153/32  10.10.11.155       UG         0        0     -    48
> vlan1150
> root@fw1:~# route del 10.250.250.153/32 10.10.53.28
> del host 10.250.250.153/32: gateway 10.10.53.28
> root@fw1:~# route -n show | grep 10.250.250
> 10.250.250.53/32   10.10.11.155       UG         0   510185     -    48
> vlan1150
> 10.250.250.153/32  10.10.11.155       UG         0     1550     -    48
> vlan1150
> root@fw1:~# route -n show | grep 10.250.250
> 10.250.250.53/32   10.10.11.155       UG         0   510187     -    48
> vlan1150
> 10.250.250.153/32  10.10.11.155       UG         0     3806     -    48
> vlan1150
> root@fw1:~# route -n show | grep 10.250.250
> 10.250.250.53/32   10.10.11.155       UG         0   510187     -    48
> vlan1150
> 10.250.250.153/32  10.10.11.155       UG         0     4711     -    48
> vlan1150
> root@fw1:~#
> root@fw1:~# route -n show | grep 10.250.250
> 10.250.250.53/32   10.10.11.155       UG         0   510188     -    48
> vlan1150
> 10.250.250.153/32  10.10.11.155       UG         0     7373     -    48
> vlan1150
> root@fw1:~#
> root@fw1:~#
> root@fw1:~# route -n show | grep 10.250.250
> 10.250.250.53/32   10.10.11.155       UG         0   510188     -    48
> vlan1150
> 10.250.250.153/32  10.10.11.155       UG         0     8505     -    48
> vlan1150
> root@fw1:~#
> 
> 
> root@fw1:~# ospfctl fib reload
> reload request sent.
> root@fw1:~# 
> 
> 
> root@fw1:~# route -n show | grep 10.250.250
> 10.250.250.53/32   10.10.11.155       UG         0       20     -    48
> vlan1150
> 10.250.250.153/32  10.10.53.29        UG         0      106     -    32
> vlan1353
> 10.250.250.153/32  10.10.11.155       UG         0        0     -    48
> vlan1150
> root@fw1:~#
> 
> 
> At this point service is restored, and the problem is not reproduce-able
> anymore.
> 
> The difference I see is that the new route doesn't have the Multipath
> flag anymore.
> 
> 
> *Details about OpenBSD:*
> 
> *dmesg.boot:*
> 
> OpenBSD 6.5 (GENERIC.MP) #5: Thu Aug 29 20:38:30 CEST 2019
    
[..]
 
> 
> *OSPF configuration*
> 
> *FW1*
> 
> router-id 172.16.50.2
> redistribute static
> 
> auth-type crypt
> auth-md 1 "***********"
> auth-md-keyid 1
> 
> hello-interval 1
> router-dead-time 5
> 
> area 0.0.0.0 {
>         interface lo1 { passive }
>         interface vlan1150 { metric 1 }
>         interface vlan1353 { metric 100 }
>         interface vlan463 { metric 100 }
>         interface carp464 { passive }
>         interface carp1650 { passive }
>         interface vlan2004 { passive }
>         interface vlan364 { passive }
> }
> 
> *FW2*
> 
> 
> router-id 172.16.50.3
> redistribute static
> 
> auth-type crypt
> auth-md 1 "**********"
> auth-md-keyid 1
> 
> hello-interval 1
> router-dead-time 5
> 
> area 0.0.0.0 {
>         interface lo1 { passive }
>         interface vlan1150 { metric 1 }
>         interface vlan1353 { metric 101 }
>         interface vlan463 { metric 102 }
>         interface carp464 { passive }
>         interface carp1650 { passive }
>         interface vlan2004 {
>                         passive
>                         metric 1000
>                         }
>         interface vlan364 {
>                         passive
>                         metric 1000
>                         }
> }
> 
> *Ubuntu Hosts:*
> 
> *Host A*
> 
> *
> *
> 
> Current configuration:
> !
> !
> interface ens192
>  ip ospf authentication message-digest
>  ip ospf dead-interval 5
>  ip ospf hello-interval 1
>  ip ospf message-digest-key 1 md5 **********
>  no link-detect
> !
> interface lo
> !
> router ospf
>  ospf router-id 10.10.53.28
>  passive-interface default
>  no passive-interface ens192
>  network 10.10.53.24/29 area 0.0.0.0
>  network 10.250.250.153/32 area 0.0.0.0
>  area 0.0.0.0 authentication message-digest
> !
> line vty
> !
> end
> 
> 
> *Host B*
> 
> *
> *
> 
> Current configuration:
> !
> !
> interface ens192
>  ip ospf authentication message-digest
>  ip ospf dead-interval 5
>  ip ospf hello-interval 1
>  ip ospf message-digest-key 1 md5 *********
>  no link-detect
> !
> interface lo
> !
> router ospf
>  ospf router-id 10.10.53.29
>  passive-interface default
>  no passive-interface ens192
>  network 10.10.53.24/29 area 0.0.0.0
>  network 10.250.250.153/32 area 0.0.0.0
>  area 0.0.0.0 authentication message-digest
> !
> line vty
> !
> end
> 
> *
> *
> 
> Can you please provide us help with this issue ?*
> *

I tried to reproduce your issue but was not successful. I tried
with 6.5 and with -current on the "FW" side and also used
OpenBSD-current on the "Host/Ubuntu" side. So it's not exactly the same.

I assume in your scenario the first Ubuntu box just dies. Once the 2nd
box detects this it starts announcing the service IP address. But the 1st
box does not withdraw its routes (it never sends LSAs with MaxAge).

To simulate that I used "pkill -9 ospfd" on the first box and reloaded
ospfd on the 2nd to make it start announcing the service IP which I
configured on lo1.

The result was that for a short period of time I had two fib entries for
the service IP with different next hops on the FW. After the inactivity
timeout expires the first route is removed.

Because of the log message
  send_rtmsg: action 1, prefix 10.250.250.153/32: File exists
I understand that in your case the two Ubuntu hosts advertised the service
IP both with the same next hop and ospfd could not add that route because
it was already present. It was present because the 1st box did not withdraw
its routes. (Action 1 means "add route", RTM_ADD.)

You get the same error message when adding a static route twice:
  r1# route add 1.1.1.1 192.168.250.1
  add host 1.1.1.1: gateway 192.168.250.1
  r1# route add 1.1.1.1 192.168.250.1
  add host 1.1.1.1: gateway 192.168.250.1: File exists
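
To make the theory concrete, here is a minimal sketch of the duplicate-add
rule in Python. It is a toy model and not the kernel's or ospfd's code: the
names ToyFib and try_add are made up for illustration, and it assumes a
route is identified only by its prefix and gateway.

```python
import errno
import os


class ToyFib:
    """Toy routing table (hypothetical, not the OpenBSD kernel)."""

    def __init__(self):
        # Assumption: a route is identified by (prefix, gateway) only.
        self.routes = set()

    def add(self, prefix, gateway):
        key = (prefix, gateway)
        if key in self.routes:
            # Same failure class as the log line: EEXIST, "File exists".
            raise FileExistsError(errno.EEXIST, os.strerror(errno.EEXIST))
        self.routes.add(key)

    def delete(self, prefix, gateway):
        # Removing an absent route is silently ignored in this toy model.
        self.routes.discard((prefix, gateway))


def try_add(fib, prefix, gateway):
    """Return "ok" on success, or the error string if the add is refused."""
    try:
        fib.add(prefix, gateway)
        return "ok"
    except FileExistsError as exc:
        return exc.strerror


if __name__ == "__main__":
    fib = ToyFib()
    print(try_add(fib, "1.1.1.1/32", "192.168.250.1"))  # first add succeeds
    print(try_add(fib, "1.1.1.1/32", "192.168.250.1"))  # duplicate is refused
    fib.delete("1.1.1.1/32", "192.168.250.1")
    print(try_add(fib, "1.1.1.1/32", "192.168.250.1"))  # works after delete
```

In this model the stale entry has to go away, by withdrawal, timeout or a
manual delete, before an identical add can succeed.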

Could you share a pcap file with the OSPF traffic during this failover?
With that I could check if my theory is correct or if ospfd is doing something
wrong.

Remi
