And here are the patches :-)

On Tue, Sep 27, 2016 at 4:09 PM Neil Jerram <n...@tigera.io> wrote:

> Hi BIRD users!
>
> Attached are 3 patches that my team has been using for routing through
> IP-in-IP tunnels, rebased on 1.6.1.  I'd like to explain why we find them
> useful, and start a conversation about whether they or something like them
> could be upstreamed (or perhaps if there's some better way of achieving our
> aims).
>
> Calico [1] uses BIRD for BGP routing between the hosts in various cloud
> orchestration systems (Kubernetes, OpenStack etc.), to distribute routes to
> the pods/VMs/containers in those systems, each of which has its own IP.  If
> all the hosts are directly connected to each other, this is
> straightforward, but sometimes they are not.  For example GCE instances are
> not directly connected to each other: there is at least one router between
> them, that knows about routing GCE addresses, and to/from the Internet, and
> we cannot peer with it or otherwise tell it how to route pod/VM/container
> IPs.  So if we use GCE to create e.g. OpenStack compute hosts, with Calico
> networking, we need to do something extra to allow VM-addressed data to
> pass between the compute hosts.
>
> One of our solutions is to use IP-in-IP; it works as shown by this diagram:
>
>        10.65.0.3 via 10.240.0.5 dev tunl0 onlink
>        default via 10.240.0.1
>                |
>              +-|----------+                             +------------+
>              | o          |                             |            |
>              |   Host A   |         +--------+          |   Host B   |
>              |            |---------| Router |----------|            |
>              | 10.240.0.4 |         +--------+          | 10.240.0.5 |
>              |            |---.                         |            |
>              +------------+    |                        +------------+
>                ^       ^   +---v---+                                |
>  src 10.65.0.2 |       |   | tunl0 |                                |
>  dst 10.65.0.3 |       |   +-------+                                |
>                |        \      |                                    v
>          +-----------+   '----'                               +-----------+
>          |   Pod A   |      src 10.240.0.4                    |   Pod B   |
>          | 10.65.0.2 |      dst 10.240.0.5                    | 10.65.0.3 |
>          +-----------+          ------                        +-----------+
>                              src 10.65.0.2
>                              dst 10.65.0.3
>
> The diagram shows Pod A sending a packet to Pod B, using IP addresses that
> are unknown to the 'Router' between the two hosts.  Host A has an IP-in-IP
> device, tunl0, and a route that says to use that device for data to Pod B's
> address (10.65.0.3).  When the packet has passed through that device, it
> has a new outer IP header, with src 10.240.0.4 and dst 10.240.0.5, and is
> routed again according to the routing table - so now it can successfully
> reach Host B.
>
> So how is BIRD involved?  We statically program the local Pod route on
> each host:
>
> On Host A: 10.65.0.2 dev <interface to Pod A>
> On Host B: 10.65.0.3 dev <interface to Pod B>
>
> then run a BIRD BGP session between Host A and Host B to propagate those
> routes to the other host - which would normally give us:
>
> On Host A: 10.65.0.3 via 10.240.0.5
> On Host B: 10.65.0.2 via 10.240.0.4
>
> But we don't want those normal routes, because then the data would get
> lost at 'Router'.  So we enhance and configure BIRD as follows.
>
> - In the export filter for protocol kernel, for the relevant routes, we
> set an attribute 'krt_tunnel = tunl0'.
>
> - We modify BIRD, as in the attached patches, to understand that that
> means that those routes should have 'dev tunl0'.
>
> Then instead, we get:
>
> On Host A: 10.65.0.3 via 10.240.0.5 dev tunl0 onlink
> On Host B: 10.65.0.2 via 10.240.0.4 dev tunl0 onlink
>
> which allows successful routing of data between the Pods.
>
>
> Thanks for reading this far!  I now have three questions:
>
> 1. Does the routing approach above make sense?  (Or is there some better
> or simpler or already supported way that we could achieve the same thing?)
>
> 2. If (1), would the BIRD team accept patches broadly on the lines of
> those that are attached?
>
> 3. If (2), please let me know if the attached patches are already
> acceptable, or otherwise what further work is needed for them.
>
> Many thanks,
>     Neil
>
>
From bed50e27dd14aa98a89a2c9e0e7a63a87bcaa830 Mon Sep 17 00:00:00 2001
From: Shaun Crampton <shaun.cramp...@metaswitch.com>
Date: Wed, 17 Jun 2015 16:14:41 -0700
Subject: [PATCH 3/3] Disable recursive route check for GCE compatibility.

GCE uses a /32 for VM IPs with a default gateway that is, thus,
off-subnet.  This patch removes a check in BIRD that prevents
BIRD from accepting such a route as a valid next hop.

The following trail asserts that the check is not really needed
but that it is normally a useful sanity check:

https://www.mail-archive.com/bird-users@atrey.karlin.mff.cuni.cz/msg02600.html

Ideally GCE would set the online flag on the route or use a
link-local IP instead but that is unlikely to change.
---
 sysdep/linux/netlink.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c
index d754df7..3ac5b27 100644
--- a/sysdep/linux/netlink.c
+++ b/sysdep/linux/netlink.c
@@ -521,8 +521,7 @@ nl_parse_multipath(struct krt_proto *p, struct rtattr *ra)
 	  memcpy(&rv->gw, RTA_DATA(a[RTA_GATEWAY]), sizeof(ip_addr));
 	  ipa_ntoh(rv->gw);
 
-	  neighbor *ng = neigh_find2(&p->p, &rv->gw, rv->iface,
-				     (nh->rtnh_flags & RTNH_F_ONLINK) ? NEF_ONLINK : 0);
+	  neighbor *ng = neigh_find2(&p->p, &rv->gw, rv->iface, NEF_ONLINK);
 	  if (!ng || (ng->scope == SCOPE_HOST))
 	    return NULL;
 	}
@@ -1290,8 +1289,7 @@ nl_parse_route(struct nl_parse_state *s, struct nlmsghdr *h)
 	    return;
 #endif
 
-	  ng = neigh_find2(&p->p, &ra->gw, ra->iface,
-			   (i->rtm_flags & RTNH_F_ONLINK) ? NEF_ONLINK : 0);
+	  ng = neigh_find2(&p->p, &ra->gw, ra->iface, NEF_ONLINK);
 	  if (!ng || (ng->scope == SCOPE_HOST))
 	    {
 	      log(L_ERR "KRT: Received route %I/%d with strange next-hop %I",
-- 
2.7.4

From feda3ba2effd74936f7ba1b03da1e30a9dee3a16 Mon Sep 17 00:00:00 2001
From: Mike Evans <mike.ev...@metaswitch.com>
Date: Thu, 4 Jun 2015 07:50:08 +0000
Subject: [PATCH 2/3] Add kernel tunnel dynamic attribute

For IP-in-IP routes, use original next hop received in BGP UPDATE

Instead of using a recursively resolved next hop, which will probably
be to somewhere half way along the tunnel, from where packets won't be
able to be routed any further.
---
 nest/route.h           |  1 +
 proto/bgp/packets.c    |  1 +
 sysdep/linux/krt-sys.h |  1 +
 sysdep/linux/netlink.Y |  4 +++-
 sysdep/linux/netlink.c | 23 +++++++++++++++++++++--
 5 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/nest/route.h b/nest/route.h
index 2bfd066..3eb86c8 100644
--- a/nest/route.h
+++ b/nest/route.h
@@ -359,6 +359,7 @@ typedef struct rta {
   byte aflags;				/* Attribute cache flags (RTAF_...) */
   u16 hash_key;				/* Hash over important fields */
   u32 igp_metric;			/* IGP metric to next hop (for iBGP routes) */
+  ip_addr orig_gw;			/* Original next hop from BGP UPDATE */
   ip_addr gw;				/* Next hop */
   ip_addr from;				/* Advertising router */
   struct hostentry *hostentry;		/* Hostentry for recursive next-hops */
diff --git a/proto/bgp/packets.c b/proto/bgp/packets.c
index 0cf38ed..fba9d60 100644
--- a/proto/bgp/packets.c
+++ b/proto/bgp/packets.c
@@ -1180,6 +1180,7 @@ bgp_set_next_hop(struct bgp_proto *p, rta *a)
       if (ipa_zero(*nexthop))
 	  return 0;
 
+      a->orig_gw = *nexthop;
       rta_set_recursive_next_hop(p->p.table, a, p->igp_table, nexthop, nexthop + second);
     }
 
diff --git a/sysdep/linux/krt-sys.h b/sysdep/linux/krt-sys.h
index 6d6586d..cd4d2ec 100644
--- a/sysdep/linux/krt-sys.h
+++ b/sysdep/linux/krt-sys.h
@@ -37,6 +37,7 @@ static inline struct ifa * kif_get_primary_ip(struct iface *i) { return NULL; }
 #define EA_KRT_PREFSRC		EA_CODE(EAP_KRT, 0x10)
 #define EA_KRT_REALM		EA_CODE(EAP_KRT, 0x11)
 #define EA_KRT_SCOPE		EA_CODE(EAP_KRT, 0x12)
+#define EA_KRT_TUNNEL		EA_CODE(EAP_KRT, 0x13)
 
 
 #define KRT_METRICS_MAX		0x10	/* RTAX_QUICKACK+1 */
diff --git a/sysdep/linux/netlink.Y b/sysdep/linux/netlink.Y
index f577244..4266333 100644
--- a/sysdep/linux/netlink.Y
+++ b/sysdep/linux/netlink.Y
@@ -15,7 +15,8 @@ CF_KEYWORDS(KERNEL, TABLE, METRIC, KRT_PREFSRC, KRT_REALM, KRT_SCOPE, KRT_MTU, K
 	    KRT_HOPLIMIT, KRT_INITCWND, KRT_RTO_MIN, KRT_INITRWND, KRT_QUICKACK,
 	    KRT_LOCK_MTU, KRT_LOCK_WINDOW, KRT_LOCK_RTT, KRT_LOCK_RTTVAR,
 	    KRT_LOCK_SSTRESH, KRT_LOCK_CWND, KRT_LOCK_ADVMSS, KRT_LOCK_REORDERING,
-	    KRT_LOCK_HOPLIMIT, KRT_LOCK_RTO_MIN, KRT_FEATURE_ECN, KRT_FEATURE_ALLFRAG)
+	    KRT_LOCK_HOPLIMIT, KRT_LOCK_RTO_MIN, KRT_FEATURE_ECN, KRT_FEATURE_ALLFRAG,
+	    KRT_TUNNEL)
 
 CF_GRAMMAR
 
@@ -29,6 +30,7 @@ kern_sys_item:
 CF_ADDTO(dynamic_attr, KRT_PREFSRC	{ $$ = f_new_dynamic_attr(EAF_TYPE_IP_ADDRESS, T_IP, EA_KRT_PREFSRC); })
 CF_ADDTO(dynamic_attr, KRT_REALM	{ $$ = f_new_dynamic_attr(EAF_TYPE_INT, T_INT, EA_KRT_REALM); })
 CF_ADDTO(dynamic_attr, KRT_SCOPE	{ $$ = f_new_dynamic_attr(EAF_TYPE_INT, T_INT, EA_KRT_SCOPE); })
+CF_ADDTO(dynamic_attr, KRT_TUNNEL	{ $$ = f_new_dynamic_attr(EAF_TYPE_STRING, T_STRING, EA_KRT_TUNNEL); })
 
 CF_ADDTO(dynamic_attr, KRT_MTU		{ $$ = f_new_dynamic_attr(EAF_TYPE_INT, T_INT, EA_KRT_MTU); })
 CF_ADDTO(dynamic_attr, KRT_WINDOW	{ $$ = f_new_dynamic_attr(EAF_TYPE_INT, T_INT, EA_KRT_WINDOW); })
diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c
index 1490213..d754df7 100644
--- a/sysdep/linux/netlink.c
+++ b/sysdep/linux/netlink.c
@@ -887,6 +887,7 @@ nl_send_route(struct krt_proto *p, rte *e, struct ea_list *eattrs, int op, int d
     struct rtmsg r;
     char buf[128 + KRT_METRICS_MAX*8 + nh_bufsize(a->nexthops)];
   } r;
+  struct iface* i;
 
   DBG("nl_send_route(%I/%d,op=%x)\n", net->n.prefix, net->n.pxlen, op);
 
@@ -962,8 +963,22 @@ dest:
     {
     case RTD_ROUTER:
       r.r.rtm_type = RTN_UNICAST;
-      nl_add_attr_u32(&r.h, sizeof(r), RTA_OIF, iface->index);
-      nl_add_attr_ipa(&r.h, sizeof(r), RTA_GATEWAY, gw);
+      if ((ea = ea_find(eattrs, EA_KRT_TUNNEL)) &&
+          (i = if_find_by_name(ea->u.ptr->data)))
+      {
+        /*
+         * Tunnel attribute is set, so set the route up using the specified tunnel device
+         * to the originator of the route.
+         */
+        nl_add_attr_u32(&r.h, sizeof(r), RTA_OIF, i->index);
+        nl_add_attr_ipa(&r.h, sizeof(r), RTA_GATEWAY, a->orig_gw);
+        r.r.rtm_flags |= RTNH_F_ONLINK;
+      }
+      else
+      {
+         nl_add_attr_u32(&r.h, sizeof(r), RTA_OIF, iface->index);
+         nl_add_attr_ipa(&r.h, sizeof(r), RTA_GATEWAY, gw);
+      }
       break;
     case RTD_DEVICE:
       r.r.rtm_type = RTN_UNICAST;
@@ -1652,6 +1667,10 @@ krt_sys_get_attr(eattr *a, byte *buf, int buflen UNUSED)
     bsprintf(buf, "scope");
     return GA_NAME;
 
+  case EA_KRT_TUNNEL:
+    bsprintf(buf, "tunnel");
+    return GA_NAME;
+
   case EA_KRT_LOCK:
     buf += bsprintf(buf, "lock:");
     ea_format_bitfield(a, buf, buflen, krt_metrics_names, 2, KRT_METRICS_MAX);
-- 
2.7.4

From b06a7cf7b9d34201c57baaa525003a9d71a0f2ff Mon Sep 17 00:00:00 2001
From: Mike Evans <mike.ev...@metaswitch.com>
Date: Thu, 4 Jun 2015 12:05:39 +0000
Subject: [PATCH 1/3] Add support for EAF_TYPE_STRING - ie. read-write string
 extended attributes

---
 filter/filter.c | 20 ++++++++++++++++++--
 nest/route.h    |  1 +
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/filter/filter.c b/filter/filter.c
index 2f5f00d..b2f137f 100644
--- a/filter/filter.c
+++ b/filter/filter.c
@@ -555,6 +555,8 @@ interpret(struct f_inst *what)
   unsigned u1, u2;
   int i;
   u32 as;
+  int len;
+  struct adata *ad;
 
   res.type = T_VOID;
   if (!what)
@@ -922,6 +924,10 @@ interpret(struct f_inst *what)
 	res.type = T_INT;
 	res.val.i = e->u.data;
 	break;
+      case EAF_TYPE_STRING:
+        res.type = T_STRING;
+        res.val.s = e->u.ptr->data;
+        break;
       case EAF_TYPE_ROUTER_ID:
 	res.type = T_QUAD;
 	res.val.i = e->u.data;
@@ -980,6 +986,16 @@ interpret(struct f_inst *what)
 	l->attrs[0].u.data = v1.val.i;
 	break;
 
+      case EAF_TYPE_STRING:
+        if (v1.type != T_STRING)
+          runtime( "Setting string attribute to non-string value" );
+        len = strlen(v1.val.s) + 1;
+	ad = lp_alloc(f_pool, sizeof(struct adata) + len);
+        ad->length = len;
+        strcpy(ad->data, v1.val.s);
+        l->attrs[0].u.ptr = ad;
+        break;
+
       case EAF_TYPE_ROUTER_ID:
 #ifndef IPV6
 	/* IP->Quad implicit conversion */
@@ -1000,8 +1016,8 @@ interpret(struct f_inst *what)
       case EAF_TYPE_IP_ADDRESS:
 	if (v1.type != T_IP)
 	  runtime( "Setting ip attribute to non-ip value" );
-	int len = sizeof(ip_addr);
-	struct adata *ad = lp_alloc(f_pool, sizeof(struct adata) + len);
+	len = sizeof(ip_addr);
+	ad = lp_alloc(f_pool, sizeof(struct adata) + len);
 	ad->length = len;
 	(* (ip_addr *) ad->data) = v1.val.px.ip;
 	l->attrs[0].u.ptr = ad;
diff --git a/nest/route.h b/nest/route.h
index 2fcb189..2bfd066 100644
--- a/nest/route.h
+++ b/nest/route.h
@@ -445,6 +445,7 @@ typedef struct eattr {
 #define EAF_TYPE_MASK 0x0f		/* Mask with this to get type */
 #define EAF_TYPE_INT 0x01		/* 32-bit unsigned integer number */
 #define EAF_TYPE_OPAQUE 0x02		/* Opaque byte string (not filterable) */
+#define EAF_TYPE_STRING 0x03            /* String */
 #define EAF_TYPE_IP_ADDRESS 0x04	/* IP address */
 #define EAF_TYPE_ROUTER_ID 0x05		/* Router ID (IPv4 address) */
 #define EAF_TYPE_AS_PATH 0x06		/* BGP AS path (encoding per RFC 1771:4.3) */
-- 
2.7.4

Reply via email to