And here are the patches :-) On Tue, Sep 27, 2016 at 4:09 PM Neil Jerram <n...@tigera.io> wrote:
> Hi BIRD users! > > Attached are 3 patches that my team has been using for routing through > IP-in-IP tunnels, rebased on 1.6.1. I'd like to explain why we find them > useful, and start a conversation about whether they or something like them > could be upstreamed (or perhaps if there's some better way of achieving our > aims). > > Calico [1] uses BIRD for BGP routing between the hosts in various cloud > orchestration systems (Kubernetes, OpenStack etc.), to distribute routes to > the pods/VMs/containers in those systems, each of which has its own IP. If > all the hosts are directly connected to each other, this is > straightforward, but sometimes they are not. For example GCE instances are > not directly connected to each other: there is at least one router between > them, that knows about routing GCE addresses, and to/from the Internet, and > we cannot peer with it or otherwise tell it how to route pod/VM/container > IPs. So if we use GCE to create e.g. OpenStack compute hosts, with Calico > networking, we need to do something extra to allow VM-addressed data to > pass between the compute hosts. > > One of our solutions is to use IP-in-IP; it works as shown by this diagram: > > 10.65.0.3 via 10.240.0.5 dev tunl0 onlink > default via 10.240.0.1 > | > +-|----------+ +------------+ > | o | | | > | Host A | +--------+ | Host B | > | |---------| Router |----------| | > | 10.240.0.4 | +--------+ | 10.240.0.5 | > | |---. | | > +------------+ | +------------+ > ^ ^ +---v---+ | > src 10.65.0.2 | | | tunl0 | | > dst 10.65.0.3 | | +-------+ | > | \ | v > +-----------+ '----' +-----------+ > | Pod A | src 10.240.0.4 | Pod B | > | 10.65.0.2 | dst 10.240.0.5 | 10.65.0.3 | > +-----------+ ------ +-----------+ > src 10.65.0.2 > dst 10.65.0.3 > > The diagram shows Pod A sending a packet to Pod B, using IP addresses that > are unknown to the 'Router' between the two hosts. Host A has an IP-in-IP > device, tunl0, and a route that says to use that device for data to Pod B's > address (10.65.0.3). When the packet has passed through that device, it > has a new outer IP header, with src 10.240.0.4 and dst 10.240.0.5, and is > routed again according to the routing table - so now it can successfully > reach Host B. > > So how is BIRD involved? We statically program the local Pod route on > each host: > > On Host A: 10.65.0.2 dev <interface to Pod A> > On Host B: 10.65.0.3 dev <interface to Pod B> > > then run a BIRD BGP session between Host A and Host B to propagate those > routes to the other host - which would normally give us: > > On Host A: 10.65.0.3 via 10.240.0.5 > On Host B: 10.65.0.2 via 10.240.0.4 > > But we don't want those normal routes, because then the data would get > lost at 'Router'. So we enhance and configure BIRD as follows. > > - In the export filter for protocol kernel, for the relevant routes, we > set an attribute 'krt_tunnel = tunl0'. > > - We modify BIRD, as in the attached patches, to understand that that > means that those routes should have 'dev tunl0'. > > Then instead, we get: > > On Host A: 10.65.0.3 via 10.240.0.5 dev tunl0 onlink > On Host B: 10.65.0.2 via 10.240.0.4 dev tunl0 onlink > > which allows successful routing of data between the Pods. > > > Thanks for reading this far! I now have three questions: > > 1. Does the routing approach above make sense? (Or is there some better > or simpler or already supported way that we could achieve the same thing?) > > 2. If (1), would the BIRD team accept patches broadly on the lines of > those that are attached? > > 3. If (2), please let me know if the attached patches are already > acceptable, or otherwise what further work is needed for them. > > Many thanks, > Neil > >
From bed50e27dd14aa98a89a2c9e0e7a63a87bcaa830 Mon Sep 17 00:00:00 2001 From: Shaun Crampton <shaun.cramp...@metaswitch.com> Date: Wed, 17 Jun 2015 16:14:41 -0700 Subject: [PATCH 3/3] Disable recursive route check for GCE compatibility. GCE uses a /32 for VM IPs with a default gateway that is, thus, off-subnet. This patch removes a check in BIRD that prevents BIRD from accepting such a route as a valid next hop. The following trail asserts that the check is not really needed but that it is normally a useful sanity check: https://www.mail-archive.com/bird-users@atrey.karlin.mff.cuni.cz/msg02600.html Ideally GCE would set the online flag on the route or use a link-local IP instead but that is unlikely to change. --- sysdep/linux/netlink.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c index d754df7..3ac5b27 100644 --- a/sysdep/linux/netlink.c +++ b/sysdep/linux/netlink.c @@ -521,8 +521,7 @@ nl_parse_multipath(struct krt_proto *p, struct rtattr *ra) memcpy(&rv->gw, RTA_DATA(a[RTA_GATEWAY]), sizeof(ip_addr)); ipa_ntoh(rv->gw); - neighbor *ng = neigh_find2(&p->p, &rv->gw, rv->iface, - (nh->rtnh_flags & RTNH_F_ONLINK) ? NEF_ONLINK : 0); + neighbor *ng = neigh_find2(&p->p, &rv->gw, rv->iface, NEF_ONLINK); if (!ng || (ng->scope == SCOPE_HOST)) return NULL; } @@ -1290,8 +1289,7 @@ nl_parse_route(struct nl_parse_state *s, struct nlmsghdr *h) return; #endif - ng = neigh_find2(&p->p, &ra->gw, ra->iface, - (i->rtm_flags & RTNH_F_ONLINK) ? NEF_ONLINK : 0); + ng = neigh_find2(&p->p, &ra->gw, ra->iface, NEF_ONLINK); if (!ng || (ng->scope == SCOPE_HOST)) { log(L_ERR "KRT: Received route %I/%d with strange next-hop %I", -- 2.7.4
From feda3ba2effd74936f7ba1b03da1e30a9dee3a16 Mon Sep 17 00:00:00 2001 From: Mike Evans <mike.ev...@metaswitch.com> Date: Thu, 4 Jun 2015 07:50:08 +0000 Subject: [PATCH 2/3] Add kernel tunnel dynamic attribute For IP-in-IP routes, use original next hop received in BGP UPDATE Instead of using a recursively resolved next hop, which will probably be to somewhere half way along the tunnel, from where packets won't be able to be routed any further. --- nest/route.h | 1 + proto/bgp/packets.c | 1 + sysdep/linux/krt-sys.h | 1 + sysdep/linux/netlink.Y | 4 +++- sysdep/linux/netlink.c | 23 +++++++++++++++++++++-- 5 files changed, 27 insertions(+), 3 deletions(-) diff --git a/nest/route.h b/nest/route.h index 2bfd066..3eb86c8 100644 --- a/nest/route.h +++ b/nest/route.h @@ -359,6 +359,7 @@ typedef struct rta { byte aflags; /* Attribute cache flags (RTAF_...) */ u16 hash_key; /* Hash over important fields */ u32 igp_metric; /* IGP metric to next hop (for iBGP routes) */ + ip_addr orig_gw; /* Original next hop from BGP UPDATE */ ip_addr gw; /* Next hop */ ip_addr from; /* Advertising router */ struct hostentry *hostentry; /* Hostentry for recursive next-hops */ diff --git a/proto/bgp/packets.c b/proto/bgp/packets.c index 0cf38ed..fba9d60 100644 --- a/proto/bgp/packets.c +++ b/proto/bgp/packets.c @@ -1180,6 +1180,7 @@ bgp_set_next_hop(struct bgp_proto *p, rta *a) if (ipa_zero(*nexthop)) return 0; + a->orig_gw = *nexthop; rta_set_recursive_next_hop(p->p.table, a, p->igp_table, nexthop, nexthop + second); } diff --git a/sysdep/linux/krt-sys.h b/sysdep/linux/krt-sys.h index 6d6586d..cd4d2ec 100644 --- a/sysdep/linux/krt-sys.h +++ b/sysdep/linux/krt-sys.h @@ -37,6 +37,7 @@ static inline struct ifa * kif_get_primary_ip(struct iface *i) { return NULL; } #define EA_KRT_PREFSRC EA_CODE(EAP_KRT, 0x10) #define EA_KRT_REALM EA_CODE(EAP_KRT, 0x11) #define EA_KRT_SCOPE EA_CODE(EAP_KRT, 0x12) +#define EA_KRT_TUNNEL EA_CODE(EAP_KRT, 0x13) #define KRT_METRICS_MAX 0x10 /* RTAX_QUICKACK+1 */ diff --git a/sysdep/linux/netlink.Y b/sysdep/linux/netlink.Y index f577244..4266333 100644 --- a/sysdep/linux/netlink.Y +++ b/sysdep/linux/netlink.Y @@ -15,7 +15,8 @@ CF_KEYWORDS(KERNEL, TABLE, METRIC, KRT_PREFSRC, KRT_REALM, KRT_SCOPE, KRT_MTU, K KRT_HOPLIMIT, KRT_INITCWND, KRT_RTO_MIN, KRT_INITRWND, KRT_QUICKACK, KRT_LOCK_MTU, KRT_LOCK_WINDOW, KRT_LOCK_RTT, KRT_LOCK_RTTVAR, KRT_LOCK_SSTRESH, KRT_LOCK_CWND, KRT_LOCK_ADVMSS, KRT_LOCK_REORDERING, - KRT_LOCK_HOPLIMIT, KRT_LOCK_RTO_MIN, KRT_FEATURE_ECN, KRT_FEATURE_ALLFRAG) + KRT_LOCK_HOPLIMIT, KRT_LOCK_RTO_MIN, KRT_FEATURE_ECN, KRT_FEATURE_ALLFRAG, + KRT_TUNNEL) CF_GRAMMAR @@ -29,6 +30,7 @@ kern_sys_item: CF_ADDTO(dynamic_attr, KRT_PREFSRC { $$ = f_new_dynamic_attr(EAF_TYPE_IP_ADDRESS, T_IP, EA_KRT_PREFSRC); }) CF_ADDTO(dynamic_attr, KRT_REALM { $$ = f_new_dynamic_attr(EAF_TYPE_INT, T_INT, EA_KRT_REALM); }) CF_ADDTO(dynamic_attr, KRT_SCOPE { $$ = f_new_dynamic_attr(EAF_TYPE_INT, T_INT, EA_KRT_SCOPE); }) +CF_ADDTO(dynamic_attr, KRT_TUNNEL { $$ = f_new_dynamic_attr(EAF_TYPE_STRING, T_STRING, EA_KRT_TUNNEL); }) CF_ADDTO(dynamic_attr, KRT_MTU { $$ = f_new_dynamic_attr(EAF_TYPE_INT, T_INT, EA_KRT_MTU); }) CF_ADDTO(dynamic_attr, KRT_WINDOW { $$ = f_new_dynamic_attr(EAF_TYPE_INT, T_INT, EA_KRT_WINDOW); }) diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c index 1490213..d754df7 100644 --- a/sysdep/linux/netlink.c +++ b/sysdep/linux/netlink.c @@ -887,6 +887,7 @@ nl_send_route(struct krt_proto *p, rte *e, struct ea_list *eattrs, int op, int d struct rtmsg r; char buf[128 + KRT_METRICS_MAX*8 + nh_bufsize(a->nexthops)]; } r; + struct iface* i; DBG("nl_send_route(%I/%d,op=%x)\n", net->n.prefix, net->n.pxlen, op); @@ -962,8 +963,22 @@ dest: { case RTD_ROUTER: r.r.rtm_type = RTN_UNICAST; - nl_add_attr_u32(&r.h, sizeof(r), RTA_OIF, iface->index); - nl_add_attr_ipa(&r.h, sizeof(r), RTA_GATEWAY, gw); + if ((ea = ea_find(eattrs, EA_KRT_TUNNEL)) && + (i = if_find_by_name(ea->u.ptr->data))) + { + /* + * Tunnel attribute is set, so set the route up using the specified tunnel device + * to the originator of the route. + */ + nl_add_attr_u32(&r.h, sizeof(r), RTA_OIF, i->index); + nl_add_attr_ipa(&r.h, sizeof(r), RTA_GATEWAY, a->orig_gw); + r.r.rtm_flags |= RTNH_F_ONLINK; + } + else + { + nl_add_attr_u32(&r.h, sizeof(r), RTA_OIF, iface->index); + nl_add_attr_ipa(&r.h, sizeof(r), RTA_GATEWAY, gw); + } break; case RTD_DEVICE: r.r.rtm_type = RTN_UNICAST; @@ -1652,6 +1667,10 @@ krt_sys_get_attr(eattr *a, byte *buf, int buflen UNUSED) bsprintf(buf, "scope"); return GA_NAME; + case EA_KRT_TUNNEL: + bsprintf(buf, "tunnel"); + return GA_NAME; + case EA_KRT_LOCK: buf += bsprintf(buf, "lock:"); ea_format_bitfield(a, buf, buflen, krt_metrics_names, 2, KRT_METRICS_MAX); -- 2.7.4
From b06a7cf7b9d34201c57baaa525003a9d71a0f2ff Mon Sep 17 00:00:00 2001 From: Mike Evans <mike.ev...@metaswitch.com> Date: Thu, 4 Jun 2015 12:05:39 +0000 Subject: [PATCH 1/3] Add support for EAF_TYPE_STRING - ie. read-write string extended attributes --- filter/filter.c | 20 ++++++++++++++++++-- nest/route.h | 1 + 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/filter/filter.c b/filter/filter.c index 2f5f00d..b2f137f 100644 --- a/filter/filter.c +++ b/filter/filter.c @@ -555,6 +555,8 @@ interpret(struct f_inst *what) unsigned u1, u2; int i; u32 as; + int len; + struct adata *ad; res.type = T_VOID; if (!what) @@ -922,6 +924,10 @@ interpret(struct f_inst *what) res.type = T_INT; res.val.i = e->u.data; break; + case EAF_TYPE_STRING: + res.type = T_STRING; + res.val.s = e->u.ptr->data; + break; case EAF_TYPE_ROUTER_ID: res.type = T_QUAD; res.val.i = e->u.data; @@ -980,6 +986,16 @@ interpret(struct f_inst *what) l->attrs[0].u.data = v1.val.i; break; + case EAF_TYPE_STRING: + if (v1.type != T_STRING) + runtime( "Setting string attribute to non-string value" ); + len = strlen(v1.val.s) + 1; + ad = lp_alloc(f_pool, sizeof(struct adata) + len); + ad->length = len; + strcpy(ad->data, v1.val.s); + l->attrs[0].u.ptr = ad; + break; + case EAF_TYPE_ROUTER_ID: #ifndef IPV6 /* IP->Quad implicit conversion */ @@ -1000,8 +1016,8 @@ interpret(struct f_inst *what) case EAF_TYPE_IP_ADDRESS: if (v1.type != T_IP) runtime( "Setting ip attribute to non-ip value" ); - int len = sizeof(ip_addr); - struct adata *ad = lp_alloc(f_pool, sizeof(struct adata) + len); + len = sizeof(ip_addr); + ad = lp_alloc(f_pool, sizeof(struct adata) + len); ad->length = len; (* (ip_addr *) ad->data) = v1.val.px.ip; l->attrs[0].u.ptr = ad; diff --git a/nest/route.h b/nest/route.h index 2fcb189..2bfd066 100644 --- a/nest/route.h +++ b/nest/route.h @@ -445,6 +445,7 @@ typedef struct eattr { #define EAF_TYPE_MASK 0x0f /* Mask with this to get type */ #define EAF_TYPE_INT 0x01 /* 32-bit unsigned integer number */ #define EAF_TYPE_OPAQUE 0x02 /* Opaque byte string (not filterable) */ +#define EAF_TYPE_STRING 0x03 /* String */ #define EAF_TYPE_IP_ADDRESS 0x04 /* IP address */ #define EAF_TYPE_ROUTER_ID 0x05 /* Router ID (IPv4 address) */ #define EAF_TYPE_AS_PATH 0x06 /* BGP AS path (encoding per RFC 1771:4.3) */ -- 2.7.4