On Thu, Oct 9, 2014 at 2:53 PM, Jarno Rajahalme <jrajaha...@nicira.com> wrote: > Pravin, > > Please find my comments below. I did not snip any code to make it easier for > you to keep into context while reading this review. > Thanks for detailed review.
> Jarno > > On Oct 3, 2014, at 8:24 PM, Pravin B Shelar <pshe...@nicira.com> wrote: > >> Following patch adds support for userspace tunneling. Tunneling >> needs three more component first is routing table which is configured by >> user and second is ARP cache which build automatically by snooping arp. >> And third is tunnel protocol table which list all listening protocols >> which is populated by vswitchd as tunnel ports are added. >> >> Tunneling works as follows: >> On packet receive vswitchd check if this packet is targeted to tunnel >> port. If it is then vswitchd inserts tunnel pop action which pops >> header and sends packet to tunnel port. >> On packet xmit rather than generating Set tunnel action it generate >> tunnel push action which has tunnel header data. datapath can use >> tunnel-push action data to generate header for each packet and >> forward this packet to output port. Since tunnel-push action >> contains most of packet header it need to lookup routing table and >> arp table to build this action. >> >> Signed-off-by: Pravin B Shelar <pshe...@nicira.com> >> --- >> Makefile.am | 1 + >> README-Userspace-Tunneling | 81 ++++++++ >> datapath/linux/compat/include/linux/openvswitch.h | 10 + >> debian/openvswitch-common.docs | 1 + >> lib/automake.mk | 9 +- >> lib/dpif-netdev.c | 125 +++++++++++- >> lib/dpif.c | 25 +++ >> lib/dpif.h | 4 + >> lib/netdev-bsd.c | 3 + >> lib/netdev-dpdk.c | 3 + >> lib/netdev-dummy.c | 3 + >> lib/netdev-linux.c | 3 + >> lib/netdev-provider.h | 9 + >> lib/netdev-vport.c | 186 +++++++++++++++-- >> lib/netdev.c | 34 ++++ >> lib/netdev.h | 7 + >> lib/odp-execute.c | 2 + >> lib/odp-util.c | 118 +++++++++++ >> lib/odp-util.h | 4 + >> lib/ofpbuf.h | 8 + >> lib/packets.c | 35 ++++ >> lib/packets.h | 9 +- >> lib/tnl-arp-cache.c | 214 +++++++++++++++++++ >> lib/tnl-arp-cache.h | 47 +++++ >> lib/tnl-ports.c | 203 ++++++++++++++++++ >> lib/tnl-ports.h | 49 +++++ >> lib/tnl-router.c | 237 >> ++++++++++++++++++++++ >> lib/tnl-router.h | 42 ++++ >> lib/vlog.c | 1 + >> lib/vxlan.h | 46 +++++ >> ofproto/ofproto-dpif-xlate.c | 176 +++++++++++++++- >> ofproto/ofproto-dpif-xlate.h | 3 +- >> ofproto/ofproto-dpif.c | 38 +++- >> ofproto/tunnel.c | 79 +++++++- >> ofproto/tunnel.h | 10 +- >> rhel/openvswitch.spec.in | 2 +- >> vswitchd/ovs-vswitchd.8.in | 4 + >> vswitchd/ovs-vswitchd.c | 6 + >> 38 files changed, 1785 insertions(+), 52 deletions(-) >> create mode 100644 README-Userspace-Tunneling >> create mode 100644 lib/tnl-arp-cache.c >> create mode 100644 lib/tnl-arp-cache.h >> create mode 100644 lib/tnl-ports.c >> create mode 100644 lib/tnl-ports.h >> create mode 100644 lib/tnl-router.c >> create mode 100644 lib/tnl-router.h >> create mode 100644 lib/vxlan.h >> >> diff --git a/Makefile.am b/Makefile.am >> index 77ceec6..3941e0f 100644 >> --- a/Makefile.am >> +++ b/Makefile.am >> @@ -84,6 +84,7 @@ EXTRA_DIST = \ >> PORTING \ >> README.md \ >> README-lisp \ >> + README-Userspace-Tunneling \ >> REPORTING-BUGS \ >> TODO \ >> .travis.yml \ >> diff --git a/README-Userspace-Tunneling b/README-Userspace-Tunneling >> new file mode 100644 >> index 0000000..1aebbca >> --- /dev/null >> +++ b/README-Userspace-Tunneling >> @@ -0,0 +1,81 @@ >> + >> +Open vSwitch support tunneling in userspace. Tunneling is implemented in >> +platform independent way. But it does expect following from platform: >> +1. dpif-netdev bridge is backed by netdev should be able to >> + reply to ARP requests. > > What does it mean for a netdev to be able to reply to ARP requests? Or do you > mean that the bridge should be able to reply to ARP requests? > yes, bridge is suppose to reply to ARP, but bridge is backed by a netdev implementation the netdev should be able to reply to arp. Now I think it is too technical, I will try to simplify it. >> +2. OVS tunneling routing table is configured. >> + >> +Setup: >> +====== >> +First start vswitch with --userspace-tunneling option to enable tunneling. > > I’ll come back to this later, but I think we should treat the new tunneling > feature like any other optional datapath feature: Probe for it and (only) use > it if the probe was successful. That way it will be easier for other > datapaths to implement this and make use of it. I don’t know any details of, > e.g., Windows datapath tunneling plans, but it should be feasible to use the > same methods also with kernel datapaths. > I do not see tnl-push/pop getting implemented in other datapath any time soo, so probing can be static. >> +Setup physical bridges for app physical interfaces. create integration >> bridge. > > Would it be possible to use the new tunneling feature with a single bridge? I > see it may require recirculation, but is that really different from the case > of different physical and integration bridges? I’m also trying to figure out > if this model could fit the case of switching ASIC supporting tunneling, > where a single switch view might be preferable. > If there is no physical bridge then netdev needs to run IP routing and ARP, I wanted to use it from base OS. It is doable, Lets introduce simpler model. >> +Add vxlan port to int-bridge. Assign IP address to physical bridge where >> +vxlan traffic is expected. Add tunnel routes for the hysical bridge. > > “physical” > ok. >> + >> +Example: >> +======== >> +If you want to connect to vxlan tunnel endpoint (logical ip: 192.168.1.2 >> +and remote physical ip: 172.168.1.2) that you want to connect to. >> + >> +In this case you need to configure OVS bridges as follows. >> + >> +1. Let assume 172.168.1.2/24 network is reachable via eth1 create physical >> bridge br-eth1 >> + assign ip address (172.168.1.1/24) to br-eth1, Add eth1 to br-eth1 >> +2. Add route to ovs >> + ovs-appctl tnl/route/add 172.168.1.1 255.255.255.0 br-eth1 > > Could OVS do this step automatically based on the assigned IP address, as all > the parameters are already there? > done. > I see that if the route involves a gateway router it needs to be explicitly > added, though. > OVS already does route cache by listening to RTNL msg, we can use same. I can extract gw information from there. >> +3. Add integration brdge int-br and add tunnel port using standard syntax. >> + ovs-vsctl add-port int-br vxlan0 -- set interface vxlan0 type=vxlan >> options:remote_ip=172.168.1.2 >> +4. Assign IP address to int-br, So final topology looks like: >> + >> + >> + 192.168.1.1/24 >> + +--------------+ >> + | int-br | 192.168.1.2/24 >> + +--------------+ >> +--------------+ >> + | vxlan0 | | vxlan0 >> | >> + +--------------+ >> +--------------+ >> + | | >> + | | >> + | | >> + 172.168.1.1/24 | >> + +--------------+ | >> + | br-eth1 | 172.168.1.2/24 >> + +--------------+ >> +---------------+ >> + | eth1 |--------------------------------------| eth1 >> | >> + +--------------+ >> +---------------- >> + >> + Host A with OVS. Remote host. >> + >> +With this setup, ping to vxlan target device (192.168.1.2) should wotk > > “wotk” -> “work" > ok. >> +There are following commands that shows internal tables: >> + >> +Tunneling related commands: >> +=========================== >> +Tunnel routing table: >> + To Add route: >> + ovs-appctl tnl/route/add <IP address> <mask> <output-bridge-name> > > Does this syntax support routing via a gateway router, when the peer is not > in the same subnet? > yes, I will update doc. >> + To see all routes configured: >> + ovs-appctl tnl/route/show >> + To del route >> + ovs-appctl tnl/route/del <IP address> <mask> >> + >> +ARP: >> + To see arp cache content: >> + ovs-appctl tnl/arp/show >> + To flush arp cache >> + ovs-appctl tnl/arp/flush >> + >> +To check tunnel ports listening in vswitchd: >> +ovs-appctl tnl/ports/show >> + >> +Limits: >> +======= >> + - Does not handle IP fragments >> + - Only VxLan support >> + - Does not works correctly with vlans. >> + >> +I hope we will fix it soon. >> + >> +Contact >> +======= >> +b...@openvswitch.org >> diff --git a/datapath/linux/compat/include/linux/openvswitch.h >> b/datapath/linux/compat/include/linux/openvswitch.h >> index b2257e6..22647e9 100644 >> --- a/datapath/linux/compat/include/linux/openvswitch.h >> +++ b/datapath/linux/compat/include/linux/openvswitch.h >> @@ -580,6 +580,12 @@ struct ovs_action_hash { >> uint32_t hash_basis; >> }; >> >> +enum ovs_tunnel_action_attr { >> + OVS_ACTION_TUNNEL_HEADER_PORT, >> + OVS_ACTION_TUNNEL_HEADER_PHY_PORT, >> + OVS_ACTION_TUNNEL_HEADER_DATA, >> +}; >> + >> /** >> * enum ovs_action_attr - Action types. >> * >> @@ -633,6 +639,10 @@ enum ovs_action_attr { >> * data immediately followed by a mask. >> * The data must be zero for the unmasked >> * bits. */ >> +#ifndef __KERNEL__ >> + OVS_ACTION_ATTR_TUNNEL_PUSH = 64, >> + OVS_ACTION_ATTR_TUNNEL_POP, >> +#endif > > Managing the number allocation will become tricky if we’ll get e.g. Windows > kernel DP supporting these. > yes, that is general problem. I do not want to solve in this patch. >> __OVS_ACTION_ATTR_MAX >> }; >> >> diff --git a/debian/openvswitch-common.docs b/debian/openvswitch-common.docs >> index 3bd2ca3..9d053e2 100644 >> --- a/debian/openvswitch-common.docs >> +++ b/debian/openvswitch-common.docs >> @@ -1,2 +1,3 @@ >> FAQ >> INSTALL.DPDK >> +README-Userspace-Tunneling >> diff --git a/lib/automake.mk b/lib/automake.mk >> index c44a77f..da54204 100644 >> --- a/lib/automake.mk >> +++ b/lib/automake.mk >> @@ -231,6 +231,12 @@ lib_libopenvswitch_la_SOURCES = \ >> lib/timer.h \ >> lib/timeval.c \ >> lib/timeval.h \ >> + lib/tnl-arp-cache.c \ >> + lib/tnl-arp-cache.h \ >> + lib/tnl-ports.c \ >> + lib/tnl-ports.h \ >> + lib/tnl-router.c \ >> + lib/tnl-router.h \ >> lib/token-bucket.c \ >> lib/token-bucket.h \ >> lib/type-props.h \ >> @@ -257,7 +263,8 @@ lib_libopenvswitch_la_SOURCES = \ >> lib/vswitch-idl.c \ >> lib/vswitch-idl.h \ >> lib/vtep-idl.c \ >> - lib/vtep-idl.h >> + lib/vtep-idl.h \ >> + lib/vxlan.h >> >> if WIN32 >> lib_libopenvswitch_la_SOURCES += \ >> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c >> index a7c9c92..81600f0 100644 >> --- a/lib/dpif-netdev.c >> +++ b/lib/dpif-netdev.c >> @@ -2791,15 +2791,83 @@ dpif_netdev_register_upcall_cb(struct dpif *dpif, >> upcall_callback *cb, >> static void >> dp_netdev_drop_packets(struct dpif_packet ** packets, int cnt, bool >> may_steal) >> { >> - int i; >> - >> if (may_steal) { >> + int i; >> + >> for (i = 0; i < cnt; i++) { >> dpif_packet_delete(packets[i]); >> } >> } >> } >> >> +static int >> +odp_push_tunnel_action(const struct dp_netdev *dp, >> + const struct nlattr *attr, >> + struct dpif_packet **packets, int cnt) >> +{ >> + struct dp_netdev_port *tun_port; >> + unsigned int left; >> + const struct nlattr *a; >> + const void *header = NULL; >> + int header_len = 0; >> + odp_port_t tunnel_oprt = 0; >> + odp_port_t phy_oprt = 0; >> + int i; >> + >> + NL_NESTED_FOR_EACH(a, left, attr) { >> + uint16_t type = nl_attr_type(a); >> + size_t len = nl_attr_get_size(a); >> + >> + switch (type) { >> + case OVS_ACTION_TUNNEL_HEADER_PORT: >> + tunnel_oprt = u32_to_odp(nl_attr_get_u32(a)); >> + break; >> + >> + case OVS_ACTION_TUNNEL_HEADER_PHY_PORT: >> + phy_oprt = u32_to_odp(nl_attr_get_u32(a)); >> + break; >> + >> + case OVS_ACTION_TUNNEL_HEADER_DATA: >> + header = nl_attr_get(a); >> + header_len = len; >> + break; >> + >> + default: >> + VLOG_ERR("Unknown tunnel push param."); >> + return -EINVAL; >> + } >> + } >> + if (!header_len || !tunnel_oprt || !phy_oprt) { >> + VLOG_ERR("Insufficient tunnel push param."); >> + return -EINVAL; >> + } >> + >> + tun_port = dp_netdev_lookup_port(dp, tunnel_oprt); >> + if (!tun_port) { >> + return -EINVAL; >> + } >> + netdev_push_header(tun_port->netdev, packets, cnt, header, header_len); >> + >> + for (i = 0; i < cnt; i++) { >> + packets[i]->md = PKT_METADATA_INITIALIZER(phy_oprt); > > I see the need to clear out the metadata, as the inner packet is now hidden > by encapsulation. However, I’m not sure if it is a good idea to overload the > ‘in_port’ member for the output tunnel port, though. > This is recirc type of action, so output port for punnel push is inport for next action execution cycle. > Also, maybe this step should be performed by netdev_push_header(), as it is > transforming the packets anyway. > ok. >> + } >> + return 0; >> +} >> + >> +static void >> +dp_netdev_clone_pkt_batch(struct dpif_packet **tnl_pkt, >> + struct dpif_packet **packets, int cnt) >> +{ >> + int i; >> + >> + for (i = 0; i < cnt; i++) { >> + struct dpif_packet *inner_pkt; >> + >> + inner_pkt = dpif_packet_clone(packets[i]); >> + tnl_pkt[i] = inner_pkt; >> + } >> +} >> + >> static void >> dp_execute_cb(void *aux_, struct dpif_packet **packets, int cnt, >> const struct nlattr *a, bool may_steal) >> @@ -2822,6 +2890,59 @@ dp_execute_cb(void *aux_, struct dpif_packet >> **packets, int cnt, >> } >> break; >> >> + case OVS_ACTION_ATTR_TUNNEL_PUSH: >> + if (*depth < MAX_RECIRC_DEPTH) { >> + struct dpif_packet *tnl_pkt[NETDEV_MAX_RX_BATCH]; >> + int err; >> + >> + if (may_steal) { >> + dp_netdev_clone_pkt_batch(tnl_pkt, packets, cnt); >> + packets = tnl_pkt; >> + } > > Should this be the reverse? Clone if can NOT take the packets? > right, >> + >> + err = odp_push_tunnel_action(dp, a, packets, cnt); >> + if (err) { >> + dp_netdev_drop_packets(tnl_pkt, cnt, may_steal); >> + break; >> + } >> + >> + (*depth)++; >> + dp_netdev_input(pmd, packets, cnt); >> + (*depth)--; >> + return; >> + } > > Should “break” here. > packets are already consumed so we can not break here. >> + >> + case OVS_ACTION_ATTR_TUNNEL_POP: >> + if (*depth >= MAX_RECIRC_DEPTH) { >> + break; >> + } >> + >> + p = dp_netdev_lookup_port(dp, u32_to_odp(nl_attr_get_u32(a))); >> + if (p) { >> + struct dpif_packet *tnl_pkt[NETDEV_MAX_RX_BATCH]; >> + int err; >> + >> + if (may_steal) { >> + dp_netdev_clone_pkt_batch(tnl_pkt, packets, cnt); >> + packets = tnl_pkt; >> + } > > Ditto. > >> + >> + err = netdev_pop_header(p->netdev, packets, cnt); >> + if (!err) { >> + >> + for (i = 0; i < cnt; i++) { >> + packets[i]->md.in_port.odp_port = >> u32_to_odp(nl_attr_get_u32(a)); > > Maybe use the same port already taken from nl_attr above? > ok. >> + } >> + >> + (*depth)++; >> + dp_netdev_input(pmd, packets, cnt); >> + (*depth)--; >> + return; >> + } >> + dp_netdev_drop_packets(tnl_pkt, cnt, may_steal); >> + } >> + break; >> + >> case OVS_ACTION_ATTR_USERSPACE: >> if (!fat_rwlock_tryrdlock(&dp->upcall_rwlock)) { >> const struct nlattr *userdata; >> diff --git a/lib/dpif.c b/lib/dpif.c >> index 34d8c13..669ca3f 100644 >> --- a/lib/dpif.c >> +++ b/lib/dpif.c >> @@ -41,6 +41,8 @@ >> #include "shash.h" >> #include "sset.h" >> #include "timeval.h" >> +#include "tnl-arp-cache.h" >> +#include "tnl-ports.h" >> #include "util.h" >> #include "valgrind.h" >> #include "vlog.h" >> @@ -72,6 +74,9 @@ struct registered_dpif_class { >> static struct shash dpif_classes = SHASH_INITIALIZER(&dpif_classes); >> static struct sset dpif_blacklist = SSET_INITIALIZER(&dpif_blacklist); >> >> +/* User configured value, by default off. */ >> +bool userspace_tunneling = false; >> + > > I’d make this a datapath feature we probe with a flow set-up. Then use the > feature whenever it is available. > >> /* Protects 'dpif_classes', including the refcount, and 'dpif_blacklist'. */ >> static struct ovs_mutex dpif_mutex = OVS_MUTEX_INITIALIZER; >> >> @@ -114,6 +119,8 @@ dp_initialize(void) >> } >> dpctl_unixctl_register(); >> ovsthread_once_done(&once); >> + tnl_port_init(); >> + tnl_arp_cache_init(); >> } >> } >> >> @@ -408,6 +415,7 @@ dpif_run(struct dpif *dpif) >> if (dpif->dpif_class->run) { >> dpif->dpif_class->run(dpif); >> } >> + tnl_arp_cache_run(); >> } >> >> /* Arranges for poll_block() to wake up when dp_run() needs to be called for >> @@ -1002,6 +1010,8 @@ dpif_execute_helper_cb(void *aux_, struct dpif_packet >> **packets, int cnt, >> >> switch ((enum ovs_action_attr)type) { >> case OVS_ACTION_ATTR_OUTPUT: >> + case OVS_ACTION_ATTR_TUNNEL_PUSH: >> + case OVS_ACTION_ATTR_TUNNEL_POP: >> case OVS_ACTION_ATTR_USERSPACE: >> case OVS_ACTION_ATTR_RECIRC: { >> struct dpif_execute execute; >> @@ -1591,3 +1601,18 @@ log_flow_get_message(const struct dpif *dpif, const >> struct dpif_flow_get *get, >> get->flow->actions, get->flow->actions_len); >> } >> } >> + >> +void dpif_enable_userspace_tunneling(void) >> +{ >> + userspace_tunneling = true; >> +} >> + >> +bool dpif_is_userspace_tunneling_on(void) >> +{ >> + return userspace_tunneling; >> +} >> + >> +bool dpif_support_userspace_tunneling(const struct dpif *dpif) >> +{ >> + return !strcmp(dpif->dpif_class->type, "netdev"); >> +} > > All these would not be necessary if ofproto-dpif probes for the new actions, > and each datapath then either supports them or not. > already replied to comment above. >> diff --git a/lib/dpif.h b/lib/dpif.h >> index f88fa78..7402a85 100644 >> --- a/lib/dpif.h >> +++ b/lib/dpif.h >> @@ -790,6 +790,10 @@ void dpif_get_netflow_ids(const struct dpif *, >> int dpif_queue_to_priority(const struct dpif *, uint32_t queue_id, >> uint32_t *priority); >> >> +void dpif_enable_userspace_tunneling(void); >> +bool dpif_is_userspace_tunneling_on(void); >> +bool dpif_support_userspace_tunneling(const struct dpif *); >> + >> #ifdef __cplusplus >> } >> #endif >> diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c >> index 1bd9108..473c0f6 100644 >> --- a/lib/netdev-bsd.c >> +++ b/lib/netdev-bsd.c >> @@ -1562,6 +1562,9 @@ netdev_bsd_update_flags(struct netdev *netdev_, enum >> netdev_flags off, >> NULL, /* get_config */ \ >> NULL, /* set_config */ \ >> NULL, /* get_tunnel_config */ \ >> + NULL, /* build header */ \ >> + NULL, /* push header */ \ >> + NULL, /* pop header */ \ >> NULL, /* get_numa_id */ \ >> NULL, /* set_multiq */ \ >> \ >> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c >> index 9c93768..c8610d7 100644 >> --- a/lib/netdev-dpdk.c >> +++ b/lib/netdev-dpdk.c >> @@ -1392,6 +1392,9 @@ unlock_dpdk: >> netdev_dpdk_get_config, \ >> NULL, /* netdev_dpdk_set_config */ \ >> NULL, /* get_tunnel_config */ \ >> + NULL, /* build header */ \ >> + NULL, /* push header */ \ >> + NULL, /* pop header */ \ >> netdev_dpdk_get_numa_id, /* get_numa_id */ \ >> MULTIQ, /* set_multiq */ \ >> \ >> diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c >> index a2b1f2c..22d980a 100644 >> --- a/lib/netdev-dummy.c >> +++ b/lib/netdev-dummy.c >> @@ -1032,6 +1032,9 @@ static const struct netdev_class dummy_class = { >> netdev_dummy_get_config, >> netdev_dummy_set_config, >> NULL, /* get_tunnel_config */ >> + NULL, /* build header */ >> + NULL, /* push header */ >> + NULL, /* pop header */ >> NULL, /* get_numa_id */ >> NULL, /* set_multiq */ >> >> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c >> index ab5ef7d..e89063a 100644 >> --- a/lib/netdev-linux.c >> +++ b/lib/netdev-linux.c >> @@ -2715,6 +2715,9 @@ netdev_linux_update_flags(struct netdev *netdev_, enum >> netdev_flags off, >> NULL, /* get_config */ \ >> NULL, /* set_config */ \ >> NULL, /* get_tunnel_config */ \ >> + NULL, /* build header */ \ >> + NULL, /* push header */ \ >> + NULL, /* pop header */ \ >> NULL, /* get_numa_id */ \ >> NULL, /* set_multiq */ \ >> \ >> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h >> index 0e4cda5..50b63a6 100644 >> --- a/lib/netdev-provider.h >> +++ b/lib/netdev-provider.h >> @@ -254,6 +254,15 @@ struct netdev_class { >> const struct netdev_tunnel_config * >> (*get_tunnel_config)(const struct netdev *netdev); >> >> + void (*build_header)(const struct netdev *, struct ofpbuf *); >> + >> + int (*push_header)(const struct netdev *netdev, >> + struct dpif_packet **buffers, int cnt, >> + const void *header, int size); >> + >> + int (*pop_header)(struct netdev *netdev, >> + struct dpif_packet **buffers, int cnt); >> + >> /* Returns the id of the numa node the 'netdev' is on. If there is no >> * such info, returns NETDEV_NUMA_UNSPEC. */ >> int (*get_numa_id)(const struct netdev *netdev); >> diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c >> index 83f1296..33ff49b 100644 >> --- a/lib/netdev-vport.c >> +++ b/lib/netdev-vport.c >> @@ -25,6 +25,7 @@ >> #include <sys/ioctl.h> >> >> #include "byte-order.h" >> +#include "csum.h" >> #include "daemon.h" >> #include "dirs.h" >> #include "dpif.h" >> @@ -32,13 +33,17 @@ >> #include "hmap.h" >> #include "list.h" >> #include "netdev-provider.h" >> +#include "vxlan.h" >> #include "ofpbuf.h" >> #include "packets.h" >> +#include "packet-dpif.h" >> #include "poll-loop.h" >> #include "route-table.h" >> #include "shash.h" >> #include "socket-util.h" >> #include "vlog.h" >> +#include "unaligned.h" >> +#include "util.h" >> >> VLOG_DEFINE_THIS_MODULE(netdev_vport); >> >> @@ -221,10 +226,24 @@ netdev_vport_alloc(void) >> static int >> netdev_vport_construct(struct netdev *netdev_) >> { >> - struct netdev_vport *netdev = netdev_vport_cast(netdev_); >> + struct netdev_vport *dev = netdev_vport_cast(netdev_); >> + const char *type = netdev_get_type(netdev_); >> + >> + ovs_mutex_init(&dev->mutex); >> + eth_addr_random(dev->etheraddr); >> + >> + /* Add a default destination port for tunnel ports if none specified. */ >> + if (!strcmp(type, "geneve")) { >> + dev->tnl_cfg.dst_port = htons(GENEVE_DST_PORT); >> + } >> + >> + if (!strcmp(type, "vxlan")) { >> + dev->tnl_cfg.dst_port = htons(VXLAN_DST_PORT); >> + } >> >> - ovs_mutex_init(&netdev->mutex); >> - eth_addr_random(netdev->etheraddr); >> + if (!strcmp(type, "lisp")) { >> + dev->tnl_cfg.dst_port = htons(LISP_DST_PORT); >> + } >> >> route_table_register(); >> >> @@ -761,9 +780,134 @@ get_stats(const struct netdev *netdev, struct >> netdev_stats *stats) >> >> return 0; >> } >> + >> +static bool >> +vxlan_extract_md(struct dpif_packet *dpif_pkt) >> +{ >> + struct ofpbuf *packet = &dpif_pkt->ofpbuf; >> + struct pkt_metadata *md = &dpif_pkt->md; >> + struct flow_tnl *tnl = &md->tunnel; >> + struct ip_header *nh; >> + struct udp_header *udp; >> + struct vxlanhdr *vxh; >> + >> + nh = ofpbuf_l3(packet); >> + udp = ofpbuf_l4(packet); >> + >> + memset(md, 0, sizeof(*md)); >> + >> + if (!nh || !udp) { >> + return false; >> + } >> + if (((const char *) ofpbuf_tail(packet) - (const char *) udp) < >> + (UDP_HEADER_LEN + sizeof (*vxh))) { >> + return false; >> + } >> + vxh = (struct vxlanhdr *) ((const char *)udp + sizeof(*udp)); >> + >> + if (vxh->vx_flags != htonl(VXLAN_FLAGS) || >> + (vxh->vx_vni & htonl(0xff))) { >> + VLOG_ERR("invalid vxlan flags=%#x vni=%#x\n", >> + ntohl(vxh->vx_flags), ntohl(vxh->vx_vni)); >> + return false; >> + } >> + >> + tnl->tun_id = vxh->vx_vni; >> + tnl->ip_src = get_16aligned_be32(&nh->ip_src); >> + tnl->ip_dst = get_16aligned_be32(&nh->ip_dst); >> + tnl->ip_tos = nh->ip_tos; >> + tnl->ip_ttl = nh->ip_ttl; >> + >> + ofpbuf_reset_packet(packet, ((const char *) (vxh + 1) - ((const char *) >> packet->frame))); >> + return true; >> +} >> + >> +static int >> +netdev_vxlan_pop_header(struct netdev *netdev_ OVS_UNUSED, >> + struct dpif_packet **pkt, int cnt) >> +{ >> + int i; >> + >> + for (i = 0; i < cnt; i++) { >> + vxlan_extract_md(pkt[i]); >> + } >> + return 0; >> +} >> + >> +static void >> +netdev_vxlan_build_header(const struct netdev *netdev, struct ofpbuf >> *header) >> +{ >> + struct netdev_vport *dev = netdev_vport_cast(netdev); >> + struct netdev_tunnel_config *tnl_cfg; >> + const struct eth_header *eth; >> + struct ip_header *ip; >> + struct udp_header *udp; >> + struct vxlanhdr *vxh; >> + >> + ovs_mutex_lock(&dev->mutex); >> + tnl_cfg = &dev->tnl_cfg; > > We ought to make tnl_cfg RCU to get rid of the mutex here (and likely > elsewhere as well). Just a note, I think it would be better to do this in a > separate patch anyway. > OK, I added TODO. >> + >> + eth = (const struct eth_header *) ofpbuf_data(header); >> + >> + ip = (struct ip_header *) ((const char *)eth + sizeof *eth); >> + ip->ip_proto = IPPROTO_UDP; >> + >> + udp = ofpbuf_put_zeros(header, sizeof *udp); >> + /* Set src port in push tunnel. */ >> + udp->udp_src = 44444; > > If push sets udp_src, why set it here at all? > ok. >> + udp->udp_dst = tnl_cfg->dst_port; >> + /* udp_csum is zero */ > > Would it not be the push that should be responsible for setting the checksum, > if we used one? > ok, moved to push. >> + >> + vxh = ofpbuf_put_zeros(header, sizeof *vxh); > > Could use ofpbuf_put_uninit() here instead... > I changed struct definition so it is no longer required. >> + vxh->vx_flags = htonl(VXLAN_FLAGS); >> + vxh->vx_vni = tnl_cfg->out_key & htonl(0xff); >> + > > … as we are setting all the header values here. > >> + ovs_mutex_unlock(&dev->mutex); >> +} >> + >> +static void >> +netdev_vxlan_push_header__(struct ofpbuf *packet, >> + const void *header, int size) >> +{ >> + struct eth_header *eth; >> + struct ip_header *ip; >> + struct udp_header *udp; >> + struct vxlanhdr *vxh; >> + int hdr_size; > > I’d call this “l2_size” > ok. >> + int tot_size; > > and this “ip_tot_size”, would make code easier to follow. > >> + >> + eth = ofpbuf_push_uninit(packet, size); >> + hdr_size = sizeof *ip + sizeof *udp + sizeof *vxh; > > l2_size = size - (sizeof *ip + sizeof *udp + sizeof *vxh); > ok >> + tot_size = ofpbuf_size(packet) - size + hdr_size; > > ip_tot_size = ofpbuf_size(packet) - l2_size; > ok. >> + >> + memcpy(eth, header, size); >> + ip = (struct ip_header *) ((const char *)eth + sizeof *eth); > > Use this instead, to be better compatible with VLANs: > ok. But I do not think we need to deal with VLAN here. > ip = (struct ip_header *) ((const char *)eth + l2_size); > >> + ip->ip_tot_len = htons(tot_size); >> + ip->ip_csum = csum(ip, sizeof *ip); > > Could we precompute the ip_csum for zero ip_tot_len and then update the > checksum here based on the real value? > ok. >> + >> + udp = (struct udp_header *) (ip + 1); >> + /* set udp src port*/ >> + udp->udp_len = htons(ntohs(ip->ip_tot_len) - sizeof *ip); > > IMO we should also set the UDP checksum (to zero) here. > ok. >> +} >> + >> +static int >> +netdev_vxlan_push_header(const struct netdev *netdev OVS_UNUSED, >> + struct dpif_packet **buffers, int cnt, > > ‘packets' would be a better name for the ‘buffers’ argument. > ok >> + const void *header, int size) >> +{ >> + int i; >> + >> + for (i = 0; i < cnt; i++) { >> + netdev_vxlan_push_header__(&buffers[i]->ofpbuf, header, size); >> + } >> + return 0; >> +} >> + >> >> #define VPORT_FUNCTIONS(GET_CONFIG, SET_CONFIG, \ >> - GET_TUNNEL_CONFIG, GET_STATUS) \ >> + GET_TUNNEL_CONFIG, GET_STATUS, \ >> + BUILD_HEADER, \ >> + PUSH_HEADER, POP_HEADER) \ >> NULL, \ >> netdev_vport_run, \ >> netdev_vport_wait, \ >> @@ -775,6 +919,9 @@ get_stats(const struct netdev *netdev, struct >> netdev_stats *stats) >> GET_CONFIG, \ >> SET_CONFIG, \ >> GET_TUNNEL_CONFIG, \ >> + BUILD_HEADER, \ >> + PUSH_HEADER, \ >> + POP_HEADER, \ >> NULL, /* get_numa_id */ \ >> NULL, /* set_multiq */ \ >> \ >> @@ -826,12 +973,14 @@ get_stats(const struct netdev *netdev, struct >> netdev_stats *stats) >> NULL, /* rx_wait */ \ >> NULL, /* rx_drain */ >> >> -#define TUNNEL_CLASS(NAME, DPIF_PORT) \ >> - { DPIF_PORT, \ >> - { NAME, VPORT_FUNCTIONS(get_tunnel_config, \ >> - set_tunnel_config, \ >> - get_netdev_tunnel_config, \ >> - tunnel_get_status) }} >> + >> +#define TUNNEL_CLASS(NAME, DPIF_PORT, BUILD_HEADER, PUSH_HEADER, >> POP_HEADER) \ >> + { DPIF_PORT, >> \ >> + { NAME, VPORT_FUNCTIONS(get_tunnel_config, >> \ >> + set_tunnel_config, >> \ >> + get_netdev_tunnel_config, >> \ >> + tunnel_get_status, >> \ >> + BUILD_HEADER, PUSH_HEADER, POP_HEADER) }} >> >> void >> netdev_vport_tunnel_register(void) >> @@ -839,13 +988,14 @@ netdev_vport_tunnel_register(void) >> /* The name of the dpif_port should be short enough to accomodate adding >> * a port number to the end if one is necessary. */ >> static const struct vport_class vport_classes[] = { >> - TUNNEL_CLASS("geneve", "genev_sys"), >> - TUNNEL_CLASS("gre", "gre_sys"), >> - TUNNEL_CLASS("ipsec_gre", "gre_sys"), >> - TUNNEL_CLASS("gre64", "gre64_sys"), >> - TUNNEL_CLASS("ipsec_gre64", "gre64_sys"), >> - TUNNEL_CLASS("vxlan", "vxlan_sys"), >> - TUNNEL_CLASS("lisp", "lisp_sys”) > > Add line for geneve here > right. >> + TUNNEL_CLASS("gre", "gre_sys", NULL, NULL, NULL), >> + TUNNEL_CLASS("ipsec_gre", "gre_sys", NULL, NULL, NULL), >> + TUNNEL_CLASS("gre64", "gre64_sys", NULL, NULL, NULL), >> + TUNNEL_CLASS("ipsec_gre64", "gre64_sys", NULL, NULL, NULL), >> + TUNNEL_CLASS("vxlan", "vxlan_sys", netdev_vxlan_build_header, >> + netdev_vxlan_push_header, >> + netdev_vxlan_pop_header), >> + TUNNEL_CLASS("lisp", "lisp_sys", NULL, NULL, NULL) >> }; >> static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER; >> >> @@ -867,6 +1017,6 @@ netdev_vport_patch_register(void) >> { "patch", VPORT_FUNCTIONS(get_patch_config, >> set_patch_config, >> NULL, >> - NULL) }}; >> + NULL, NULL, NULL, NULL) }}; >> netdev_register_provider(&patch_class.netdev_class); >> } >> diff --git a/lib/netdev.c b/lib/netdev.c >> index a9ad7d1..970fafc 100644 >> --- a/lib/netdev.c >> +++ b/lib/netdev.c >> @@ -724,6 +724,40 @@ netdev_send(struct netdev *netdev, int qid, struct >> dpif_packet **buffers, >> return error; >> } >> >> +int >> +netdev_pop_header(struct netdev *netdev, struct dpif_packet **buffers, int >> cnt) >> +{ >> + int error; >> + >> + error = (netdev->netdev_class->pop_header >> + ? netdev->netdev_class->pop_header(netdev, buffers, cnt) >> + : EOPNOTSUPP); >> + if (!error) { >> + COVERAGE_INC(netdev_sent); > > Is this coverage intentional? > nope, removed it. >> + } >> + return error; >> +} >> + >> +void >> +netdev_build_header(const struct netdev *netdev, struct ofpbuf *header) >> +{ >> + if (netdev->netdev_class->build_header) { >> + netdev->netdev_class->build_header(netdev, header); >> + } >> +} >> + >> +int >> +netdev_push_header(const struct netdev *netdev, >> + struct dpif_packet **buffers, int cnt, >> + const void *header, int size) >> +{ >> + if (netdev->netdev_class->push_header) { >> + return netdev->netdev_class->push_header(netdev, buffers, cnt, >> header, size); >> + } else { >> + return -EINVAL; >> + } >> +} >> + >> /* Registers with the poll loop to wake up from the next call to poll_block() >> * when the packet transmission queue has sufficient room to transmit a >> packet >> * with netdev_send(). >> diff --git a/lib/netdev.h b/lib/netdev.h >> index fc4180a..3af0665 100644 >> --- a/lib/netdev.h >> +++ b/lib/netdev.h >> @@ -180,6 +180,13 @@ int netdev_send(struct netdev *, int qid, struct >> dpif_packet **, int cnt, >> bool may_steal); >> void netdev_send_wait(struct netdev *, int qid); >> >> +void netdev_build_header(const struct netdev *, struct ofpbuf *); >> +int netdev_push_header(const struct netdev *netdev, >> + struct dpif_packet **buffers, int cnt, >> + const void *header, int size); >> +int netdev_pop_header(struct netdev *netdev, struct dpif_packet **buffers, >> + int cnt); >> + >> /* Hardware address. */ >> int netdev_set_etheraddr(struct netdev *, const uint8_t mac[6]); >> int netdev_get_etheraddr(const struct netdev *, uint8_t mac[6]); >> diff --git a/lib/odp-execute.c b/lib/odp-execute.c >> index b179cf2..861257c 100644 >> --- a/lib/odp-execute.c >> +++ b/lib/odp-execute.c >> @@ -412,6 +412,8 @@ odp_execute_actions(void *dp, struct dpif_packet >> **packets, int cnt, bool steal, >> switch ((enum ovs_action_attr) type) { >> /* These only make sense in the context of a datapath. */ >> case OVS_ACTION_ATTR_OUTPUT: >> + case OVS_ACTION_ATTR_TUNNEL_PUSH: >> + case OVS_ACTION_ATTR_TUNNEL_POP: >> case OVS_ACTION_ATTR_USERSPACE: >> case OVS_ACTION_ATTR_RECIRC: >> if (dp_execute_action) { >> diff --git a/lib/odp-util.c b/lib/odp-util.c >> index b9e1e21..20d6146 100644 >> --- a/lib/odp-util.c >> +++ b/lib/odp-util.c >> @@ -30,10 +30,12 @@ >> #include "dynamic-string.h" >> #include "flow.h" >> #include "netlink.h" >> +#include "vxlan.h" >> #include "ofpbuf.h" >> #include "packets.h" >> #include "simap.h" >> #include "timeval.h" >> +#include "unaligned.h" >> #include "util.h" >> #include "vlog.h" >> >> @@ -74,6 +76,8 @@ odp_action_len(uint16_t type) >> >> switch ((enum ovs_action_attr) type) { >> case OVS_ACTION_ATTR_OUTPUT: return sizeof(uint32_t); >> + case OVS_ACTION_ATTR_TUNNEL_PUSH: return -2; >> + case OVS_ACTION_ATTR_TUNNEL_POP: return sizeof(uint32_t); >> case OVS_ACTION_ATTR_USERSPACE: return -2; >> case OVS_ACTION_ATTR_PUSH_VLAN: return sizeof(struct >> ovs_action_push_vlan); >> case OVS_ACTION_ATTR_POP_VLAN: return 0; >> @@ -507,6 +511,86 @@ format_odp_hash_action(struct ds *ds, const struct >> ovs_action_hash *hash_act) >> } >> >> static void >> +format_odp_tunnel_vxlan_header(struct ds *ds, const struct nlattr *a) >> +{ >> + const struct eth_header *eth; >> + const struct ip_header *ip; >> + const struct udp_header *udp; >> + const struct vxlanhdr *vxh; >> + >> + /* TODO: handle other tunnel data. */ >> + eth = nl_attr_get(a); >> + ip = (const struct ip_header *) ((const char *)eth + sizeof(*eth)); > > Need to parse VLANs here? > I do not think tunnel header will have vlan set. >> + udp = (const struct udp_header *) ((const char *)ip + sizeof(*ip)); >> + vxh = (const struct vxlanhdr *) ((const char *)udp + sizeof(*udp)); >> + >> + /* Ehernet */ > > “Ethernet” > ok. >> + ds_put_format(ds, "header(eth(src="); >> + ds_put_format(ds, ETH_ADDR_FMT, ETH_ADDR_ARGS(eth->eth_src)); >> + ds_put_format(ds, ",dst="); >> + ds_put_format(ds, ETH_ADDR_FMT, ETH_ADDR_ARGS(eth->eth_dst)); >> + ds_put_format(ds, ",dl_type=0x%04"PRIx16"),", ntohs(eth->eth_type)); >> + >> + /* IPv4 */ >> + ds_put_format(ds, "ipv4(src="IP_FMT",dst="IP_FMT",proto=%"PRIu8 >> + ",tos=%#"PRIx8",ttl=%"PRIu8",frag=%"PRIx16"),", >> + IP_ARGS(get_16aligned_be32(&ip->ip_src)), >> + IP_ARGS(get_16aligned_be32(&ip->ip_dst)), >> + ip->ip_proto, ip->ip_tos, >> + ip->ip_ttl, >> + ip->ip_frag_off); >> + >> + /* UDP */ >> + ds_put_format(ds, "udp(src=%"PRIu16",dst=%"PRIu16"),", >> + ntohs(udp->udp_src), ntohs(udp->udp_dst)); >> + >> + /* VxLan */ >> + ds_put_format(ds, "vxlan(flags=%"PRIx32",vni=%"PRIx32")", >> + vxh->vx_flags, vxh->vx_vni); >> + >> + ds_put_format(ds, ")"); >> +} >> + >> +static void >> +format_odp_tunnel_push_action(struct ds *ds, const struct nlattr *attrs, >> + size_t len) >> +{ >> + enum ovs_tunnel_action_attr type; >> + const struct nlattr *a; >> + unsigned int left; >> + >> + ds_put_format(ds, "tunnel_push("); >> + NL_ATTR_FOR_EACH (a, left, attrs, len) { >> + if (a != attrs) { >> + ds_put_cstr(ds, ","); >> + } >> + >> + type = nl_attr_type(a); >> + switch(type) { >> + case OVS_ACTION_TUNNEL_HEADER_PORT: >> + ds_put_format(ds, "tnl_port(%"PRIu32, nl_attr_get_u32(a)); >> + ds_put_cstr(ds, ")"); >> + break; >> + >> + case OVS_ACTION_TUNNEL_HEADER_PHY_PORT: >> + ds_put_format(ds, "phy_port(%"PRIu32, nl_attr_get_u32(a)); >> + ds_put_cstr(ds, ")"); >> + break; >> + >> + case OVS_ACTION_TUNNEL_HEADER_DATA: >> + format_odp_tunnel_vxlan_header(ds, a); >> + break; >> + >> + default: >> + ds_put_cstr(ds, "<error>"); >> + break; >> + } >> + } >> + ds_put_format(ds, ")"); >> + >> +} >> + >> +static void >> format_odp_action(struct ds *ds, const struct nlattr *a) >> { >> int expected_len; >> @@ -526,6 +610,13 @@ format_odp_action(struct ds *ds, const struct nlattr *a) >> case OVS_ACTION_ATTR_OUTPUT: >> ds_put_format(ds, "%"PRIu32, nl_attr_get_u32(a)); >> break; >> + case OVS_ACTION_ATTR_TUNNEL_POP: >> + ds_put_format(ds, "tnl_pop(%"PRIu32, nl_attr_get_u32(a)); >> + ds_put_cstr(ds, ")"); >> + break; >> + case OVS_ACTION_ATTR_TUNNEL_PUSH: >> + format_odp_tunnel_push_action(ds, nl_attr_get(a), >> nl_attr_get_size(a)); >> + break; >> case OVS_ACTION_ATTR_USERSPACE: >> format_odp_userspace_action(ds, a); >> break; >> @@ -748,6 +839,16 @@ parse_odp_action(const char *s, const struct simap >> *port_names, >> uint32_t port; >> int n; >> >> + if (ovs_scan(s, "tnl_queue(%"SCNi32")%n", &port, &n)) { >> + nl_msg_put_u32(actions, OVS_ACTION_ATTR_OUTPUT, port); >> + return n; >> + } > > What is this? > it should be parsing tnl_pop action. I fixed it. >> + } >> + >> + { >> + uint32_t port; >> + int n; >> + >> if (ovs_scan(s, "%"SCNi32"%n", &port, &n)) { >> nl_msg_put_u32(actions, OVS_ACTION_ATTR_OUTPUT, port); >> return n; >> @@ -3543,6 +3644,23 @@ odp_put_tunnel_action(const struct flow_tnl *tunnel, >> tun_key_to_attr(odp_actions, tunnel); >> nl_msg_end_nested(odp_actions, offset); >> } >> + >> +void >> +odp_put_build_tunnel_header_action(struct ofpbuf *odp_actions, struct >> ofpbuf *header, >> + odp_port_t tunnel_port, odp_port_t >> phy_port) > > Should be called “odp_put_tunnel_push_action” instead? > ok. >> +{ >> + size_t offset = nl_msg_start_nested(odp_actions, >> OVS_ACTION_ATTR_TUNNEL_PUSH); >> + >> + nl_msg_put_odp_port(odp_actions, OVS_ACTION_TUNNEL_HEADER_PORT, >> + tunnel_port); >> + nl_msg_put_unspec(odp_actions, OVS_ACTION_TUNNEL_HEADER_DATA, >> + ofpbuf_data(header), ofpbuf_size(header)); >> + nl_msg_put_odp_port(odp_actions, OVS_ACTION_TUNNEL_HEADER_PHY_PORT, >> + phy_port); >> + >> + nl_msg_end_nested(odp_actions, offset); >> +} >> + >> >> /* The commit_odp_actions() function and its helpers. */ >> >> diff --git a/lib/odp-util.h b/lib/odp-util.h >> index 11b54dd..64e7dc3 100644 >> --- a/lib/odp-util.h >> +++ b/lib/odp-util.h >> @@ -253,5 +253,9 @@ size_t odp_put_userspace_action(uint32_t pid, >> struct ofpbuf *odp_actions); >> void odp_put_tunnel_action(const struct flow_tnl *tunnel, >> struct ofpbuf *odp_actions); >> +void odp_put_build_tunnel_header_action(struct ofpbuf *odp_actions, >> + struct ofpbuf *header, >> + odp_port_t tunnel_odp_port, >> + odp_port_t phy_port); >> >> #endif /* odp-util.h */ >> diff --git a/lib/ofpbuf.h b/lib/ofpbuf.h >> index adaf526..72c254f 100644 >> --- a/lib/ofpbuf.h >> +++ b/lib/ofpbuf.h >> @@ -422,6 +422,14 @@ static inline void ofpbuf_set_size(struct ofpbuf *b, >> uint32_t v) >> } >> #endif >> >> +static inline void ofpbuf_reset_packet(struct ofpbuf *b, int off) >> +{ >> + ofpbuf_set_size(b, ofpbuf_size(b) - off); >> + ofpbuf_set_data(b, (void *) ((unsigned char *) b->frame + off)); >> + b->frame = NULL; >> + b->l2_5_ofs = b->l3_ofs = b->l4_ofs = UINT16_MAX; >> +} >> + >> #ifdef __cplusplus >> } >> #endif >> diff --git a/lib/packets.c b/lib/packets.c >> index 65d8109..a0cfadc 100644 >> --- a/lib/packets.c >> +++ b/lib/packets.c >> @@ -956,3 +956,38 @@ packet_format_tcp_flags(struct ds *s, uint16_t >> tcp_flags) >> ds_put_cstr(s, "[800]"); >> } >> } >> + >> +#define ARP_PACKET_SIZE (2 + ETH_HEADER_LEN + VLAN_HEADER_LEN + \ >> + ARP_ETH_HEADER_LEN) >> + >> +void >> +compose_arp(struct ofpbuf *b, const uint8_t eth_src[ETH_ADDR_LEN], >> + ovs_be32 ip_src, ovs_be32 ip_dst) >> +{ >> + struct eth_header *eth; >> + struct arp_eth_header *arp; >> + >> + ofpbuf_clear(b); >> + ofpbuf_prealloc_tailroom(b, ARP_PACKET_SIZE); >> + ofpbuf_reserve(b, 2 + VLAN_HEADER_LEN); >> + >> + eth = ofpbuf_put_uninit(b, sizeof *eth); >> + memcpy(eth->eth_dst, eth_addr_broadcast, ETH_ADDR_LEN); >> + memcpy(eth->eth_src, eth_src, ETH_ADDR_LEN); >> + eth->eth_type = htons(ETH_TYPE_ARP); >> + >> + arp = ofpbuf_put_uninit(b, sizeof *arp); >> + arp->ar_hrd = htons(ARP_HRD_ETHERNET); >> + arp->ar_pro = htons(ARP_PRO_IP); >> + arp->ar_hln = sizeof arp->ar_sha; >> + arp->ar_pln = sizeof arp->ar_spa; >> + arp->ar_op = htons(ARP_OP_REQUEST); >> + memcpy(arp->ar_sha, eth_src, ETH_ADDR_LEN); >> + memcpy(arp->ar_tha, eth_src, ETH_ADDR_LEN); > > I think the tha should be set to all-zeroes instead. RFC 826 says: > > “It does not set ar$tha to anything in particular, because it is this value > that it is trying to determine.” > ok. >> + >> + put_16aligned_be32(&arp->ar_spa, ip_src); >> + put_16aligned_be32(&arp->ar_tpa, ip_dst); >> + >> + ofpbuf_set_frame(b, eth); >> + ofpbuf_set_l3(b, arp); >> +} >> diff --git a/lib/packets.h b/lib/packets.h >> index 26c6ff1..b24015e 100644 >> --- a/lib/packets.h >> +++ b/lib/packets.h >> @@ -486,6 +486,7 @@ void ip_format_masked(ovs_be32 ip, ovs_be32 mask, struct >> ds *); >> ((ip_frag_off) & htons(IP_MORE_FRAGMENTS | IP_FRAG_OFF_MASK)) >> >> #define IP_HEADER_LEN 20 >> +OVS_PACKED( >> struct ip_header { >> uint8_t ip_ihl_ver; >> uint8_t ip_tos; >> @@ -497,7 +498,8 @@ struct ip_header { >> ovs_be16 ip_csum; >> ovs_16aligned_be32 ip_src; >> ovs_16aligned_be32 ip_dst; >> -}; >> +}); >> + > > I’d rather not pack this, if possible. > ok. >> BUILD_ASSERT_DECL(IP_HEADER_LEN == sizeof(struct ip_header)); >> >> #define ICMP_HEADER_LEN 8 >> @@ -545,12 +547,13 @@ struct sctp_header { >> BUILD_ASSERT_DECL(SCTP_HEADER_LEN == sizeof(struct sctp_header)); >> >> #define UDP_HEADER_LEN 8 >> +OVS_PACKED( >> struct udp_header { >> ovs_be16 udp_src; >> ovs_be16 udp_dst; >> ovs_be16 udp_len; >> ovs_be16 udp_csum; >> -}; >> +}); > > Same here. > >> BUILD_ASSERT_DECL(UDP_HEADER_LEN == sizeof(struct udp_header)); >> >> #define TCP_FIN 0x001 >> @@ -735,5 +738,7 @@ void packet_set_sctp_port(struct ofpbuf *, ovs_be16 src, >> ovs_be16 dst); >> >> void packet_format_tcp_flags(struct ds *, uint16_t); >> const char *packet_tcp_flag_to_string(uint32_t flag); >> +void compose_arp(struct ofpbuf *b, const uint8_t eth_src[ETH_ADDR_LEN], >> + ovs_be32 ip_src, ovs_be32 ip_dst); >> >> #endif /* packets.h */ >> diff --git a/lib/tnl-arp-cache.c b/lib/tnl-arp-cache.c >> new file mode 100644 >> index 0000000..2327452 >> --- /dev/null >> +++ b/lib/tnl-arp-cache.c >> @@ -0,0 +1,214 @@ >> +/* >> + * Copyright (c) 2014 Nicira, Inc. >> + * >> + * Licensed under the Apache License, Version 2.0 (the "License"); >> + * you may not use this file except in compliance with the License. >> + * You may obtain a copy of the License at: >> + * >> + * http://www.apache.org/licenses/LICENSE-2.0 >> + * >> + * Unless required by applicable law or agreed to in writing, software >> + * distributed under the License is distributed on an "AS IS" BASIS, >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >> + * See the License for the specific language governing permissions and >> + * limitations under the License. >> + */ >> + >> +#include <config.h> >> +#include <inttypes.h> >> +#include <stdlib.h> >> + >> +#include "bitmap.h" >> +#include "coverage.h" >> +#include "dpif-netdev.h" >> +#include "dynamic-string.h" >> +#include "errno.h" >> +#include "flow.h" >> +#include "hash.h" >> +#include "hmap.h" >> +#include "list.h" >> +#include "netdev.h" >> +#include "ovs-thread.h" >> +#include "packets.h" >> +#include "packet-dpif.h" >> +#include "poll-loop.h" >> +#include "timeval.h" >> +#include "tnl-arp-cache.h" >> +#include "unaligned.h" >> +#include "unixctl.h" >> +#include "util.h" >> +#include "vlog.h" >> + >> +VLOG_DEFINE_THIS_MODULE(arp); >> + >> +#define ARP_ENTRY_DEFAULT_IDLE_TIME 30000 >> + >> +struct tnl_arp_entry { >> + struct hmap_node hmap_node; > > IMO it would be better to use cmap, that way the lookups could be lockless. > ok, I replaced it with cmap. >> + ovs_be32 ip; >> + uint8_t mac[ETH_ADDR_LEN]; >> + time_t expires; /* Expiration time. */ >> + char br_name[IFNAMSIZ]; >> +}; >> + >> +static struct hmap table = HMAP_INITIALIZER(&table); >> +static struct ovs_rwlock rwlock = OVS_RWLOCK_INITIALIZER; > > Would still need a mutex for excluding concurrent modifications. > ok. >> +static uint32_t secret; >> + >> +static uint32_t >> +tnl_arp_table_hash(ovs_be32 ip4) >> +{ >> + return hash_2words(ntohl(ip4), secret); >> +} >> + > > cmap is already internally rehashing the given hash, so it would be possible > to just give the dst address as the ‘hash’ to cmap functions. > ok. >> +static struct tnl_arp_entry * >> +__tnl_arp_lookup(const char br_name[IFNAMSIZ], ovs_be32 dst) >> +{ >> + struct tnl_arp_entry *arp; >> + uint32_t hash; >> + >> + hash = tnl_arp_table_hash(dst); >> + HMAP_FOR_EACH_WITH_HASH (arp, hmap_node, hash, &table) { >> + if (arp->ip == dst && !strncmp(arp->br_name, br_name, IFNAMSIZ)) { >> + return arp; >> + } >> + } >> + return NULL; >> +} >> + >> +int >> +tnl_arp_lookup(const char br_name[IFNAMSIZ], ovs_be32 dst, uint8_t >> mac[ETH_ADDR_LEN]) >> +{ >> + struct tnl_arp_entry *arp; >> + int res = -ENOENT; >> + >> + ovs_rwlock_rdlock(&rwlock); >> + VLOG_ERR("pbs: %s port %s\n",__func__,br_name); > > Logging should be conditional or removed. > removed it. >> + arp = __tnl_arp_lookup(br_name, dst); >> + if (arp) { >> + memcpy(mac, arp->mac, ETH_ADDR_LEN); >> + res = 0; >> + } >> + ovs_rwlock_unlock(&rwlock); >> + >> + return res; >> +} >> + >> +static void >> +tnl_arp_delete(struct tnl_arp_entry *arp) >> +{ >> + hmap_remove(&table, &arp->hmap_node); >> + free(arp); >> +} >> + >> +int >> +tnl_arp_snoop(const struct flow *flow, struct flow_wildcards *wc, >> + const char name[IFNAMSIZ]) >> +{ >> + struct tnl_arp_entry *arp; >> + uint32_t hash; >> + >> + if (flow->dl_type != htons(ETH_TYPE_ARP)) { >> + return -EINVAL; >> + } >> + >> + memset(&wc->masks.dl_type, 0xff, sizeof wc->masks.dl_type); >> + memset(&wc->masks.nw_proto, 0xff, sizeof wc->masks.nw_proto); >> + memset(&wc->masks.nw_src, 0xff, sizeof wc->masks.nw_src); >> + memset(&wc->masks.nw_dst, 0xff, sizeof wc->masks.nw_dst); >> + memset(&wc->masks.arp_sha, 0xff, sizeof wc->masks.arp_sha); >> + memset(&wc->masks.arp_tha, 0xff, sizeof wc->masks.arp_tha); >> + >> + ovs_rwlock_wrlock(&rwlock); >> + arp = __tnl_arp_lookup(name, flow->nw_src); >> + if (arp) { >> + if (!memcmp(arp->mac, flow->arp_sha, ETH_ADDR_LEN)) { >> + arp->expires = time_now() + ARP_ENTRY_DEFAULT_IDLE_TIME; >> + ovs_rwlock_unlock(&rwlock); >> + return 0; >> + } >> + tnl_arp_delete(arp); >> + } >> + >> + arp = xmalloc(sizeof *arp); >> + >> + arp->ip = flow->nw_src; >> + memcpy(arp->mac, flow->arp_sha, ETH_ADDR_LEN); >> + arp->expires = time_now() + ARP_ENTRY_DEFAULT_IDLE_TIME; >> + strncpy(arp->br_name, name, IFNAMSIZ); >> + hash = tnl_arp_table_hash(arp->ip); >> + hmap_insert(&table, &arp->hmap_node, hash); >> + ovs_rwlock_unlock(&rwlock); >> + return 0; >> +} >> + >> +void >> +tnl_arp_cache_run(void) >> +{ >> + struct tnl_arp_entry *arp; >> + >> + ovs_rwlock_wrlock(&rwlock); >> + HMAP_FOR_EACH(arp, hmap_node, &table) { >> + if (arp->expires <= time_now()) { >> + tnl_arp_delete(arp); >> + } >> + } >> + ovs_rwlock_unlock(&rwlock); >> +} >> + >> +static void >> +tnl_arp_cache_flush(struct unixctl_conn *conn OVS_UNUSED, int argc >> OVS_UNUSED, >> + const char *argv[] OVS_UNUSED, void *aux OVS_UNUSED) >> +{ >> + struct tnl_arp_entry *arp, *temp; >> + >> + ovs_rwlock_wrlock(&rwlock); >> + HMAP_FOR_EACH_SAFE(arp, temp, hmap_node, &table) { >> + tnl_arp_delete(arp); >> + } >> + ovs_rwlock_unlock(&rwlock); >> + unixctl_command_reply(conn, "OK"); >> +} >> + >> +#define MAX_IP_ADDR_LEN 16 >> + >> +static void >> +tnl_arp_cache_show(struct unixctl_conn *conn, int argc OVS_UNUSED, >> + const char *argv[] OVS_UNUSED, void *aux OVS_UNUSED) >> +{ >> + struct ds ds = DS_EMPTY_INITIALIZER; >> + struct tnl_arp_entry *arp; >> + >> + ds_put_cstr(&ds, "IP MAC Bridge\n"); >> + ds_put_cstr(&ds, "=============================================\n"); >> + ovs_rwlock_rdlock(&rwlock); >> + HMAP_FOR_EACH(arp, hmap_node, &table) { >> + int start_len, need_ws; >> + >> + start_len = ds.length; >> + ds_put_format(&ds, IP_FMT, IP_ARGS(arp->ip)); >> + >> + need_ws = MAX_IP_ADDR_LEN - (ds.length - start_len); >> + >> + while (need_ws >= 0) { >> + ds_put_format(&ds, " "); >> + need_ws--; >> + } >> + >> + ds_put_format(&ds, ETH_ADDR_FMT" %s\n", >> + ETH_ADDR_ARGS(arp->mac), arp->br_name); >> + >> + } >> + ovs_rwlock_unlock(&rwlock); >> + unixctl_command_reply(conn, ds_cstr(&ds)); >> + ds_destroy(&ds); >> +} >> + >> +void >> +tnl_arp_cache_init(void) >> +{ >> + /* XXX: Add documentation for these commands. */ >> + unixctl_command_register("tnl/arp/show", "", 0, 0, tnl_arp_cache_show, >> NULL); >> + unixctl_command_register("tnl/arp/flush", "", 0, 0, >> tnl_arp_cache_flush, NULL); >> + secret = random_uint32(); >> +} >> diff --git a/lib/tnl-arp-cache.h b/lib/tnl-arp-cache.h >> new file mode 100644 >> index 0000000..f68c3b4 >> --- /dev/null >> +++ b/lib/tnl-arp-cache.h >> @@ -0,0 +1,47 @@ >> +/* >> + * Copyright (c) 2014 Nicira, Inc. >> + * >> + * Licensed under the Apache License, Version 2.0 (the "License"); >> + * you may not use this file except in compliance with the License. >> + * You may obtain a copy of the License at: >> + * >> + * http://www.apache.org/licenses/LICENSE-2.0 >> + * >> + * Unless required by applicable law or agreed to in writing, software >> + * distributed under the License is distributed on an "AS IS" BASIS, >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >> + * See the License for the specific language governing permissions and >> + * limitations under the License. >> + */ >> + >> +#ifndef OVS_TNL_ARP_CACHE_H >> +#define OVS_TNL_ARP_CACHE_H 1 >> + >> +#include <errno.h> >> +#include <inttypes.h> >> +#include <stddef.h> >> +#include <stdint.h> >> +#include <string.h> >> +#include <net/if.h> >> +#include <sys/socket.h> >> + >> +#include "flow.h" >> +#include "netdev.h" >> +#include "packets.h" >> +#include "util.h" >> + >> +#ifdef __cplusplus >> +extern "C" { >> +#endif >> + >> +int tnl_arp_snoop(const struct flow *flow, struct flow_wildcards *wc, >> + const char dev_name[]); >> +int tnl_arp_lookup(const char dev_name[], ovs_be32 dst, uint8_t >> mac[ETH_ADDR_LEN]); >> +void tnl_arp_cache_init(void); >> +void tnl_arp_cache_run(void); >> + >> +#ifdef __cplusplus >> +} >> +#endif >> + >> +#endif >> diff --git a/lib/tnl-ports.c b/lib/tnl-ports.c >> new file mode 100644 >> index 0000000..4dbda7d >> --- /dev/null >> +++ b/lib/tnl-ports.c >> @@ -0,0 +1,203 @@ >> +/* >> + * Copyright (c) 2014 Nicira, Inc. >> + * >> + * Licensed under the Apache License, Version 2.0 (the "License"); >> + * you may not use this file except in compliance with the License. >> + * You may obtain a copy of the License at: >> + * >> + * http://www.apache.org/licenses/LICENSE-2.0 >> + * >> + * Unless required by applicable law or agreed to in writing, software >> + * distributed under the License is distributed on an "AS IS" BASIS, >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >> + * See the License for the specific language governing permissions and >> + * limitations under the License. >> + */ >> + >> +#include <config.h> >> +#include <stddef.h> >> +#include <stdint.h> >> + >> +#include "bitmap.h" >> +#include "classifier.h" >> +#include "coverage.h" >> +#include "dpif-netdev.h" >> +#include "fat-rwlock.h" >> +#include "dynamic-string.h" >> +#include "errno.h" >> +#include "flow.h" >> +#include "hash.h" >> +#include "list.h" >> +#include "netdev.h" >> +#include "ovs-thread.h" >> +#include "odp-util.h" >> +#include "packet-dpif.h" >> +#include "packets.h" >> +#include "poll-loop.h" >> +#include "timeval.h" >> +#include "tnl-arp-cache.h" >> +#include "tnl-ports.h" >> +#include "tnl-ports.h" >> +#include "unaligned.h" >> +#include "unixctl.h" >> +#include "util.h" >> + >> +static struct classifier cls; >> +static struct ovs_mutex tunnel_mutex = OVS_MUTEX_INITIALIZER; >> + >> +struct tunnel_ports { >> + struct cls_rule cr; >> + odp_port_t portno; >> + struct ovs_refcount ref_cnt; >> + char dev_name[IFNAMSIZ]; >> +}; > > Why plural name for the struct? > fixed. >> + >> +static struct tunnel_ports * >> +tnl_port_cast(const struct cls_rule *cr) >> +{ > > if (offsetof(struct tunnel_ports, cr) == 0) { > return (struct tunnel_ports *)cr; > } > >> + return cr ? CONTAINER_OF(cr, struct tunnel_ports, cr) : NULL; >> +} >> + >> +void >> +tnl_port_insert_udp(odp_port_t port, ovs_be32 ip_dst, ovs_be16 udp_port, >> + const char dev_name[]) >> +{ >> + const struct cls_rule *cr; >> + struct tunnel_ports *p; >> + struct flow_wildcards wc; >> + struct match match; >> + struct flow flow; >> + >> + memset(&flow, 0, sizeof flow); >> + >> + flow.dl_type = htons(ETH_TYPE_IP); >> + flow.nw_proto = IPPROTO_UDP; >> + flow.tp_dst = udp_port; >> + flow.nw_src = ip_dst; >> + > > Please document the weirdness of setting flow.nw_src to ip_dst. > ok. >> + cr = classifier_lookup(&cls, &flow, NULL); >> + if (cr) { >> + struct tunnel_ports *p; >> + >> + p = tnl_port_cast(cr); >> + >> + ovs_refcount_ref(&p->ref_cnt); >> + return; >> + } > > > You should loop with ovs_refcount_try_ref_rcu() instead, if it is possible > that a parallel thread is removing the same entry at the same time, like this: > ok. > do { > cr = classifier_lookup(&cls, &flow, NULL); > > p = tnl_port_cast(cr); > > /* Try again if the rule was released before we get the reference. */ > } while (p && !ovs_refcount_try_ref_rcu(&p->ref_cnt)); > > if (p) { > return; > } > > Otherwise you could end up calling ovs_refcount_ref() when the count has > already reached zero, which will assert fail. > ok. >> + >> + p = xzalloc(sizeof *p); >> + p->portno = port; >> + >> + memset(&wc, 0, sizeof(wc)); >> + wc.masks.dl_type = 0xffff; > > need htons() here. > >> + wc.masks.nw_proto = 0xff; >> + wc.masks.nw_frag = 0xff; >> + wc.masks.tp_dst = 0xffff; > > and here. > >> + wc.masks.nw_src = 0xffffffff; > > and htonl() here. > >> + >> + /* Need to check IP address. */ >> + match_init(&match, &flow, &wc); > > It would be a bit more efficient to use 'match.flow' and 'match.wc' directly > instead of ‘flow' and ‘wc’, and then calling match_init(). > >> + cls_rule_init(&p->cr, &match, 1); >> + ovs_refcount_init(&p->ref_cnt); >> + strncpy(p->dev_name, dev_name, IFNAMSIZ); >> + >> + ovs_mutex_lock(&tunnel_mutex); >> + classifier_insert(&cls, &p->cr); >> + ovs_mutex_unlock(&tunnel_mutex); > > Classifier has it’s own internal mutex that serializes mutations, so the > locking here is not necessary. > ok. >> +} >> + >> +static void >> +tnl_port_unref(struct cls_rule *cr) >> +{ >> + if (cr) { >> + struct tunnel_ports *p; >> + >> + p = tnl_port_cast(cr); >> + if (ovs_refcount_unref_relaxed(&p->ref_cnt) == 1) { >> + classifier_remove(&cls, cr); >> + } >> + } >> +} >> + >> +void >> +tnl_port_delete(ovs_be32 ip_dst, ovs_be16 udp_port) >> +{ >> + struct cls_rule *cr; >> + struct flow flow; >> + >> + memset(&flow, 0, sizeof flow); >> + flow.dl_type = htons(ETH_TYPE_IP); >> + flow.nw_proto = IPPROTO_UDP; >> + flow.tp_dst = udp_port; >> + flow.nw_src = ip_dst; >> + >> + ovs_mutex_lock(&tunnel_mutex); >> + cr = classifier_lookup(&cls, &flow, NULL); >> + tnl_port_unref(cr); >> + ovs_mutex_unlock(&tunnel_mutex); > > The refcount already protects against multiple deletes, so the mutex > protection here is not necessary, > ok. >> +} >> + >> +odp_port_t >> +tnl_port_lookup(const struct flow *flow, struct flow_wildcards *wc) >> +{ >> + const struct cls_rule *cr; >> + const struct tunnel_ports *p; >> + >> + /* Un wildcard protocol and stuff */ >> + cr = classifier_lookup(&cls, flow, wc); >> + if (!cr) { >> + return ODPP_NONE; >> + } >> + p = tnl_port_cast(cr); >> + return p->portno; >> +} >> + >> +static void >> +tnl_port_show(struct unixctl_conn *conn, int argc OVS_UNUSED, >> + const char *argv[] OVS_UNUSED, void *aux OVS_UNUSED) >> +{ >> + struct ds ds = DS_EMPTY_INITIALIZER; >> + const struct tunnel_ports *p; >> + >> + ovs_mutex_lock(&tunnel_mutex); > > Classifier internally prevents mutations during iteration, so the mutex here > is not necessary. > ok. >> + CLS_FOR_EACH(p, cr, &cls) { >> + struct odputil_keybuf keybuf; >> + struct odputil_keybuf maskbuf; >> + struct flow flow; >> + const struct nlattr *key, *mask; >> + size_t key_len, mask_len; >> + struct flow_wildcards wc; >> + struct ofpbuf buf; >> + >> + ds_put_format(&ds, "%s: ", p->dev_name); >> + minimask_expand(&p->cr.match.mask, &wc); >> + miniflow_expand(&p->cr.match.flow, &flow); >> + >> + /* Key. */ >> + ofpbuf_use_stack(&buf, &keybuf, sizeof keybuf); >> + odp_flow_key_from_flow(&buf, &flow, &wc.masks, >> + flow.in_port.odp_port, true); >> + key = ofpbuf_data(&buf); >> + key_len = ofpbuf_size(&buf); >> + /* mask*/ >> + ofpbuf_use_stack(&buf, &maskbuf, sizeof maskbuf); >> + odp_flow_key_from_mask(&buf, &wc.masks, &flow, >> + odp_to_u32(wc.masks.in_port.odp_port), >> + SIZE_MAX, false); >> + mask = ofpbuf_data(&buf); >> + mask_len = ofpbuf_size(&buf); >> + >> + /* build string. */ >> + odp_flow_format(key, key_len, mask, mask_len, NULL, &ds, false); >> + } >> + ovs_mutex_unlock(&tunnel_mutex); >> + unixctl_command_reply(conn, ds_cstr(&ds)); >> + ds_destroy(&ds); >> +} >> + >> +void >> +tnl_port_init(void) >> +{ >> + classifier_init(&cls, flow_segment_u32s); >> + unixctl_command_register("tnl/ports/show", "", 0, 0, tnl_port_show, >> NULL); >> +} >> diff --git a/lib/tnl-ports.h b/lib/tnl-ports.h >> new file mode 100644 >> index 0000000..73355e3 >> --- /dev/null >> +++ b/lib/tnl-ports.h >> @@ -0,0 +1,49 @@ >> +/* >> + * Copyright (c) 2014 Nicira, Inc. >> + * >> + * Licensed under the Apache License, Version 2.0 (the "License"); >> + * you may not use this file except in compliance with the License. >> + * You may obtain a copy of the License at: >> + * >> + * http://www.apache.org/licenses/LICENSE-2.0 >> + * >> + * Unless required by applicable law or agreed to in writing, software >> + * distributed under the License is distributed on an "AS IS" BASIS, >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >> + * See the License for the specific language governing permissions and >> + * limitations under the License. >> + */ >> + >> +#ifndef OVS_TNL_PORT_H >> +#define OVS_TNL_PORT_H 1 >> + >> +#include <errno.h> >> +#include <stddef.h> >> +#include <stdint.h> >> +#include <inttypes.h> >> +#include <net/if.h> >> +#include <sys/socket.h> >> + >> +#include "flow.h" >> +#include "packets.h" >> +#include "util.h" >> + >> +#ifdef __cplusplus >> +extern "C" { >> +#endif >> + >> +odp_port_t tnl_port_lookup(const struct flow *flow, >> + struct flow_wildcards *wc); >> + >> +void tnl_port_insert_udp(odp_port_t port, ovs_be32 ip_dst, ovs_be16 >> udp_port, >> + const char dev_name[]); >> + >> +void tnl_port_delete(ovs_be32 ip_dst OVS_UNUSED, ovs_be16 udp_port >> OVS_UNUSED); >> + >> +void tnl_port_init(void); >> + >> +#ifdef __cplusplus >> +} >> +#endif >> + >> +#endif >> diff --git a/lib/tnl-router.c b/lib/tnl-router.c >> new file mode 100644 >> index 0000000..4397509 >> --- /dev/null >> +++ b/lib/tnl-router.c >> @@ -0,0 +1,237 @@ >> +/* >> + * Copyright (c) 2014 Nicira, Inc. >> + * >> + * Licensed under the Apache License, Version 2.0 (the "License"); >> + * you may not use this file except in compliance with the License. >> + * You may obtain a copy of the License at: >> + * >> + * http://www.apache.org/licenses/LICENSE-2.0 >> + * >> + * Unless required by applicable law or agreed to in writing, software >> + * distributed under the License is distributed on an "AS IS" BASIS, >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >> + * See the License for the specific language governing permissions and >> + * limitations under the License. >> + */ >> + >> +#include <config.h> >> +#include <arpa/inet.h> >> +#include <errno.h> >> +#include <inttypes.h> >> +#include <sys/socket.h> >> +#include <net/if.h> >> +#include <netinet/in.h> >> +#include <stdarg.h> >> +#include <stdlib.h> >> +#include <string.h> >> +#include <unistd.h> >> + >> +#include "classifier.h" >> +#include "command-line.h" >> +#include "compiler.h" >> +#include "dirs.h" >> +#include "dpif.h" >> +#include "dynamic-string.h" >> +#include "fat-rwlock.h" >> +#include "flow.h" >> +#include "match.h" >> +#include "netdev.h" >> +#include "netlink.h" >> +#include "odp-util.h" >> +#include "ofp-parse.h” > > Is this (and may other) includes necessary? Seems like a long list of > includes for a simple source file! > ok, fixed. >> +#include "ofpbuf.h" >> +#include "ovs-thread.h" >> +#include "packets.h" >> +#include "shash.h" >> +#include "simap.h" >> +#include "smap.h" >> +#include "sset.h" >> +#include "timeval.h" >> +#include "tnl-router.h" >> +#include "tnl-ports.h" >> +#include "unixctl.h" >> +#include "util.h" >> +#include "list.h" >> +#include "util.h" >> + >> +static struct classifier cls; >> +static struct ovs_mutex mutex = OVS_MUTEX_INITIALIZER; >> +static int version, c_ver; >> + >> +struct tnl_route_entry { >> + struct cls_rule cr; >> + char output_bridge[IFNAMSIZ]; >> + ovs_be32 gw; >> + ovs_be32 nw_addr; >> + ovs_be32 mask; >> +}; >> + >> +bool >> +tnl_route_reconfigured(void) >> +{ >> + if (version == c_ver) { >> + return false; >> + } >> + /* version is not commited, send refconfigure event. */ >> + c_ver++; >> + return true; >> +} >> + >> +static struct tnl_route_entry * >> +tnl_route_entry_cast(const struct cls_rule *cr) >> +{ >> + return cr ? CONTAINER_OF(cr, struct tnl_route_entry, cr) : NULL; >> +} >> + >> +int >> +tnl_route_lookup(ovs_be32 ip_dst, char output_bridge[], ovs_be32 *gw) >> +{ >> + const struct cls_rule *cr; >> + struct flow flow; >> + >> + memset(&flow, 0, sizeof(flow)); >> + flow.nw_dst = ip_dst; >> + >> + cr = classifier_lookup(&cls, &flow, NULL); >> + if (cr) { >> + struct tnl_route_entry *p; >> + >> + p = tnl_route_entry_cast(cr); >> + strncpy(output_bridge, p->output_bridge, IFNAMSIZ); >> + *gw = p->gw; >> + return 0; >> + } >> + return -ENOENT; >> +} >> + >> +bool >> +tnl_route_check_dev(const char output_bridge[]) >> +{ >> + struct tnl_route_entry *rt; >> + bool res = false; >> + >> + ovs_mutex_lock(&mutex); >> + > > Locking here is not necessary. > ok, you patch is fixing it. >> + CLS_FOR_EACH(rt, cr, &cls) { >> + if (!strncmp(output_bridge, rt->output_bridge, IFNAMSIZ)) { >> + res = true; >> + } >> + } >> + ovs_mutex_unlock(&mutex); >> + return res; >> +} >> + >> +static void >> +rt_entry_insert(ovs_be32 ip_dst, ovs_be32 mask, >> + const char output_bridge[], ovs_be32 gw) >> +{ >> + const struct cls_rule *cr; >> + struct tnl_route_entry *p; >> + struct flow_wildcards wc; >> + struct flow flow; >> + struct match match; >> + >> + memset(&flow, 0, sizeof(flow)); >> + flow.nw_dst = ip_dst; >> + >> + cr = classifier_lookup(&cls, &flow, NULL); >> + if (cr) { >> + return; >> + } >> + >> + p = xzalloc(sizeof *p); >> + strncpy(p->output_bridge, output_bridge, IFNAMSIZ); >> + p->gw = gw; >> + p->nw_addr = ip_dst; >> + p->mask = mask; >> + >> + memset(&wc, 0, sizeof(wc)); >> + wc.masks.nw_dst = mask; >> + /* Need to check IP address. */ >> + match_init(&match, &flow, &wc); >> + cls_rule_init(&p->cr, &match, 1); >> + >> + ovs_mutex_lock(&mutex); >> + classifier_insert(&cls, &p->cr); >> + version++; >> + ovs_mutex_unlock(&mutex); > > if version was an atomic variable, then the locking would not be necessary. > I removed this part from the patch. >> +} >> + >> +static void >> +tnl_route_add(struct unixctl_conn *conn, int argc, >> + const char *argv[], void *aux OVS_UNUSED) >> +{ >> + ovs_be32 ip, mask, gw; >> + >> + inet_aton(argv[1], (struct in_addr *)&ip); >> + inet_aton(argv[2], (struct in_addr *)&mask); >> + >> + if (argc == 5) { >> + inet_aton(argv[4], (struct in_addr *)&gw); >> + } else { >> + gw = 0; >> + } >> + >> + rt_entry_insert(ip, mask, argv[3], gw); >> + unixctl_command_reply(conn, "OK"); >> +} >> + >> +static void >> +tnl_route_del(struct unixctl_conn *conn, int argc OVS_UNUSED, >> + const char *argv[], void *aux OVS_UNUSED) >> +{ >> + struct cls_rule *cr; >> + struct flow flow; >> + ovs_be32 ip, mask; >> + >> + inet_aton(argv[1], (struct in_addr *)&ip); >> + inet_aton(argv[2], (struct in_addr *)&mask); >> + >> + ovs_mutex_lock(&mutex); >> + memset(&flow, 0, sizeof(flow)); >> + flow.nw_dst = ip & mask; >> + >> + cr = classifier_lookup(&cls, &flow, NULL); >> + if (cr) { >> + classifier_remove(&cls, cr); >> + version++; >> + unixctl_command_reply(conn, "OK"); >> + } else { >> + unixctl_command_reply(conn, "Not found"); >> + } >> + >> + ovs_mutex_unlock(&mutex); > > This is a bit of a bummer, but locking here is necessary to make sure that > the ‘cr’ still exists in the classifier when classifier_remove is called. We > should add a new function to take care of this (like > classifier_remove_flow(&cls, &flow, priority) finds the exact matching rule > and removes it if found.) > ok. > >> +} >> + >> +static void >> +tnl_route_show(struct unixctl_conn *conn, int argc OVS_UNUSED, >> + const char *argv[] OVS_UNUSED, void *aux OVS_UNUSED) >> +{ >> + struct tnl_route_entry *rt; >> + struct ds ds = DS_EMPTY_INITIALIZER; >> + >> + ovs_mutex_lock(&mutex); >> + > > Locking is implicitly done for the duration of the iteration. > ok. >> + CLS_FOR_EACH(rt, cr, &cls) { >> + ds_put_format(&ds, "IP "IP_FMT" Mask "IP_FMT" dev %s", >> + IP_ARGS(rt->nw_addr), IP_ARGS(rt->mask), >> rt->output_bridge); >> + if (rt->gw) { >> + ds_put_format(&ds, " GW "IP_FMT, IP_ARGS(rt->gw)); >> + } >> + ds_put_format(&ds, "\n"); >> + } >> + unixctl_command_reply(conn, ds_cstr(&ds)); >> + ds_destroy(&ds); >> + >> + ovs_mutex_unlock(&mutex); >> +} >> + >> +void >> +tnl_route_unixctl_register(void) >> +{ >> + /* XXX: Add documentation for these commands. */ >> + unixctl_command_register("tnl/route/add", "ip mask dev gw", 3, 4, >> tnl_route_add, NULL); >> + unixctl_command_register("tnl/route/show", "", 0, 0, tnl_route_show, >> NULL); >> + unixctl_command_register("tnl/route/del", "ip mask", 2, 2, >> tnl_route_del, NULL); >> + classifier_init(&cls, NULL); > > Just to be on the safe side, I’d init classifier first, then register the > commands. > >> +} >> diff --git a/lib/tnl-router.h b/lib/tnl-router.h >> new file mode 100644 >> index 0000000..f780121 >> --- /dev/null >> +++ b/lib/tnl-router.h >> @@ -0,0 +1,42 @@ >> +/* >> + * Copyright (c) 2014 Nicira, Inc. >> + * >> + * Licensed under the Apache License, Version 2.0 (the "License"); >> + * you may not use this file except in compliance with the License. >> + * You may obtain a copy of the License at: >> + * >> + * http://www.apache.org/licenses/LICENSE-2.0 >> + * >> + * Unless required by applicable law or agreed to in writing, software >> + * distributed under the License is distributed on an "AS IS" BASIS, >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >> + * See the License for the specific language governing permissions and >> + * limitations under the License. >> + */ >> + >> +#ifndef OVS_TNL_ROUTER_H >> +#define OVS_TNL_ROUTER_H 1 >> + >> +#include <stddef.h> >> +#include <stdint.h> >> +#include <net/if.h> >> + >> +#include "packets.h" >> +#include "timeval.h" >> +#include "unixctl.h" >> +#include "util.h" >> + >> +#ifdef __cplusplus >> +extern "C" { >> +#endif >> + >> +int tnl_route_lookup(ovs_be32 ip_dst, char out_dev[], ovs_be32 *gw); >> +void tnl_route_unixctl_register(void); >> +bool tnl_route_reconfigured(void); >> +bool tnl_route_check_dev(const char out_dev[]); >> + >> +#ifdef __cplusplus >> +} >> +#endif >> + >> +#endif >> diff --git a/lib/vlog.c b/lib/vlog.c >> index 42e4869..69b40b1 100644 >> --- a/lib/vlog.c >> +++ b/lib/vlog.c >> @@ -935,6 +935,7 @@ vlog(const struct vlog_module *module, enum vlog_level >> level, >> va_start(args, message); >> vlog_valist(module, level, message, args); >> va_end(args); >> + fflush(NULL); > > Have you considered the cost of this? > this was for debugging, I removed it. >> } >> >> /* Logs 'message' to 'module' at maximum verbosity, then exits with a failure >> diff --git a/lib/vxlan.h b/lib/vxlan.h >> new file mode 100644 >> index 0000000..0ae5a84 >> --- /dev/null >> +++ b/lib/vxlan.h >> @@ -0,0 +1,46 @@ >> +/* >> + * Copyright (c) 2014 Nicira, Inc. >> + * >> + * Licensed under the Apache License, Version 2.0 (the "License"); >> + * you may not use this file except in compliance with the License. >> + * You may obtain a copy of the License at: >> + * >> + * http://www.apache.org/licenses/LICENSE-2.0 >> + * >> + * Unless required by applicable law or agreed to in writing, software >> + * distributed under the License is distributed on an "AS IS" BASIS, >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >> + * See the License for the specific language governing permissions and >> + * limitations under the License. >> + */ >> + >> +#ifndef NETDEV_VXLAN_H >> +#define NETDEV_VXLAN_H 1 >> + >> +#include <stddef.h> >> +#include <stdint.h> >> +#include "list.h" >> +#include "packets.h" >> +#include "util.h" >> +#include "netdev-dpdk.h" >> + >> +#ifdef __cplusplus >> +extern "C" { >> +#endif >> + >> +#define VXLAN_HLEN (sizeof(struct udphdr) + sizeof(struct vxlanhdr)) >> + >> +#define VXLAN_FLAGS 0x08000000 /* struct vxlanhdr.vx_flags required value. >> */ >> + >> +/* VXLAN protocol header */ >> +OVS_PACKED( >> +struct vxlanhdr { >> + ovs_be32 vx_flags; >> + ovs_be32 vx_vni; >> +}); > > Does this really need to be packed? > I got compiler warnings, I will try to remove them. >> + >> +#ifdef __cplusplus >> +} >> +#endif >> + >> +#endif >> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c >> index 880824d..9d41c30 100644 >> --- a/ofproto/ofproto-dpif-xlate.c >> +++ b/ofproto/ofproto-dpif-xlate.c >> @@ -17,7 +17,12 @@ >> #include "ofproto/ofproto-dpif-xlate.h" >> >> #include <errno.h> >> +#include <arpa/inet.h> >> +#include <net/if.h> >> +#include <sys/socket.h> >> +#include <netinet/in.h> >> >> +#include "tnl-arp-cache.h" >> #include "bfd.h" >> #include "bitmap.h" >> #include "bond.h" >> @@ -48,7 +53,9 @@ >> #include "ofproto/ofproto-dpif.h" >> #include "ofproto/ofproto-provider.h" >> #include "packet-dpif.h" >> +#include "tnl-router.h" >> #include "tunnel.h" >> +#include "tnl-ports.h" >> #include "vlog.h" >> >> COVERAGE_DEFINE(xlate_actions); >> @@ -112,6 +119,11 @@ struct xbridge { >> /* True if the datapath supports masked data in OVS_ACTION_ATTR_SET >> * actions. */ >> bool masked_set_action; >> + >> + /* True if tunnel route configure on this bridge. */\ > > “is configured” > ok. >> + bool has_tunnel_port; >> + >> + bool enable_userspace_tunneling; >> }; >> >> struct xbundle { >> @@ -374,7 +386,8 @@ static void xlate_xbridge_set(struct xbridge *, struct >> dpif *, >> bool enable_recirc, >> bool variable_length_userdata, >> size_t max_mpls_depth, >> - bool masked_set_action); >> + bool masked_set_action, >> + bool enable_userspace_tunneling); >> static void xlate_xbundle_set(struct xbundle *xbundle, >> enum port_vlan_mode vlan_mode, int vlan, >> unsigned long *trunks, bool use_priority_tags, >> @@ -440,7 +453,8 @@ xlate_xbridge_set(struct xbridge *xbridge, >> bool enable_recirc, >> bool variable_length_userdata, >> size_t max_mpls_depth, >> - bool masked_set_action) >> + bool masked_set_action, >> + bool enable_userspace_tunneling) >> { >> if (xbridge->ml != ml) { >> mac_learning_unref(xbridge->ml); >> @@ -492,6 +506,9 @@ xlate_xbridge_set(struct xbridge *xbridge, >> xbridge->variable_length_userdata = variable_length_userdata; >> xbridge->max_mpls_depth = max_mpls_depth; >> xbridge->masked_set_action = masked_set_action; >> + xbridge->has_tunnel_port = enable_userspace_tunneling && >> + tnl_route_check_dev(xbridge->name); >> + xbridge->enable_userspace_tunneling = enable_userspace_tunneling; >> } >> >> static void >> @@ -574,7 +591,8 @@ xlate_xbridge_copy(struct xbridge *xbridge) >> xbridge->frag, xbridge->forward_bpdu, >> xbridge->has_in_band, xbridge->enable_recirc, >> xbridge->variable_length_userdata, >> - xbridge->max_mpls_depth, xbridge->masked_set_action); >> + xbridge->max_mpls_depth, xbridge->masked_set_action, >> + xbridge->enable_userspace_tunneling); >> LIST_FOR_EACH (xbundle, list_node, &xbridge->xbundles) { >> xlate_xbundle_copy(new_xbridge, xbundle); >> } >> @@ -728,7 +746,7 @@ xlate_ofproto_set(struct ofproto_dpif *ofproto, const >> char *name, >> const struct netflow *netflow, enum ofp_config_flags frag, >> bool forward_bpdu, bool has_in_band, bool enable_recirc, >> bool variable_length_userdata, size_t max_mpls_depth, >> - bool masked_set_action) >> + bool masked_set_action, bool enable_userspace_tunneling) >> { >> struct xbridge *xbridge; >> >> @@ -749,7 +767,7 @@ xlate_ofproto_set(struct ofproto_dpif *ofproto, const >> char *name, >> rstp, ms, mbridge, sflow, ipfix, netflow, frag, >> forward_bpdu, has_in_band, enable_recirc, >> variable_length_userdata, max_mpls_depth, >> - masked_set_action); >> + masked_set_action, enable_userspace_tunneling); >> } >> >> static void >> @@ -2133,6 +2151,10 @@ xlate_normal(struct xlate_ctx *ctx) >> return; >> } >> >> + if (ctx->xbridge->has_tunnel_port) { >> + tnl_arp_snoop(flow, wc, ctx->xbridge->name); >> + } >> + >> /* Learn source MAC. */ >> if (ctx->xin->may_learn) { >> update_learning_table(ctx->xbridge, flow, wc, vlan, in_xbundle); >> @@ -2469,6 +2491,120 @@ process_special(struct xlate_ctx *ctx, const struct >> flow *flow, >> } >> } >> >> +static int >> +tnl_route_lookup_flow(const struct flow *oflow, >> + ovs_be32 *ip, struct xport **out_port) >> +{ >> + char out_dev[IFNAMSIZ]; >> + struct xbridge *xbridge; >> + struct xlate_cfg *xcfg; >> + ovs_be32 gw; >> + int res; >> + >> + res = tnl_route_lookup(oflow->tunnel.ip_dst, out_dev, &gw); >> + if (res) { >> + return res; >> + } >> + >> + if (gw) { >> + *ip = gw; >> + } else { >> + *ip = oflow->tunnel.ip_dst; >> + } >> + >> + xcfg = ovsrcu_get(struct xlate_cfg *, &xcfgp); >> + ovs_assert(xcfg); >> + >> + HMAP_FOR_EACH (xbridge, hmap_node, &xcfg->xbridges) { >> + if (!strncmp(xbridge->name, out_dev, IFNAMSIZ)) { >> + struct xport *port; >> + >> + HMAP_FOR_EACH (port, ofp_node, &xbridge->xports) { >> + if (!strncmp(netdev_get_name(port->netdev), out_dev, >> IFNAMSIZ)) { >> + *out_port = port; >> + return 0; >> + } >> + } >> + } >> + } > > Could the tnl_route_lookup return the ‘port’ directly? Would be much cheaper. > it does not know port number, Once we move odp port number info to netdev layer, we can use it here. But I think it can be done later. >> + return -ENOENT; >> +} >> + >> +static int >> +xlate_recv_packet(struct xbridge *xbridge, struct ofpbuf *packet) > > Should this be called slate_flood_packet() instead? Should we make > xlate_send_packet() support flooding instead? E.g., NULL port -> flood? > ok. >> +{ >> + struct ofpact_output output; >> + struct flow flow; >> + >> + ofpact_init(&output.ofpact, OFPACT_OUTPUT, sizeof output); >> + /* Use OFPP_NONE as the in_port to avoid special packet processing. */ >> + flow_extract(packet, NULL, &flow); >> + flow.in_port.ofp_port = OFPP_NONE; >> + output.port = OFPP_NORMAL; > > Could this be OFPP_FLOOD instead? > ok. >> + output.max_len = 0; >> + >> + return ofproto_dpif_execute_actions(xbridge->ofproto, &flow, NULL, >> + &output.ofpact, sizeof output, >> + packet); >> +} >> + >> + >> +static void >> +tnl_arp_request(const struct xport *out_dev, const uint8_t >> eth_src[ETH_ADDR_LEN], >> + ovs_be32 ip_src, ovs_be32 ip_dst) >> +{ >> + struct xbridge *xbridge = out_dev->xbridge; >> + struct ofpbuf packet; >> + >> + ofpbuf_init(&packet, 0); >> + compose_arp(&packet, eth_src, ip_src, ip_dst); >> + >> + xlate_recv_packet(xbridge, &packet); >> + ofpbuf_uninit(&packet); >> +} >> + >> +static int >> +build_tunnel_send(const struct xlate_ctx *ctx, const struct xport *xport, >> + const struct flow *flow, odp_port_t tunnel_odp_port) >> +{ >> + struct xport *out_dev = NULL; >> + struct ofpbuf tnl_header; >> + char header_stub[256]; >> + ovs_be32 s_ip, d_ip = 0; >> + uint8_t smac[ETH_ADDR_LEN]; >> + bool send_arp; >> + int res; >> + >> + res = tnl_route_lookup_flow(flow, &d_ip, &out_dev); >> + if (res) { >> + return res; >> + } >> + >> + /* Use mac addr of bridge port of the peer. */ >> + res = netdev_get_etheraddr(out_dev->netdev, smac); >> + if (res) { >> + return res; >> + } >> + res = netdev_get_in4(out_dev->netdev, (struct in_addr *) &s_ip, NULL); >> + if (res) { >> + return res; >> + } >> + >> + ofpbuf_use_stub(&tnl_header, header_stub, sizeof header_stub); >> + res = tnl_port_build_header(xport->ofport, d_ip, flow, >> + out_dev->xbridge->name, smac, s_ip, >> + &tnl_header, &send_arp); >> + if (res) { >> + if (send_arp) { >> + tnl_arp_request(out_dev, smac, s_ip, d_ip); >> + } >> + return res; >> + } >> + odp_put_build_tunnel_header_action(ctx->xout->odp_actions, &tnl_header, >> + tunnel_odp_port, out_dev->odp_port); >> + return 0; >> +} >> + >> static void >> compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port, >> bool check_stp) >> @@ -2613,9 +2749,15 @@ compose_output_action__(struct xlate_ctx *ctx, >> ofp_port_t ofp_port, >> entry->u.dev.tx = netdev_ref(xport->netdev); >> } >> out_port = odp_port; >> - commit_odp_tunnel_action(flow, &ctx->base_flow, >> - ctx->xout->odp_actions); >> - flow->tunnel = flow_tnl; /* Restore tunnel metadata */ >> + if (ctx->xbridge->enable_userspace_tunneling) { >> + build_tunnel_send(ctx, xport, flow, odp_port); >> + goto out; >> + } else { >> + commit_odp_tunnel_action(flow, &ctx->base_flow, >> + ctx->xout->odp_actions); >> + flow->tunnel = flow_tnl; /* Restore tunnel metadata */ >> + } >> + >> } else { >> odp_port = xport->odp_port; >> out_port = odp_port; >> @@ -2635,7 +2777,7 @@ compose_output_action__(struct xlate_ctx *ctx, >> ofp_port_t ofp_port, >> if (out_port != ODPP_NONE) { >> ctx->xout->slow |= commit_odp_actions(flow, &ctx->base_flow, >> ctx->xout->odp_actions, >> - &ctx->xout->wc, >> + wc, >> >> ctx->xbridge->masked_set_action); >> >> if (ctx->use_recirc) { >> @@ -2653,6 +2795,18 @@ compose_output_action__(struct xlate_ctx *ctx, >> ofp_port_t ofp_port, >> nl_msg_put_u32(ctx->xout->odp_actions, OVS_ACTION_ATTR_RECIRC, >> xr->recirc_id); >> } else { >> + /* XXX: Write better Filter for tunnel port. We can use inport >> + * int tunnel-port flow to avoid these checks completely. */ >> + if (ofp_port == OFPP_LOCAL && ctx->xbridge->has_tunnel_port) { >> + odp_port_t odp_tnl_port; >> + >> + odp_tnl_port = tnl_port_lookup(flow, wc); >> + if (odp_tnl_port != ODPP_NONE) { >> + nl_msg_put_odp_port(ctx->xout->odp_actions, >> OVS_ACTION_ATTR_TUNNEL_POP, >> + odp_tnl_port); >> + goto out; >> + } >> + } >> add_ipfix_output_action(ctx, out_port); >> nl_msg_put_odp_port(ctx->xout->odp_actions, >> OVS_ACTION_ATTR_OUTPUT, >> out_port); >> @@ -3013,9 +3167,9 @@ execute_controller_action(struct xlate_ctx *ctx, int >> len, >> enum ofp_packet_in_reason reason, >> uint16_t controller_id) >> { >> + struct pkt_metadata md = PKT_METADATA_INITIALIZER(0); >> struct ofproto_packet_in *pin; >> struct dpif_packet *packet; >> - struct pkt_metadata md = PKT_METADATA_INITIALIZER(0); >> >> ctx->xout->slow |= SLOW_CONTROLLER; >> if (!ctx->xin->packet) { >> @@ -3023,13 +3177,13 @@ execute_controller_action(struct xlate_ctx *ctx, int >> len, >> } >> >> packet = dpif_packet_clone_from_ofpbuf(ctx->xin->packet); >> + packet->md = md; > > dpif_packet_clone_from_ofpbuf() already zero-initializes the md so it need > not be set again. > >> >> ctx->xout->slow |= commit_odp_actions(&ctx->xin->flow, &ctx->base_flow, >> ctx->xout->odp_actions, >> &ctx->xout->wc, >> ctx->xbridge->masked_set_action); >> >> - packet->md = md; >> odp_execute_actions(NULL, &packet, 1, false, >> ofpbuf_data(ctx->xout->odp_actions), >> ofpbuf_size(ctx->xout->odp_actions), NULL); >> diff --git a/ofproto/ofproto-dpif-xlate.h b/ofproto/ofproto-dpif-xlate.h >> index 5ef20b1..6cf80ad 100644 >> --- a/ofproto/ofproto-dpif-xlate.h >> +++ b/ofproto/ofproto-dpif-xlate.h >> @@ -154,7 +154,8 @@ void xlate_ofproto_set(struct ofproto_dpif *, const char >> *name, >> bool has_in_band, bool enable_recirc, >> bool variable_length_userdata, >> size_t mpls_label_stack_length, >> - bool masked_set_action); >> + bool masked_set_action, >> + bool enable_userspace_tunneling); >> void xlate_remove_ofproto(struct ofproto_dpif *); >> >> void xlate_bundle_set(struct ofproto_dpif *, struct ofbundle *, >> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c >> index d965d38..f22ac9c 100644 >> --- a/ofproto/ofproto-dpif.c >> +++ b/ofproto/ofproto-dpif.c >> @@ -59,6 +59,7 @@ >> #include "ofproto-dpif-upcall.h" >> #include "ofproto-dpif-xlate.h" >> #include "poll-loop.h" >> +#include "tnl-router.h" >> #include "seq.h" >> #include "simap.h" >> #include "smap.h" >> @@ -280,6 +281,9 @@ struct dpif_backer { >> /* Maximum number of MPLS label stack entries that the datapath supports >> * in a match */ >> size_t max_mpls_depth; >> + >> + /* True if the datapath supports tnl_push and pop actions. */ >> + bool enable_userspace_tunneling; >> }; >> >> /* All existing ofproto_backer instances, indexed by ofproto->up.type. */ >> @@ -581,7 +585,7 @@ type_run(const char *type) >> >> iter->odp_port = node ? u32_to_odp(node->data) : ODPP_NONE; >> if (tnl_port_reconfigure(iter, iter->up.netdev, >> - iter->odp_port)) { >> + iter->odp_port, dp_port)) { >> backer->need_revalidate = REV_RECONFIGURE; >> } >> } >> @@ -624,7 +628,8 @@ type_run(const char *type) >> ofproto->backer->enable_recirc, >> ofproto->backer->variable_length_userdata, >> ofproto->backer->max_mpls_depth, >> - ofproto->backer->masked_set_action); >> + ofproto->backer->masked_set_action, >> + ofproto->backer->enable_userspace_tunneling); >> >> HMAP_FOR_EACH (bundle, hmap_node, &ofproto->bundles) { >> xlate_bundle_set(ofproto, bundle, bundle->name, >> @@ -949,6 +954,9 @@ open_dpif_backer(const char *type, struct dpif_backer >> **backerp) >> backer->masked_set_action = check_masked_set_action(backer); >> backer->rid_pool = recirc_id_pool_create(); >> >> + backer->enable_userspace_tunneling = dpif_is_userspace_tunneling_on() && >> + >> dpif_support_userspace_tunneling(backer->dpif); >> + >> error = dpif_recv_set(backer->dpif, backer->recv_set_enable); >> if (error) { >> VLOG_ERR("failed to listen on datapath of type %s: %s", >> @@ -1225,6 +1233,7 @@ construct(struct ofproto *ofproto_) >> guarded_list_init(&ofproto->pins); >> >> ofproto_dpif_unixctl_init(); >> + tnl_route_unixctl_register(); >> >> hmap_init(&ofproto->vlandev_map); >> hmap_init(&ofproto->realdev_vid_map); >> @@ -1517,6 +1526,9 @@ run(struct ofproto *ofproto_) >> } >> } >> >> + if (tnl_route_reconfigured()) { >> + ofproto->backer->need_revalidate = REV_MCAST_SNOOPING; >> + } >> return 0; >> } >> >> @@ -1668,7 +1680,7 @@ port_construct(struct ofport *port_) >> port->odp_port = dpif_port.port_no; >> >> if (netdev_get_tunnel_config(netdev)) { >> - tnl_port_add(port, port->up.netdev, port->odp_port); >> + tnl_port_add(port, port->up.netdev, port->odp_port, namebuf); >> port->is_tunnel = true; >> if (ofproto->ipfix) { >> dpif_ipfix_add_tunnel_port(ofproto->ipfix, port_, port->odp_port); >> @@ -1759,24 +1771,28 @@ static void >> port_modified(struct ofport *port_) >> { >> struct ofport_dpif *port = ofport_dpif_cast(port_); >> + char namebuf[NETDEV_VPORT_NAME_BUFSIZE]; >> + struct netdev *netdev = port->up.netdev; >> >> if (port->bundle && port->bundle->bond) { >> - bond_slave_set_netdev(port->bundle->bond, port, port->up.netdev); >> + bond_slave_set_netdev(port->bundle->bond, port, netdev); >> } >> >> if (port->cfm) { >> - cfm_set_netdev(port->cfm, port->up.netdev); >> + cfm_set_netdev(port->cfm, netdev); >> } >> >> if (port->bfd) { >> - bfd_set_netdev(port->bfd, port->up.netdev); >> + bfd_set_netdev(port->bfd, netdev); >> } >> >> ofproto_dpif_monitor_port_update(port, port->bfd, port->cfm, >> port->up.pp.hw_addr); >> >> - if (port->is_tunnel && tnl_port_reconfigure(port, port->up.netdev, >> - port->odp_port)) { >> + netdev_vport_get_dpif_port(netdev, namebuf, sizeof namebuf); >> + >> + if (port->is_tunnel && tnl_port_reconfigure(port, netdev, >> + port->odp_port, namebuf)) { >> ofproto_dpif_cast(port->up.ofproto)->backer->need_revalidate = >> REV_RECONFIGURE; >> } >> @@ -3462,8 +3478,10 @@ ofproto_dpif_execute_actions(struct ofproto_dpif >> *ofproto, >> struct xlate_in xin; >> ofp_port_t in_port; >> struct dpif_execute execute; >> + struct ds ds; >> int error; >> >> + ds_init(&ds); >> ovs_assert((rule != NULL) != (ofpacts != NULL)); >> >> dpif_flow_stats_extract(flow, packet, time_msec(), &stats); >> @@ -3481,6 +3499,10 @@ ofproto_dpif_execute_actions(struct ofproto_dpif >> *ofproto, >> >> execute.actions = ofpbuf_data(xout.odp_actions); >> execute.actions_len = ofpbuf_size(xout.odp_actions); >> + >> + format_odp_actions(&ds, ofpbuf_data(xout.odp_actions), >> + ofpbuf_size(xout.odp_actions)); >> + > > These additions look like dead code now. > right, removed. it. >> execute.packet = packet; >> execute.md = pkt_metadata_from_flow(flow); >> execute.needs_help = (xout.slow & SLOW_ACTION) != 0; >> diff --git a/ofproto/tunnel.c b/ofproto/tunnel.c >> index 46b0719..120af46 100644 >> --- a/ofproto/tunnel.c >> +++ b/ofproto/tunnel.c >> @@ -17,19 +17,25 @@ >> >> #include <errno.h> >> >> +#include "tnl-arp-cache.h" >> #include "byte-order.h" >> #include "connectivity.h" >> +#include "dpif.h" >> #include "dynamic-string.h" >> #include "hash.h" >> #include "hmap.h" >> #include "netdev.h" >> #include "odp-util.h" >> +#include "ofpbuf.h" >> #include "packets.h" >> #include "seq.h" >> #include "smap.h" >> #include "socket-util.h" >> #include "tunnel.h" >> +#include "tnl-ports.h" >> #include "vlog.h" >> +#include "unaligned.h" >> +#include "ofproto-dpif.h" >> >> VLOG_DEFINE_THIS_MODULE(tunnel); >> >> @@ -124,7 +130,7 @@ static void tnl_port_del__(const struct ofport_dpif *) >> OVS_REQ_WRLOCK(rwlock); >> >> static bool >> tnl_port_add__(const struct ofport_dpif *ofport, const struct netdev *netdev, >> - odp_port_t odp_port, bool warn) >> + odp_port_t odp_port, bool warn, const char name[]) >> OVS_REQ_WRLOCK(rwlock) >> { >> const struct netdev_tunnel_config *cfg; >> @@ -173,6 +179,12 @@ tnl_port_add__(const struct ofport_dpif *ofport, const >> struct netdev *netdev, >> } >> hmap_insert(*map, &tnl_port->match_node, tnl_hash(&tnl_port->match)); >> tnl_port_mod_log(tnl_port, "adding"); >> + >> + if (dpif_is_userspace_tunneling_on()) { >> + /* XXX: Add generic API to handle other tunneling protocols. */ >> + tnl_port_insert_udp(odp_port, tnl_port->match.ip_dst, >> + cfg->dst_port, name); >> + } >> return true; >> } >> >> @@ -181,10 +193,10 @@ tnl_port_add__(const struct ofport_dpif *ofport, const >> struct netdev *netdev, >> * tunnel. */ >> void >> tnl_port_add(const struct ofport_dpif *ofport, const struct netdev *netdev, >> - odp_port_t odp_port) OVS_EXCLUDED(rwlock) >> + odp_port_t odp_port, const char name[]) OVS_EXCLUDED(rwlock) >> { >> ovs_rwlock_wrlock(&rwlock); >> - tnl_port_add__(ofport, netdev, odp_port, true); >> + tnl_port_add__(ofport, netdev, odp_port, true, name); >> ovs_rwlock_unlock(&rwlock); >> } >> >> @@ -194,7 +206,8 @@ tnl_port_add(const struct ofport_dpif *ofport, const >> struct netdev *netdev, >> * tnl_port_add(). */ >> bool >> tnl_port_reconfigure(const struct ofport_dpif *ofport, >> - const struct netdev *netdev, odp_port_t odp_port) >> + const struct netdev *netdev, odp_port_t odp_port, >> + const char name[]) >> OVS_EXCLUDED(rwlock) >> { >> struct tnl_port *tnl_port; >> @@ -203,13 +216,13 @@ tnl_port_reconfigure(const struct ofport_dpif *ofport, >> ovs_rwlock_wrlock(&rwlock); >> tnl_port = tnl_find_ofport(ofport); >> if (!tnl_port) { >> - changed = tnl_port_add__(ofport, netdev, odp_port, false); >> + changed = tnl_port_add__(ofport, netdev, odp_port, false, name); >> } else if (tnl_port->netdev != netdev >> || tnl_port->match.odp_port != odp_port >> || tnl_port->change_seq != seq_read(connectivity_seq_get())) { >> VLOG_DBG("reconfiguring %s", tnl_port_get_name(tnl_port)); >> tnl_port_del__(ofport); >> - tnl_port_add__(ofport, netdev, odp_port, true); >> + tnl_port_add__(ofport, netdev, odp_port, true, name); >> changed = true; >> } >> ovs_rwlock_unlock(&rwlock); >> @@ -227,8 +240,11 @@ tnl_port_del__(const struct ofport_dpif *ofport) >> OVS_REQ_WRLOCK(rwlock) >> >> tnl_port = tnl_find_ofport(ofport); >> if (tnl_port) { >> + const struct netdev_tunnel_config *cfg = >> + netdev_get_tunnel_config(tnl_port->netdev); >> struct hmap **map; >> >> + tnl_port_delete(tnl_port->match.ip_dst, cfg->dst_port); >> tnl_port_mod_log(tnl_port, "removing"); >> map = tnl_match_map(&tnl_port->match); >> hmap_remove(*map, &tnl_port->match_node); >> @@ -639,3 +655,54 @@ tnl_port_get_name(const struct tnl_port *tnl_port) >> OVS_REQ_RDLOCK(rwlock) >> { >> return netdev_get_name(tnl_port->netdev); >> } >> + >> +int >> +tnl_port_build_header(const struct ofport_dpif *ofport, ovs_be32 router_ip, >> + const struct flow *tnl_flow, >> + const char br_name[], uint8_t smac[ETH_ADDR_LEN], >> + ovs_be32 ip_src, struct ofpbuf *header, >> + bool *send_arp) >> +{ >> + struct tnl_port *tnl_port; >> + uint8_t dmac[ETH_ADDR_LEN]; >> + struct eth_header *eth; >> + struct ip_header *ip; >> + int res; >> + >> + *send_arp = false; >> + ovs_rwlock_rdlock(&rwlock); >> + tnl_port = tnl_find_ofport(ofport); >> + ovs_assert(tnl_port); >> + >> + res = tnl_arp_lookup(br_name, router_ip, dmac); >> + if (res) { >> + *send_arp = true; >> + goto out; >> + } >> + >> + /* Build Ethernet and IP headers. */ >> + ofpbuf_clear(header); >> + >> + eth = ofpbuf_put_uninit(header, sizeof *eth); >> + memcpy(eth->eth_dst, dmac, ETH_ADDR_LEN); >> + memcpy(eth->eth_src, smac, ETH_ADDR_LEN); >> + eth->eth_type = htons(ETH_TYPE_IP); >> + >> + ip = ofpbuf_put_zeros(header, sizeof *ip); >> + ip->ip_ihl_ver = IP_IHL_VER(5, 4); >> + ip->ip_tos = tnl_flow->tunnel.ip_tos; >> + ip->ip_ttl = tnl_flow->tunnel.ip_ttl; >> + ip->ip_proto = IPPROTO_UDP; >> + ip->ip_frag_off = (tnl_flow->tunnel.flags & FLOW_TNL_F_DONT_FRAGMENT) ? >> + htons(IP_DF) : 0; >> + >> + put_16aligned_be32(&ip->ip_src, ip_src); >> + put_16aligned_be32(&ip->ip_dst, tnl_flow->tunnel.ip_dst); >> + ip->ip_csum = 0; >> + >> + netdev_build_header(tnl_port->netdev, header); >> +out: >> + ovs_rwlock_unlock(&rwlock); >> + >> + return res; >> +} >> diff --git a/ofproto/tunnel.h b/ofproto/tunnel.h >> index 27a2f7d..611c139 100644 >> --- a/ofproto/tunnel.h >> +++ b/ofproto/tunnel.h >> @@ -29,10 +29,10 @@ struct ofport_dpif; >> struct netdev; >> >> bool tnl_port_reconfigure(const struct ofport_dpif *, const struct netdev *, >> - odp_port_t); >> + odp_port_t, const char name[]); >> >> void tnl_port_add(const struct ofport_dpif *, const struct netdev *, >> - odp_port_t odp_port); >> + odp_port_t odp_port, const char name[]); >> void tnl_port_del(const struct ofport_dpif *); >> >> const struct ofport_dpif *tnl_port_receive(const struct flow *); >> @@ -48,4 +48,10 @@ tnl_port_should_receive(const struct flow *flow) >> return flow->tunnel.ip_dst != 0; >> } >> >> +int tnl_port_build_header(const struct ofport_dpif *ofport, ovs_be32 >> router_ip, >> + const struct flow *tnl_flow, >> + const char br_name[], uint8_t smac[ETH_ADDR_LEN], >> + ovs_be32 ip_src, struct ofpbuf *header, >> + bool *send_arp); >> + >> #endif /* tunnel.h */ >> diff --git a/rhel/openvswitch.spec.in b/rhel/openvswitch.spec.in >> index 56b7404..b63f00c 100644 >> --- a/rhel/openvswitch.spec.in >> +++ b/rhel/openvswitch.spec.in >> @@ -175,6 +175,6 @@ exit 0 >> /usr/share/openvswitch/vswitch.ovsschema >> /usr/share/openvswitch/vtep.ovsschema >> %doc COPYING DESIGN INSTALL.SSL NOTICE README.md WHY-OVS FAQ NEWS >> -%doc INSTALL.DPDK rhel/README.RHEL >> +%doc INSTALL.DPDK rhel/README.RHEL README-Userspace-Tunneling >> /var/lib/openvswitch >> /var/log/openvswitch >> diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in >> index 6e4929c..56e4a91 100644 >> --- a/vswitchd/ovs-vswitchd.8.in >> +++ b/vswitchd/ovs-vswitchd.8.in >> @@ -82,6 +82,10 @@ Some systems do not support \fBmlockall()\fR at all, and >> other systems >> only allow privileged users, such as the superuser, to use it. >> \fBovs\-vswitchd\fR emits a log message if \fBmlockall()\fR is >> unavailable or unsuccessful. >> +.IP "\fB\-\-userspace-tunneling\fR" >> +Enable userspace tunneling in \fBovs\-vswitchd\fR for userspace datapath. >> User >> +need to do special configuration to make it work. For more info refer to >> +README-Userspace-Tunneling. >> . >> .SS "DPDK Options" >> .IP "\fB\-\-dpdk\fR" >> diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c >> index 3c82f0f..b90a592 100644 >> --- a/vswitchd/ovs-vswitchd.c >> +++ b/vswitchd/ovs-vswitchd.c >> @@ -149,6 +149,7 @@ parse_options(int argc, char *argv[], char >> **unixctl_pathp) >> OPT_DISABLE_SYSTEM, >> DAEMON_OPTION_ENUMS, >> OPT_DPDK, >> + OPT_ENABLE_USER_TUNNEL, >> }; >> static const struct option long_options[] = { >> {"help", no_argument, NULL, 'h'}, >> @@ -163,6 +164,7 @@ parse_options(int argc, char *argv[], char >> **unixctl_pathp) >> {"enable-dummy", optional_argument, NULL, OPT_ENABLE_DUMMY}, >> {"disable-system", no_argument, NULL, OPT_DISABLE_SYSTEM}, >> {"dpdk", required_argument, NULL, OPT_DPDK}, >> + {"userspace-tunneling", no_argument, NULL, OPT_ENABLE_USER_TUNNEL}, >> {NULL, 0, NULL, 0}, >> }; >> char *short_options = long_options_to_short_options(long_options); >> @@ -211,6 +213,10 @@ parse_options(int argc, char *argv[], char >> **unixctl_pathp) >> dp_blacklist_provider("system"); >> break; >> >> + case OPT_ENABLE_USER_TUNNEL: >> + dpif_enable_userspace_tunneling(); >> + break; >> + > > IMO we should use probing instead of explicit configuration parameter. > removed this vswitch option. > Jarno > > >> case '?': >> exit(EXIT_FAILURE); >> >> -- >> 1.9.3 >> >> _______________________________________________ >> dev mailing list >> dev@openvswitch.org >> http://openvswitch.org/mailman/listinfo/dev > _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev