With this support, an L3 logical router can transmit packets to VMs (and other destinations) without knowing the IP-to-MAC binding in advance. The details are documented in ovn-northd(8), ovn-architecture(7), and ovn-sb(5).
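As a rough illustration of the semantics only (not the code in this patch), the sketch below models the binding table as a small in-memory array. The names mac_table, mac_table_put, mac_table_get and MAX_BINDINGS are hypothetical; the real bindings live in the southbound MAC_Binding table and are installed as OpenFlow flows. A "put" updates the MAC for an existing (logical port, IP) pair or adds a new row, and a "get" yields the all-zeros MAC when nothing is known, which is the value the ARP Request table matches on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cap on table size, for illustration only. */
#define MAX_BINDINGS 8

/* One IP-to-MAC binding, keyed by (logical_port, ip). */
struct mac_binding {
    char logical_port[32];
    char ip[16];                /* Dotted-quad IPv4 address. */
    char mac[18];               /* xx:xx:xx:xx:xx:xx. */
};

/* Hypothetical in-memory stand-in for the MAC_Binding table. */
struct mac_table {
    struct mac_binding rows[MAX_BINDINGS];
    size_t n;
};

/* Models "put_arp": updates the binding for (port, ip) if one exists,
 * otherwise appends a new one.  Returns 0 on success, -1 if the table is
 * full. */
static int
mac_table_put(struct mac_table *t, const char *port, const char *ip,
              const char *mac)
{
    assert(t && port && ip && mac);

    for (size_t i = 0; i < t->n; i++) {
        struct mac_binding *b = &t->rows[i];
        if (!strcmp(b->logical_port, port) && !strcmp(b->ip, ip)) {
            snprintf(b->mac, sizeof b->mac, "%s", mac);
            return 0;
        }
    }

    if (t->n >= MAX_BINDINGS) {
        return -1;
    }

    struct mac_binding *b = &t->rows[t->n++];
    snprintf(b->logical_port, sizeof b->logical_port, "%s", port);
    snprintf(b->ip, sizeof b->ip, "%s", ip);
    snprintf(b->mac, sizeof b->mac, "%s", mac);
    return 0;
}

/* Models "get_arp": returns the MAC bound to (port, ip), or the all-zeros
 * MAC if there is no binding yet. */
static const char *
mac_table_get(const struct mac_table *t, const char *port, const char *ip)
{
    assert(t && port && ip);

    for (size_t i = 0; i < t->n; i++) {
        const struct mac_binding *b = &t->rows[i];
        if (!strcmp(b->logical_port, port) && !strcmp(b->ip, ip)) {
            return b->mac;
        }
    }
    return "00:00:00:00:00:00";
}
```

A caller would treat an all-zeros result from mac_table_get() as "send an ARP request" and call mac_table_put() when the reply arrives.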
There are several important caveats that need to be fixed before this can be taken seriously in production. These are documented in ovn/TODO. The most important of these are renewal, expiration, and limiting the size of the ARP table. Signed-off-by: Ben Pfaff <b...@ovn.org> --- ovn/TODO | 55 ++--------- ovn/controller/lflow.c | 86 +++++++++++++++-- ovn/controller/lflow.h | 1 + ovn/controller/ovn-controller.c | 9 +- ovn/controller/pinctrl.c | 197 +++++++++++++++++++++++++++++++++----- ovn/controller/pinctrl.h | 8 +- ovn/lib/actions.c | 203 +++++++++++++++++++++++++++++++++++++--- ovn/lib/actions.h | 11 +++ ovn/lib/expr.c | 53 +++++++++++ ovn/lib/expr.h | 3 + ovn/northd/ovn-northd.8.xml | 112 +++++++++++++--------- ovn/northd/ovn-northd.c | 105 ++++++++++++++++----- ovn/ovn-architecture.7.xml | 78 ++++++++++----- ovn/ovn-sb.ovsschema | 15 ++- ovn/ovn-sb.xml | 137 ++++++++++++++++++++++++++- ovn/utilities/ovn-sbctl.c | 4 + tests/ovn.at | 187 +++++++++++++++++++++++++++++++++--- tests/test-ovn.c | 1 + 18 files changed, 1053 insertions(+), 212 deletions(-) diff --git a/ovn/TODO b/ovn/TODO index b08a4f4..93ef4a3 100644 --- a/ovn/TODO +++ b/ovn/TODO @@ -4,18 +4,11 @@ ** New OVN logical actions -*** arp - -Generates an ARP packet based on the current IPv4 packet and allows it -to be processed as part of the current pipeline (and then pop back to -processing the original IPv4 packet). +*** rate_limit TCP/IP stacks typically limit the rate at which ARPs are sent, e.g. to one per second for a given target. We might need to do this too. -We probably need to buffer the packet that generated the ARP. I don't -know where to do that. - *** icmp4 { action... } Generates an ICMPv4 packet based on the current IPv4 packet and @@ -60,37 +53,13 @@ the "arp" action, and an action for generating ** Dynamic IP to MAC bindings -Some bindings from IP address to MAC will undoubtedly need to be -discovered dynamically through ARP requests. 
It's straightforward -enough for a logical L3 router to generate ARP requests and forward -them to the appropriate switch. - -It's more difficult to figure out where the reply should be processed -and stored. It might seem at first that a first-cut implementation -could just keep track of the binding on the hypervisor that needs to -know, but that can't happen easily because the VM that sends the reply -might not be on the same HV as the VM that needs the answer (that is, -the VM that sent the packet that needs the binding to be resolved) and -there isn't an easy way for it to know which HV needs the answer. - -Thus, the HV that processes the ARP reply (which is unknown when the -ARP is sent) has to tell all the HVs the binding. The most obvious -place for this in the OVN_Southbound database. - -Details need to be worked out, including: - -*** OVN_Southbound schema changes. +OVN has basic support for establishing IP to MAC bindings dynamically, +using ARP. -Possibly bindings could be added to the Port_Binding table by adding -or modifying columns. Another possibility is that another table -should be added. +*** Ratelimiting. -*** Logical_Flow representation - -It would be really nice to maintain the general-purpose nature of -logical flows, but these bindings might have to include some -hard-coded special cases, especially when it comes to the relationship -with populating the bindings into the OVN_Southbound table. +From casual observation, Linux appears to generate at most one ARP per +second per destination. *** Tracking queries @@ -104,16 +73,12 @@ into the database. Something needs to make sure that bindings remain valid and expire those that become stale. -** MTU handling (fragmentation on output) - -** Ratelimiting. +*** Table size limiting. -*** ARP. +The table of MAC bindings must not be allowed to grow unreasonably +large. -*** ICMP error generation, TCP reset, UDP unreachable, protocol unreachable, ... 
- -As a point of comparison, Linux doesn't ratelimit TCP resets but I -think it does everything else. +** MTU handling (fragmentation on output) * ovn-controller diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c index bd08ad6..f16c5c9 100644 --- a/ovn/controller/lflow.c +++ b/ovn/controller/lflow.c @@ -191,15 +191,13 @@ is_switch(const struct sbrec_datapath_binding *ldp) } -/* Translates logical flows in the Logical_Flow table in the OVN_SB database - * into OpenFlow flows. See ovn-architecture(7) for more information. */ -void -lflow_run(struct controller_ctx *ctx, const struct lport_index *lports, - const struct mcgroup_index *mcgroups, - const struct hmap *local_datapaths, - const struct simap *ct_zones, struct hmap *flow_table) +/* Adds the logical flows from the Logical_Flow table to 'flow_table'. */ +static void +add_logical_flows(struct controller_ctx *ctx, const struct lport_index *lports, + const struct mcgroup_index *mcgroups, + const struct hmap *local_datapaths, + const struct simap *ct_zones, struct hmap *flow_table) { - struct hmap flows = HMAP_INITIALIZER(&flows); uint32_t conj_id_ofs = 1; const struct sbrec_logical_flow *lflow; @@ -273,6 +271,7 @@ lflow_run(struct controller_ctx *ctx, const struct lport_index *lports, .first_ptable = first_ptable, .cur_ltable = lflow->table_id, .output_ptable = output_ptable, + .arp_ptable = OFTABLE_MAC_BINDING, }; error = actions_parse_string(lflow->actions, &ap, &ofpacts, &prereqs); if (error) { @@ -349,6 +348,77 @@ lflow_run(struct controller_ctx *ctx, const struct lport_index *lports, } } +static void +put_load(const uint8_t *data, size_t len, + enum mf_field_id dst, int ofs, int n_bits, + struct ofpbuf *ofpacts) +{ + struct ofpact_set_field *sf = ofpact_put_SET_FIELD(ofpacts); + sf->field = mf_from_id(dst); + sf->flow_has_vlan = false; + + bitwise_copy(data, len, 0, &sf->value, sf->field->n_bytes, ofs, n_bits); + bitwise_one(&sf->mask, sf->field->n_bytes, ofs, n_bits); +} + +/* Adds a flow to table 
*/ +static void +add_neighbor_flows(struct controller_ctx *ctx, + const struct lport_index *lports, struct hmap *flow_table) +{ + struct ofpbuf ofpacts; + struct match match; + match_init_catchall(&match); + ofpbuf_init(&ofpacts, 0); + + const struct sbrec_mac_binding *b; + SBREC_MAC_BINDING_FOR_EACH (b, ctx->ovnsb_idl) { + const struct sbrec_port_binding *pb + = lport_lookup_by_name(lports, b->logical_port); + if (!pb) { + continue; + } + + struct eth_addr mac; + if (!eth_addr_from_string(b->mac, &mac)) { + static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1); + VLOG_WARN_RL(&rl, "bad 'mac' %s", b->mac); + continue; + } + + ovs_be32 ip; + if (!ip_parse(b->ip, &ip)) { + static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1); + VLOG_WARN_RL(&rl, "bad 'ip' %s", b->ip); + continue; + } + + match_set_metadata(&match, htonll(pb->datapath->tunnel_key)); + match_set_reg(&match, MFF_LOG_OUTPORT - MFF_REG0, pb->tunnel_key); + match_set_reg(&match, 0, ntohl(ip)); + + ofpbuf_clear(&ofpacts); + put_load(mac.ea, sizeof mac.ea, MFF_ETH_DST, 0, 48, &ofpacts); + + ofctrl_add_flow(flow_table, OFTABLE_MAC_BINDING, 100, + &match, &ofpacts); + } + ofpbuf_uninit(&ofpacts); +} + +/* Translates logical flows in the Logical_Flow table in the OVN_SB database + * into OpenFlow flows. See ovn-architecture(7) for more information. 
*/ +void +lflow_run(struct controller_ctx *ctx, const struct lport_index *lports, + const struct mcgroup_index *mcgroups, + const struct hmap *local_datapaths, + const struct simap *ct_zones, struct hmap *flow_table) +{ + add_logical_flows(ctx, lports, mcgroups, local_datapaths, + ct_zones, flow_table); + add_neighbor_flows(ctx, lports, flow_table); +} + void lflow_destroy(void) { diff --git a/ovn/controller/lflow.h b/ovn/controller/lflow.h index 603edfd..ff823d4 100644 --- a/ovn/controller/lflow.h +++ b/ovn/controller/lflow.h @@ -53,6 +53,7 @@ struct uuid; #define OFTABLE_DROP_LOOPBACK 34 #define OFTABLE_LOG_EGRESS_PIPELINE 48 /* First of LOG_PIPELINE_LEN tables. */ #define OFTABLE_LOG_TO_PHY 64 +#define OFTABLE_MAC_BINDING 65 /* The number of tables for the ingress and egress pipelines. */ #define LOG_PIPELINE_LEN 16 diff --git a/ovn/controller/ovn-controller.c b/ovn/controller/ovn-controller.c index 517ca2a..9503cc9 100644 --- a/ovn/controller/ovn-controller.c +++ b/ovn/controller/ovn-controller.c @@ -302,7 +302,7 @@ main(int argc, char *argv[]) enum mf_field_id mff_ovn_geneve = ofctrl_run(br_int); - pinctrl_run(br_int); + pinctrl_run(&ctx, &lports, br_int); struct hmap flow_table = HMAP_INITIALIZER(&flow_table); lflow_run(&ctx, &lports, &mcgroups, &local_datapaths, @@ -336,13 +336,12 @@ main(int argc, char *argv[]) poll_immediate_wake(); } - ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop); - ovsdb_idl_loop_commit_and_wait(&ovs_idl_loop); - if (br_int) { ofctrl_wait(); - pinctrl_wait(); + pinctrl_wait(&ctx); } + ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop); + ovsdb_idl_loop_commit_and_wait(&ovs_idl_loop); poll_block(); if (should_service_stop()) { exiting = true; diff --git a/ovn/controller/pinctrl.c b/ovn/controller/pinctrl.c index 7b6f094..9be0882 100644 --- a/ovn/controller/pinctrl.c +++ b/ovn/controller/pinctrl.c @@ -20,14 +20,20 @@ #include "dirs.h" #include "dp-packet.h" +#include "lport.h" #include "ofp-actions.h" +#include "ovn/lib/actions.h" 
+#include "ovn/lib/logical-fields.h" #include "ofp-msgs.h" #include "ofp-print.h" #include "ofp-util.h" #include "ovn/lib/actions.h" #include "rconn.h" #include "openvswitch/vlog.h" +#include "ovn-controller.h" +#include "poll-loop.h" #include "socket-util.h" +#include "timeval.h" #include "vswitch-idl.h" VLOG_DEFINE_THIS_MODULE(pinctrl); @@ -39,6 +45,13 @@ static struct rconn *swconn; * rconn_get_connection_seqno(rconn), 'swconn' has reconnected. */ static unsigned int conn_seq_no; +static void pinctrl_handle_put_arp(const struct lport_index *, + const struct flow *md, + const struct flow *headers); +static void flush_put_arps(void); +static void run_put_arps(struct controller_ctx *); +static void wait_put_arps(struct controller_ctx *); + void pinctrl_init(void) { @@ -81,7 +94,8 @@ set_switch_config(struct rconn *swconn, } static void -pinctrl_handle_arp(const struct flow *ip_flow, struct ofpbuf *userdata) +pinctrl_handle_arp(const struct flow *ip_flow, const struct match *md, + struct ofpbuf *userdata) { /* Compose an ARP packet. */ uint64_t packet_stub[128 / 8]; @@ -106,26 +120,36 @@ pinctrl_handle_arp(const struct flow *ip_flow, struct ofpbuf *userdata) /* Compose actions. * - * First, add actions to restore the metadata, then add actions from - * 'userdata'. + * First, copy metadata from 'md' into the packet-out via "set_field" + * actions, then add actions from 'userdata'. 
*/ - uint64_t ofpacts_stub[1024 / 8]; + uint64_t ofpacts_stub[4096 / 8]; struct ofpbuf ofpacts = OFPBUF_STUB_INITIALIZER(ofpacts_stub); enum ofp_version version = rconn_get_version(swconn); - for (int id = 0; id < MFF_N_IDS; id++) { - const struct mf_field *field = mf_from_id(id); - - if (field->prereqs == MFP_NONE - && field->writable - && id != MFF_IN_PORT && id != MFF_IN_PORT_OXM - && mf_is_set(field, ip_flow)) - { + enum mf_field_id md_fields[] = { +#if FLOW_N_REGS == 8 + MFF_REG0, + MFF_REG1, + MFF_REG2, + MFF_REG3, + MFF_REG4, + MFF_REG5, + MFF_REG6, + MFF_REG7, +#else +#error +#endif + MFF_METADATA, + }; + for (size_t i = 0; i < ARRAY_SIZE(md_fields); i++) { + const struct mf_field *field = mf_from_id(md_fields[i]); + if (!mf_is_all_wild(field, &md->wc)) { struct ofpact_set_field *sf = ofpact_put_SET_FIELD(&ofpacts); sf->field = field; sf->flow_has_vlan = false; - mf_get_value(field, ip_flow, &sf->value); - bitwise_one(&sf->mask, sizeof sf->mask, 0, field->n_bits); + mf_get_value(field, &md->flow, &sf->value); + memset(&sf->mask, 0xff, field->n_bytes); } } enum ofperr error = ofpacts_pull_openflow_actions(userdata, userdata->size, @@ -137,6 +161,10 @@ pinctrl_handle_arp(const struct flow *ip_flow, struct ofpbuf *userdata) goto exit; } + struct ds s = DS_EMPTY_INITIALIZER; + ofpacts_format(ofpacts.data, ofpacts.size, &s); + ds_destroy(&s); + struct ofputil_packet_out po = { .packet = dp_packet_data(&packet), .packet_len = dp_packet_size(&packet), @@ -154,7 +182,8 @@ exit: } static void -process_packet_in(const struct ofp_header *msg) +process_packet_in(const struct lport_index *lports, + const struct ofp_header *msg) { static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5); @@ -183,21 +212,24 @@ process_packet_in(const struct ofp_header *msg) struct flow headers; flow_extract(&packet, &headers); - const struct flow *md = &pin.flow_metadata.flow; switch (ah->opcode) { case ACTION_OPCODE_ARP: - pinctrl_handle_arp(&headers, &userdata); + 
pinctrl_handle_arp(&headers, &pin.flow_metadata, &userdata); + break; + + case ACTION_OPCODE_PUT_ARP: + pinctrl_handle_put_arp(lports, &pin.flow_metadata.flow, &headers); break; default: - VLOG_WARN_RL(&rl, "unrecognized packet-in command %#"PRIx32, - md->regs[0]); + VLOG_WARN_RL(&rl, "unrecognized packet-in opcode %"PRIu32, ah->opcode); break; } } static void -pinctrl_recv(const struct ofp_header *oh, enum ofptype type) +pinctrl_recv(const struct lport_index *lports, + const struct ofp_header *oh, enum ofptype type) { if (type == OFPTYPE_ECHO_REQUEST) { queue_msg(make_echo_reply(oh)); @@ -209,7 +241,7 @@ pinctrl_recv(const struct ofp_header *oh, enum ofptype type) config.miss_send_len = UINT16_MAX; set_switch_config(swconn, &config); } else if (type == OFPTYPE_PACKET_IN) { - process_packet_in(oh); + process_packet_in(lports, oh); } else if (type != OFPTYPE_ECHO_REPLY && type != OFPTYPE_BARRIER_REPLY) { if (VLOG_IS_DBG_ENABLED()) { static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(30, 300); @@ -223,7 +255,8 @@ pinctrl_recv(const struct ofp_header *oh, enum ofptype type) } void -pinctrl_run(const struct ovsrec_bridge *br_int) +pinctrl_run(struct controller_ctx *ctx, const struct lport_index *lports, + const struct ovsrec_bridge *br_int) { if (br_int) { char *target; @@ -244,6 +277,7 @@ pinctrl_run(const struct ovsrec_bridge *br_int) if (conn_seq_no != rconn_get_connection_seqno(swconn)) { pinctrl_setup(swconn); conn_seq_no = rconn_get_connection_seqno(swconn); + flush_put_arps(); } /* Process a limited number of messages per call. 
*/ @@ -257,15 +291,18 @@ pinctrl_run(const struct ovsrec_bridge *br_int) enum ofptype type; ofptype_decode(&type, oh); - pinctrl_recv(oh, type); + pinctrl_recv(lports, oh, type); ofpbuf_delete(msg); } } + + run_put_arps(ctx); } void -pinctrl_wait(void) +pinctrl_wait(struct controller_ctx *ctx) { + wait_put_arps(ctx); rconn_run_wait(swconn); rconn_recv_wait(swconn); } @@ -274,4 +311,116 @@ void pinctrl_destroy(void) { rconn_destroy(swconn); + flush_put_arps(); +} + +/* Implementation of the "put_arp" OVN action. This action sends a packet to + * ovn-controller, using the flow as an API (see actions.h for details). This + * code implements the action by updating the MAC_Binding table in the + * southbound database. + * + * This code could be a lot simpler if the database could always be updated, + * but in fact we can only update it when ctx->ovnsb_idl_txn is nonnull. Thus, + * we buffer up a few put_arps (but we don't keep them longer than 1 second) + * and apply them whenever a database transaction is available. */ + +/* Buffered "put_arp" operations. */ +struct put_arp { + long long int timestamp; /* In milliseconds. */ + char *logical_port; + ovs_be32 ip; + struct eth_addr mac; +}; +static struct put_arp put_arps[1024]; +static size_t n_put_arps; + +static void +pinctrl_handle_put_arp(const struct lport_index *lports, + const struct flow *md, const struct flow *headers) +{ + if (n_put_arps >= ARRAY_SIZE(put_arps)) { + return; + } + + /* Convert logical datapath and logical port key into lport. 
*/ + uint32_t dp_key = ntohll(md->metadata); + uint32_t port_key = md->regs[MFF_LOG_INPORT - MFF_REG0]; + const struct sbrec_port_binding *pb + = lport_lookup_by_key(lports, dp_key, port_key); + if (!pb) { + static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5); + + VLOG_WARN_RL(&rl, "unknown logical port with datapath %"PRIu32" and " + "port %"PRIu32, dp_key, port_key); + return; + } + + struct put_arp *pa = &put_arps[n_put_arps++]; + pa->timestamp = time_msec(); + pa->logical_port = xstrdup(pb->logical_port); + pa->ip = htonl(md->regs[0]); + pa->mac = headers->dl_src; +} + +static void +flush_put_arps(void) +{ + for (struct put_arp *pa = put_arps; pa < &put_arps[n_put_arps]; pa++) { + free(pa->logical_port); + } + n_put_arps = 0; +} + +static void +run_put_arps(struct controller_ctx *ctx) +{ + if (!ctx->ovnsb_idl_txn) { + return; + } + + for (const struct put_arp *pa = put_arps; pa < &put_arps[n_put_arps]; + pa++) { + if (time_msec() > pa->timestamp + 1000) { + continue; + } + + /* Convert arguments to string form for database. */ + char ip_string[INET_ADDRSTRLEN + 1]; + snprintf(ip_string, sizeof ip_string, IP_FMT, IP_ARGS(pa->ip)); + char mac_string[ETH_ADDR_STRLEN + 1]; + snprintf(mac_string, sizeof mac_string, + ETH_ADDR_FMT, ETH_ADDR_ARGS(pa->mac)); + + /* Check for and update an existing IP-MAC binding for this logical + * port. + * + * XXX This is not very efficient. */ + const struct sbrec_mac_binding *b; + SBREC_MAC_BINDING_FOR_EACH (b, ctx->ovnsb_idl) { + if (!strcmp(b->logical_port, pa->logical_port) + && !strcmp(b->ip, ip_string)) { + if (strcmp(b->mac, mac_string)) { + sbrec_mac_binding_set_mac(b, mac_string); + } + goto next; + } + } + + /* Add new IP-MAC binding for this logical port. 
*/ + b = sbrec_mac_binding_insert(ctx->ovnsb_idl_txn); + sbrec_mac_binding_set_logical_port(b, pa->logical_port); + sbrec_mac_binding_set_ip(b, ip_string); + sbrec_mac_binding_set_mac(b, mac_string); + next:; + } + + flush_put_arps(); +} + +static void +wait_put_arps(struct controller_ctx *ctx) +{ + if (ctx->ovnsb_idl_txn && n_put_arps) { + poll_immediate_wake(); + } } diff --git a/ovn/controller/pinctrl.h b/ovn/controller/pinctrl.h index fca6b52..945e76b 100644 --- a/ovn/controller/pinctrl.h +++ b/ovn/controller/pinctrl.h @@ -1,3 +1,4 @@ + /* Copyright (c) 2015, 2016 Nicira, Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); @@ -20,13 +21,14 @@ #include "meta-flow.h" +struct lport_index; struct ovsrec_bridge; struct controller_ctx; -/* Interface for OVN main loop. */ void pinctrl_init(void); -void pinctrl_run(const struct ovsrec_bridge *br_int); -void pinctrl_wait(void); +void pinctrl_run(struct controller_ctx *, const struct lport_index *, + const struct ovsrec_bridge *br_int); +void pinctrl_wait(struct controller_ctx *); void pinctrl_destroy(void); #endif /* ovn/pinctrl.h */ diff --git a/ovn/lib/actions.c b/ovn/lib/actions.c index c1583e4..83eabd9 100644 --- a/ovn/lib/actions.c +++ b/ovn/lib/actions.c @@ -184,6 +184,37 @@ add_prerequisite(struct action_context *ctx, const char *prerequisite) ctx->prereqs = expr_combine(EXPR_T_AND, ctx->prereqs, expr); } +static size_t +start_controller_op(struct ofpbuf *ofpacts, enum action_opcode opcode) +{ + size_t ofs = ofpacts->size; + + struct ofpact_controller *oc = ofpact_put_CONTROLLER(ofpacts); + oc->max_len = UINT16_MAX; + oc->reason = OFPR_ACTION; + + struct action_header ah = { .opcode = opcode }; + ofpbuf_put(ofpacts, &ah, sizeof ah); + + return ofs; +} + +static void +finish_controller_op(struct ofpbuf *ofpacts, size_t ofs) +{ + struct ofpact_controller *oc = ofpbuf_at_assert(ofpacts, ofs, sizeof *oc); + ofpacts->header = oc; + oc->userdata_len = ofpacts->size - (ofs + sizeof *oc); + 
ofpact_finish(ofpacts, &oc->ofpact); +} + +static void +put_controller_op(struct ofpbuf *ofpacts, enum action_opcode opcode) +{ + size_t ofs = start_controller_op(ofpacts, opcode); + finish_controller_op(ofpacts, ofs); +} + static void parse_arp_action(struct action_context *ctx) { @@ -211,22 +242,10 @@ parse_arp_action(struct action_context *ctx) ctx->ofpacts = outer_ofpacts; /* controller. */ - size_t oc_offset = ctx->ofpacts->size; - ofpact_put_CONTROLLER(ctx->ofpacts); - - struct action_header ah = { .opcode = ACTION_OPCODE_ARP }; - ofpbuf_put(ctx->ofpacts, &ah, sizeof ah); - + size_t oc_offset = start_controller_op(ctx->ofpacts, ACTION_OPCODE_ARP); ofpacts_put_openflow_actions(inner_ofpacts.data, inner_ofpacts.size, ctx->ofpacts, OFP13_VERSION); - - struct ofpact_controller *oc = ofpbuf_at_assert(ctx->ofpacts, oc_offset, - sizeof *oc); - ctx->ofpacts->header = oc; - oc->max_len = UINT16_MAX; - oc->reason = OFPR_ACTION; - oc->userdata_len = ctx->ofpacts->size - (oc_offset + sizeof *oc); - ofpact_finish(ctx->ofpacts, &oc->ofpact); + finish_controller_op(ctx->ofpacts, oc_offset); /* Restore prerequisites. 
*/ expr_destroy(ctx->prereqs); @@ -237,6 +256,158 @@ parse_arp_action(struct action_context *ctx) ofpbuf_uninit(&inner_ofpacts); } +static bool +action_force_match(struct action_context *ctx, enum lex_type t) +{ + if (lexer_match(ctx->lexer, t)) { + return true; + } else { + struct lex_token token = { .type = t }; + struct ds s = DS_EMPTY_INITIALIZER; + lex_token_format(&token, &s); + + action_syntax_error(ctx, "expecting `%s'", ds_cstr(&s)); + + ds_destroy(&s); + + return false; + } +} + +static bool +action_parse_field(struct action_context *ctx, + int n_bits, struct mf_subfield *sf) +{ + struct expr *prereqs; + char *error; + + error = expr_parse_field(ctx->lexer, n_bits, false, ctx->ap->symtab, sf, + &prereqs); + if (error) { + action_error(ctx, "%s", error); + return false; + } + + ctx->prereqs = expr_combine(EXPR_T_AND, ctx->prereqs, prereqs); + return true; +} + +static void +init_stack(struct ofpact_stack *stack, enum mf_field_id field) +{ + stack->subfield.field = mf_from_id(field); + stack->subfield.ofs = 0; + stack->subfield.n_bits = stack->subfield.field->n_bits; +} + +struct arg { + const struct mf_subfield *src; + enum mf_field_id dst; +}; + +static void +setup_args(struct action_context *ctx, + const struct arg args[], size_t n_args) +{ + /* 1. Save all of the destinations that will be modified. */ + for (const struct arg *a = args; a < &args[n_args]; a++) { + ovs_assert(a->src->n_bits == mf_from_id(a->dst)->n_bits); + if (a->src->field->id != a->dst) { + init_stack(ofpact_put_STACK_PUSH(ctx->ofpacts), a->dst); + } + } + + /* 2. Push the sources, in reverse order. */ + for (size_t i = n_args - 1; i < n_args; i--) { + const struct arg *a = &args[i]; + if (a->src->field->id != a->dst) { + ofpact_put_STACK_PUSH(ctx->ofpacts)->subfield = *a->src; + } + } + + /* 3. Pop the sources into the destinations. 
*/ + for (const struct arg *a = args; a < &args[n_args]; a++) { + if (a->src->field->id != a->dst) { + init_stack(ofpact_put_STACK_POP(ctx->ofpacts), a->dst); + } + } +} + +static void +restore_args(struct action_context *ctx, + const struct arg args[], size_t n_args) +{ + for (size_t i = n_args - 1; i < n_args; i--) { + const struct arg *a = &args[i]; + if (a->src->field->id != a->dst) { + init_stack(ofpact_put_STACK_POP(ctx->ofpacts), a->dst); + } + } +} + +static void +put_load(uint64_t value, enum mf_field_id dst, int ofs, int n_bits, + struct ofpbuf *ofpacts) +{ + struct ofpact_set_field *sf = ofpact_put_SET_FIELD(ofpacts); + sf->field = mf_from_id(dst); + sf->flow_has_vlan = false; + + ovs_be64 n_value = htonll(value); + bitwise_copy(&n_value, 8, 0, &sf->value, sf->field->n_bytes, ofs, n_bits); + bitwise_one(&sf->mask, sf->field->n_bytes, ofs, n_bits); +} + +static void +parse_get_arp_action(struct action_context *ctx) +{ + struct mf_subfield port, ip; + + if (!action_force_match(ctx, LEX_T_LPAREN) + || !action_parse_field(ctx, 0, &port) + || !action_force_match(ctx, LEX_T_COMMA) + || !action_parse_field(ctx, 32, &ip) + || !action_force_match(ctx, LEX_T_RPAREN)) { + return; + } + + const struct arg args[] = { + { &port, MFF_LOG_OUTPORT }, + { &ip, MFF_REG0 }, + }; + setup_args(ctx, args, ARRAY_SIZE(args)); + + put_load(0, MFF_ETH_DST, 0, 48, ctx->ofpacts); + emit_resubmit(ctx, ctx->ap->arp_ptable); + + restore_args(ctx, args, ARRAY_SIZE(args)); +} + +static void +parse_put_arp_action(struct action_context *ctx) +{ + struct mf_subfield port, ip, mac; + + if (!action_force_match(ctx, LEX_T_LPAREN) + || !action_parse_field(ctx, 0, &port) + || !action_force_match(ctx, LEX_T_COMMA) + || !action_parse_field(ctx, 32, &ip) + || !action_force_match(ctx, LEX_T_COMMA) + || !action_parse_field(ctx, 48, &mac) + || !action_force_match(ctx, LEX_T_RPAREN)) { + return; + } + + const struct arg args[] = { + { &port, MFF_LOG_INPORT }, + { &ip, MFF_REG0 }, + { &mac, MFF_ETH_SRC 
} + }; + setup_args(ctx, args, ARRAY_SIZE(args)); + put_controller_op(ctx->ofpacts, ACTION_OPCODE_PUT_ARP); + restore_args(ctx, args, ARRAY_SIZE(args)); +} + static void emit_ct(struct action_context *ctx, bool recirc_next, bool commit) { @@ -295,6 +466,10 @@ parse_action(struct action_context *ctx) emit_ct(ctx, false, true); } else if (lexer_match_id(ctx->lexer, "arp")) { parse_arp_action(ctx); + } else if (lexer_match_id(ctx->lexer, "get_arp")) { + parse_get_arp_action(ctx); + } else if (lexer_match_id(ctx->lexer, "put_arp")) { + parse_put_arp_action(ctx); } else { action_syntax_error(ctx, "expecting action"); } diff --git a/ovn/lib/actions.h b/ovn/lib/actions.h index d8b26d0..b591167 100644 --- a/ovn/lib/actions.h +++ b/ovn/lib/actions.h @@ -34,6 +34,16 @@ enum action_opcode { * The actions, in OpenFlow 1.3 format, follow the action_header. */ ACTION_OPCODE_ARP, + + /* "put_arp(port, ip, mac)" + * + * Arguments are passed through the packet metadata and data, as follows: + * + * MFF_REG0 = ip + * MFF_LOG_INPORT = port + * MFF_ETH_SRC = mac + */ + ACTION_OPCODE_PUT_ARP, }; /* Header. */ @@ -78,6 +88,7 @@ struct action_params { uint8_t first_ptable; /* First OpenFlow table. */ uint8_t cur_ltable; /* 0 <= cur_ltable < n_tables. */ uint8_t output_ptable; /* OpenFlow table for 'output' to resubmit. */ + uint8_t arp_ptable; /* OpenFlow table for 'get_arp' to resubmit. 
*/ }; char *actions_parse(struct lexer *, const struct action_params *, diff --git a/ovn/lib/expr.c b/ovn/lib/expr.c index 44fde84..b2710ab 100644 --- a/ovn/lib/expr.c +++ b/ovn/lib/expr.c @@ -2875,3 +2875,56 @@ expr_parse_assignment(struct lexer *lexer, const struct shash *symtab, *prereqsp = prereqs; return ctx.error; } + +char * +expr_parse_field(struct lexer *lexer, int n_bits, bool rw, + const struct shash *symtab, + struct mf_subfield *sf, struct expr **prereqsp) +{ + struct expr *prereqs = NULL; + struct expr_context ctx; + ctx.lexer = lexer; + ctx.symtab = symtab; + ctx.error = NULL; + ctx.not = false; + + struct expr_field field; + if (!parse_field(&ctx, &field)) { + goto exit; + } + + const struct expr_field orig_field = field; + if (!expand_symbol(&ctx, rw, &field, &prereqs)) { + goto exit; + } + ovs_assert(field.n_bits == orig_field.n_bits); + + if (n_bits != field.n_bits) { + if (n_bits && field.n_bits) { + expr_error(&ctx, "Cannot use %d-bit field %s[%d..%d] " + "where %d-bit field is required.", + orig_field.n_bits, orig_field.symbol->name, + orig_field.ofs, orig_field.ofs + orig_field.n_bits - 1, + n_bits); + } else if (n_bits) { + expr_error(&ctx, "Cannot use string field %s where numeric " + "field is required.", + orig_field.symbol->name); + } else { + expr_error(&ctx, "Cannot use numeric field %s where string " + "field is required.", + orig_field.symbol->name); + } + } + +exit: + if (!ctx.error) { + mf_subfield_from_expr_field(&field, sf); + *prereqsp = prereqs; + } else { + memset(sf, 0, sizeof *sf); + expr_destroy(prereqs); + *prereqsp = NULL; + } + return ctx.error; +} diff --git a/ovn/lib/expr.h b/ovn/lib/expr.h index e4b6f34..de27f09 100644 --- a/ovn/lib/expr.h +++ b/ovn/lib/expr.h @@ -387,5 +387,8 @@ char *expr_parse_assignment(struct lexer *lexer, const struct shash *symtab, unsigned int *portp), const void *aux, struct ofpbuf *ofpacts, struct expr **prereqsp); +char *expr_parse_field(struct lexer *, int n_bits, bool rw, + const struct 
shash *symtab, struct mf_subfield *, + struct expr **prereqsp); #endif /* ovn/expr.h */ diff --git a/ovn/northd/ovn-northd.8.xml b/ovn/northd/ovn-northd.8.xml index 1b2912e..4685143 100644 --- a/ovn/northd/ovn-northd.8.xml +++ b/ovn/northd/ovn-northd.8.xml @@ -379,12 +379,12 @@ next; <li> <p> - ARP reply. These flows reply to ARP requests for the router's own IP - address. For each router port <var>P</var> that owns IP address - <var>A</var> and Ethernet address <var>E</var>, a priority-90 flow - matches <code>inport == <var>P</var> && arp.tpa == - <var>A</var> && arp.op == 1</code> (ARP request) with the - following actions: + Reply to ARP requests. These flows reply to ARP requests for the + router's own IP address. For each router port <var>P</var> that owns + IP address <var>A</var> and Ethernet address <var>E</var>, a + priority-90 flow matches <code>inport == <var>P</var> && + arp.op == 1 && arp.tpa == <var>A</var></code> (ARP request) + with the following actions: </p> <pre> @@ -402,6 +402,13 @@ output; </li> <li> + ARP reply handling. These flows use ARP replies to populate the + logical router's ARP table. A priority-90 flow with match <code>arp.op + == 2</code> has actions <code>put_arp(inport, arp.spa, + arp.sha);</code>. + </li> + + <li> <p> UDP port unreachable. Priority-80 flows generate ICMP port unreachable messages in reply to UDP datagrams directed to the @@ -525,7 +532,10 @@ icmp4 { to the address in <code>ip4.dst</code>. This table implements IP routing, setting <code>reg0</code> to the next-hop IP address (leaving <code>ip4.dst</code>, the packet's final destination, unchanged) and - advances to the next table for ARP resolution. + advances to the next table for ARP resolution. It also sets + <code>reg1</code> to the IP address owned by the selected router port + (which is used later in table 4 as the IP source address for an ARP + request, if needed). </p> <p> @@ -536,7 +546,9 @@ icmp4 { <li> <p> Routing table. 
For each route to IPv4 network <var>N</var> with - netmask <var>M</var>, a logical flow with match <code>ip4.dst == + netmask <var>M</var>, on router port <var>P</var> with IP address + <var>A</var> and Ethernet + address <var>E</var>, a logical flow with match <code>ip4.dst == <var>N</var>/<var>M</var></code>, whose priority is the number of 1-bits in <var>M</var>, has the following actions: </p> @@ -544,6 +556,9 @@ icmp4 { <pre> ip.ttl--; reg0 = <var>G</var>; +reg1 = <var>A</var>; +eth.src = <var>E</var>; +outport = <var>P</var>; next; </pre> @@ -602,64 +617,73 @@ icmp4 { <ul> <li> <p> - Known MAC bindings. For each IP address <var>A</var> whose host is - known to have Ethernet address <var>HE</var> and reside on router - port <var>P</var> with Ethernet address <var>PE</var>, a priority-200 - flow with match <code>reg0 == <var>A</var></code> has the following - actions: + Static MAC bindings. MAC bindings can be known statically based on + data in the <code>OVN_Northbound</code> database. For router ports + connected to logical switches, MAC bindings can be known statically + from the <code>addresses</code> column in the + <code>Logical_Port</code> table. For router ports connected to other + logical routers, MAC bindings can be known statically from the + <code>mac</code> and <code>network</code> columns in the + <code>Logical_Router_Port</code> table. </p> - <pre> -eth.src = <var>PE</var>; -eth.dst = <var>HE</var>; -outport = <var>P</var>; -output; - </pre> + <p> + For each IP address <var>A</var> whose host is known to have Ethernet + address <var>E</var> on router port <var>P</var>, a priority-100 flow + with match <code>outport == <var>P</var> && reg0 == + <var>A</var></code> has actions <code>eth.dst = <var>E</var>; + next;</code>. + </p> + </li> + <li> <p> - MAC bindings can be known statically based on data in the - <code>OVN_Northbound</code> database. 
For router ports connected to - logical switches, MAC bindings can be known statically from the - <code>addresses</code> column in the <code>Logical_Port</code> table. - For router ports connected to other logical routers, MAC bindings can - be known statically from the <code>mac</code> and - <code>network</code> column in the <code>Logical_Router_Port</code> - table. + Dynamic MAC bindings. This flow resolves IP-to-MAC bindings that + have become known dynamically through ARP. (The next table will + issue an ARP request for cases where the binding is not yet known.) + </p> + + <p> + A priority-0 logical flow with match <code>1</code> has actions + <code>get_arp(outport, reg0); next;</code>. </p> </li> + </ul> + <h3>Ingress Table 4: ARP Request</h3> + + <p> + In the common case where the Ethernet destination has been resolved, this + table outputs the packet. Otherwise, it composes and sends an ARP + request. It holds the following flows: + </p> + + <ul> <li> <p> - Unknown MAC bindings. For each non-gateway route to IPv4 network - <var>N</var> with netmask <var>M</var> on router port <var>P</var> - that owns IP address <var>A</var> and Ethernet address <var>E</var>, - a logical flow with match <code>ip4.dst == - <var>N</var>/<var>M</var></code>, whose priority is the number of - 1-bits in <var>M</var>, has the following actions: + Unknown MAC address. A priority-100 flow with match <code>eth.dst == + 00:00:00:00:00:00</code> has the following actions: </p> <pre> +rate_limit(outport, ip4.dst); arp { eth.dst = ff:ff:ff:ff:ff:ff; - eth.src = <var>E</var>; - arp.sha = <var>E</var>; - arp.tha = 00:00:00:00:00:00; - arp.spa = <var>A</var>; - arp.tpa = ip4.dst; + arp.spa = reg1; arp.op = 1; /* ARP request. */ - outport = <var>P</var>; output; }; </pre> <p> - TBD: How to install MAC bindings when an ARP response comes back. - (Implement a "learn" action?) + (Ingress table 2 initialized <code>reg1</code> with the IP address + owned by <code>outport</code>.) 
</p> + </li> - <p> - Not yet implemented. - </p> + <li> + Known MAC address. A priority-0 flow with match <code>1</code> has + actions <code>output;</code>. </li> </ul> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c index e6271cf..d511b4d 100644 --- a/ovn/northd/ovn-northd.c +++ b/ovn/northd/ovn-northd.c @@ -99,7 +99,8 @@ enum ovn_stage { PIPELINE_STAGE(ROUTER, IN, ADMISSION, 0, "lr_in_admission") \ PIPELINE_STAGE(ROUTER, IN, IP_INPUT, 1, "lr_in_ip_input") \ PIPELINE_STAGE(ROUTER, IN, IP_ROUTING, 2, "lr_in_ip_routing") \ - PIPELINE_STAGE(ROUTER, IN, ARP, 3, "lr_in_arp") \ + PIPELINE_STAGE(ROUTER, IN, ARP_RESOLVE, 3, "lr_in_arp_resolve") \ + PIPELINE_STAGE(ROUTER, IN, ARP_REQUEST, 4, "lr_in_arp_request") \ \ /* Logical router egress stages. */ \ PIPELINE_STAGE(ROUTER, OUT, DELIVERY, 0, "lr_out_delivery") @@ -240,6 +241,7 @@ struct ovn_datapath { struct ovs_list list; /* In list of similar records. */ /* Logical router data (digested from nbr). */ + const struct ovn_port *gateway_port; ovs_be32 gateway; /* Logical switch data. */ @@ -389,17 +391,18 @@ join_datapaths(struct northd_context *ctx, struct hmap *datapaths, od->gateway = 0; if (nbr->default_gw) { - ovs_be32 ip, mask; - char *error = ip_parse_masked(nbr->default_gw, &ip, &mask); - if (error || !ip || mask != OVS_BE32_MAX) { - static struct vlog_rate_limit rl - = VLOG_RATE_LIMIT_INIT(5, 1); + ovs_be32 ip; + if (!ip_parse(nbr->default_gw, &ip) || !ip) { + static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1); VLOG_WARN_RL(&rl, "bad 'gateway' %s", nbr->default_gw); - free(error); } else { od->gateway = ip; } } + + /* Set the gateway port to NULL. If there is a gateway, it will get + * filled in as we go through the ports later. */ + od->gateway_port = NULL; } } @@ -618,6 +621,18 @@ join_logical_ports(struct northd_context *ctx, op->mac = mac; op->od = od; + + /* If 'od' has a gateway and 'op' routes to it... 
*/ + if (od->gateway && !((op->network ^ od->gateway) & op->mask)) { + /* ...and if 'op' is a longer match than the current + * choice... */ + const struct ovn_port *gw = od->gateway_port; + int len = gw ? ip_count_cidr_bits(gw->mask) : 0; + if (ip_count_cidr_bits(op->mask) > len) { + /* ...then it's the default gateway port. */ + od->gateway_port = op; + } + } } } } @@ -1338,7 +1353,7 @@ lrport_is_enabled(const struct nbrec_logical_router_port *lrport) } static void -add_route(struct hmap *lflows, struct ovn_datapath *od, +add_route(struct hmap *lflows, const struct ovn_port *op, ovs_be32 network, ovs_be32 mask, ovs_be32 gateway) { char *match = xasprintf("ip4.dst == "IP_FMT"/"IP_FMT, @@ -1351,11 +1366,17 @@ add_route(struct hmap *lflows, struct ovn_datapath *od, } else { ds_put_cstr(&actions, "ip4.dst"); } - ds_put_cstr(&actions, "; next;"); + ds_put_format(&actions, + "; " + "reg1 = "IP_FMT"; " + "eth.src = "ETH_ADDR_FMT"; " + "outport = %s; " + "next;", + IP_ARGS(op->ip), ETH_ADDR_ARGS(op->mac), op->json_key); /* The priority here is calculated to implement longest-prefix-match * routing. */ - ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, + ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_ROUTING, count_1bits(ntohl(mask)), match, ds_cstr(&actions)); ds_destroy(&actions); free(match); @@ -1420,6 +1441,11 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports, "ip4.dst == 0.0.0.0/8", "drop;"); + /* ARP reply handling. Use ARP replies to populate the logical + * router's ARP table. */ + ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_INPUT, 90, "arp.op == 2", + "put_arp(inport, arp.spa, arp.sha);"); + /* Drop Ethernet local broadcast. By definition this traffic should * not be forwarded.*/ ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_INPUT, 50, @@ -1509,23 +1535,24 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports, /* Logical router ingress table 2: IP Routing. 
* * A packet that arrives at this table is an IP packet that should be - * routed to the address in ip4.dst. This table sets reg0 to the next-hop - * IP address (leaving ip4.dst, the packet’s final destination, unchanged) - * and advances to the next table for ARP resolution. */ + * routed to the address in ip4.dst. This table sets outport to the correct + * output port, eth.src to the output port's MAC address, and reg0 to the + * next-hop IP address (leaving ip4.dst, the packet’s final destination, + * unchanged), and advances to the next table for ARP resolution. */ HMAP_FOR_EACH (op, key_node, ports) { if (!op->nbr) { continue; } - add_route(lflows, op->od, op->network, op->mask, 0); + add_route(lflows, op, op->network, op->mask, 0); } HMAP_FOR_EACH (od, key_node, datapaths) { if (!od->nbr) { continue; } - if (od->gateway) { - add_route(lflows, od, 0, 0, od->gateway); + if (od->gateway && od->gateway_port) { + add_route(lflows, od->gateway_port, 0, 0, od->gateway); } } /* XXX destination unreachable */ @@ -1568,16 +1595,15 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports, continue; } - char *match = xasprintf("reg0 == "IP_FMT, IP_ARGS(ip)); - char *actions = xasprintf("eth.src = "ETH_ADDR_FMT"; " - "eth.dst = "ETH_ADDR_FMT"; " - "outport = %s; " - "output;", - ETH_ADDR_ARGS(peer->mac), - ETH_ADDR_ARGS(ea), - peer->json_key); + char *match = xasprintf( + "outport == %s && reg0 == "IP_FMT, + peer->json_key, IP_ARGS(ip)); + char *actions = xasprintf("eth.dst = "ETH_ADDR_FMT"; " + "next;", + ETH_ADDR_ARGS(ea)); ovn_lflow_add(lflows, peer->od, - S_ROUTER_IN_ARP, 200, match, actions); + S_ROUTER_IN_ARP_RESOLVE, + 100, match, actions); free(actions); free(match); break; @@ -1586,6 +1612,35 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports, } } } + HMAP_FOR_EACH (od, key_node, datapaths) { + if (!od->nbr) { + continue; + } + + ovn_lflow_add(lflows, od, S_ROUTER_IN_ARP_RESOLVE, 0, "1", + "get_arp(outport, reg0); next;"); + } + + /* Local 
router ingress table 4: ARP request. + * + * In the common case where the Ethernet destination has been resolved, + * this table outputs the packet (priority 100). Otherwise, it composes + * and sends an ARP request (priority 0). */ + HMAP_FOR_EACH (od, key_node, datapaths) { + if (!od->nbr) { + continue; + } + + ovn_lflow_add(lflows, od, S_ROUTER_IN_ARP_REQUEST, 100, + "eth.dst == 00:00:00:00:00:00", + "arp { " + "eth.dst = ff:ff:ff:ff:ff:ff; " + "arp.spa = reg1; " + "arp.op = 1; " /* ARP request */ + "output; " + "};"); + ovn_lflow_add(lflows, od, S_ROUTER_IN_ARP_REQUEST, 0, "1", "output;"); + } /* Logical router egress table 0: Delivery (priority 100). * diff --git a/ovn/ovn-architecture.7.xml b/ovn/ovn-architecture.7.xml index d539db8..a25dad6 100644 --- a/ovn/ovn-architecture.7.xml +++ b/ovn/ovn-architecture.7.xml @@ -733,32 +733,62 @@ <code>ovn-controller</code>'s job is to translate them into equivalent OpenFlow (in particular it translates the table numbers: <code>Logical_Flow</code> tables 0 through 15 become OpenFlow tables 16 - through 31). For a given packet, the logical ingress pipeline - eventually executes zero or more <code>output</code> actions: + through 31). </p> - <ul> - <li> - If the pipeline executes no <code>output</code> actions at all, the - packet is effectively dropped. - </li> - - <li> - Most commonly, the pipeline executes one <code>output</code> action, - which <code>ovn-controller</code> implements by resubmitting the - packet to table 32. - </li> - - <li> - If the pipeline can execute more than one <code>output</code> action, - then each one is separately resubmitted to table 32. This can be - used to send multiple copies of the packet to multiple ports. (If - the packet was not modified between the <code>output</code> actions, - and some of the copies are destined to the same hypervisor, then - using a logical multicast output port would save bandwidth between - hypervisors.) 
- </li> - </ul> + <p> + Most OVN actions have fairly obvious implementations in OpenFlow (with + OVS extensions), e.g. <code>next;</code> is implemented as + <code>resubmit</code>, <code><var>field</var> = + <var>constant</var>;</code> as <code>set_field</code>. A few are worth + describing in more detail: + </p> + + <dl> + <dt><code>output;</code></dt> + <dd> + Implemented by resubmitting the packet to table 32. If the pipeline + executes more than one <code>output</code> action, then each one is + separately resubmitted to table 32. This can be used to send + multiple copies of the packet to multiple ports. (If the packet was + not modified between the <code>output</code> actions, and some of the + copies are destined to the same hypervisor, then using a logical + multicast output port would save bandwidth between hypervisors.) + </dd> + + <dt><code>get_arp(<var>P</var>, <var>A</var>);</code></dt> + <dd> + <p> + Implemented by storing arguments into OpenFlow fields, then + resubmitting to table 65, which <code>ovn-controller</code> + populates with flows generated from the <code>MAC_Binding</code> + table in the OVN Southbound database. If there is a match in table + 65, then its actions store the bound MAC in the Ethernet + destination address field. + </p> + + <p> + (The OpenFlow actions save and restore the OpenFlow fields used for + the arguments, so that the OVN actions do not have to be aware of + this temporary use.) + </p> + </dd> + + <dt><code>put_arp(<var>P</var>, <var>A</var>, <var>E</var>);</code></dt> + <dd> + <p> + Implemented by storing the arguments into OpenFlow fields, then + outputting a packet to <code>ovn-controller</code>, which updates + the <code>MAC_Binding</code> table. + </p> + + <p> + (The OpenFlow actions save and restore the OpenFlow fields used for + the arguments, so that the OVN actions do not have to be aware of + this temporary use.) 
+ </p> + </dd> + </dl> </li> <li> diff --git a/ovn/ovn-sb.ovsschema b/ovn/ovn-sb.ovsschema index a9a91e5..ead733b 100644 --- a/ovn/ovn-sb.ovsschema +++ b/ovn/ovn-sb.ovsschema @@ -1,7 +1,7 @@ { "name": "OVN_Southbound", - "version": "1.0.0", - "cksum": "1392129391 5060", + "version": "1.1.0", + "cksum": "1223981720 5320", "tables": { "Chassis": { "columns": { @@ -99,6 +99,11 @@ "min": 0, "max": "unlimited"}}}, "indexes": [["datapath", "tunnel_key"], ["logical_port"]], - "isRoot": true} - } -} + "isRoot": true}, + "MAC_Binding": { + "columns": { + "logical_port": {"type": "string"}, + "ip": {"type": "string"}, + "mac": {"type": "string"}}, + "indexes": [["logical_port", "ip"]], + "isRoot": true}}} diff --git a/ovn/ovn-sb.xml b/ovn/ovn-sb.xml index dd248b4..2d433d6 100644 --- a/ovn/ovn-sb.xml +++ b/ovn/ovn-sb.xml @@ -17,7 +17,7 @@ <h2>Database Structure</h2> <p> - The OVN Southbound database contains three classes of data with + The OVN Southbound database contains classes of data with different properties, as described in the sections below. </p> @@ -77,17 +77,17 @@ data. </p> - <h3>Bindings data</h3> + <h3>Logical-physical bindings</h3> <p> - Bindings data link logical and physical components. They show the current + These tables link logical and physical components. They show the current placement of logical components (such as VMs and VIFs) onto chassis, and map logical entities to the values that represent them in tunnel encapsulations. </p> <p> - Bindings change frequently, at least every time a VM powers up or down + These tables change frequently, at least every time a VM powers up or down or migrates, and especially quickly in a container environment. The amount of data per VM (or VIF) is small. </p> @@ -103,6 +103,17 @@ contain binding data. 
</p> + <h3>MAC bindings</h3> + + <p> + The <ref table="MAC_Binding"/> table tracks the bindings from IP addresses + to Ethernet addresses that are dynamically discovered using ARP (for IPv4) + and neighbor discovery (for IPv6). Usually, IP-to-MAC bindings for virtual + machines are statically populated into the <ref table="Port_Binding"/> + table, so <ref table="MAC_Binding"/> is primarily used to discover bindings + on physical networks. + </p> + <h2>Common Columns</h2> <p> @@ -942,6 +953,43 @@ <p><b>Prerequisite:</b> <code>ip4</code></p> </dd> + <dt><code>get_arp(<var>P</var>, <var>A</var>);</code></dt> + + <dd> + <p> + <b>Parameters</b>: logical port string field <var>P</var>, 32-bit + IP address field <var>A</var>. + </p> + + <p> + Looks up <var>A</var> in <var>P</var>'s ARP table. If an entry is + found, stores its Ethernet address in <code>eth.dst</code>, + otherwise stores <code>00:00:00:00:00:00</code> in + <code>eth.dst</code>. + </p> + + <p><b>Example:</b> <code>get_arp(outport, ip4.dst);</code></p> + </dd> + + <dt> + <code>put_arp(<var>P</var>, <var>A</var>, <var>E</var>);</code> + </dt> + + <dd> + <p> + <b>Parameters</b>: logical port string field <var>P</var>, 32-bit + IP address field <var>A</var>, 48-bit Ethernet address field + <var>E</var>. + </p> + + <p> + Adds or updates the entry for IP address <var>A</var> in logical + port <var>P</var>'s ARP table, setting its Ethernet address to + <var>E</var>. + </p> + + <p><b>Example:</b> <code>put_arp(inport, arp.spa, arp.sha);</code></p> + </dd> </dl> <p> @@ -1351,4 +1399,85 @@ tcp.flags = RST; </column> </group> </table> + + <table name="MAC_Binding" title="IP to MAC bindings"> + <p> + Each row in this table specifies a binding from an IP address to an + Ethernet address that has been discovered through ARP (for IPv4) or + neighbor discovery (for IPv6). 
This table is primarily used to discover + bindings on physical networks, because IP-to-MAC bindings for virtual + machines are usually populated statically into the <ref + table="Port_Binding"/> table. + </p> + + <p> + This table expresses a functional relationship: <ref + table="MAC_Binding"/>(<ref column="logical_port"/>, <ref column="ip"/>) = + <ref column="mac"/>. + </p> + + <p> + In outline, the lifetime of a logical router's MAC binding looks like + this: + </p> + + <ol> + <li> + On hypervisor 1, a logical router determines that a packet should be + forwarded to IP address <var>A</var> on one of its router ports. It + uses its logical flow table to determine that <var>A</var> lacks a + static IP-to-MAC binding and the <code>get_arp</code> action to + determine that it lacks a dynamic IP-to-MAC binding. + </li> + + <li> + Using an OVN logical <code>arp</code> action, the logical router + generates and sends a broadcast ARP request to the router port. It + drops the IP packet. + </li> + + <li> + The logical switch attached to the router port delivers the ARP request + to all of its ports. (It might make sense to deliver it only to ports + that have no static IP-to-MAC bindings, but this could also be + surprising behavior.) + </li> + + <li> + A host or VM on hypervisor 2 (which might be the same as hypervisor 1) + attached to the logical switch owns the IP address in question. It + composes an ARP reply and unicasts it to the logical router port's + Ethernet address. + </li> + + <li> + The logical switch delivers the ARP reply to the logical router port. + </li> + + <li> + The logical router flow table executes a <code>put_arp</code> action. + To record the IP-to-MAC binding, <code>ovn-controller</code> adds a row + to the <ref table="MAC_Binding"/> table. + </li> + + <li> + On hypervisor 1, <code>ovn-controller</code> receives the updated <ref + table="MAC_Binding"/> table from the OVN southbound database. 
The next + packet destined to <var>A</var> through the logical router is sent + directly to the bound Ethernet address. + </li> + </ol> + + <column name="logical_port"> + The logical port on which the binding was discovered. + </column> + + <column name="ip"> + The bound IP address. + </column> + + <column name="mac"> + The Ethernet address to which the IP is bound. + </column> + </table> </database> diff --git a/ovn/utilities/ovn-sbctl.c b/ovn/utilities/ovn-sbctl.c index b9e3c10..0f402cd 100644 --- a/ovn/utilities/ovn-sbctl.c +++ b/ovn/utilities/ovn-sbctl.c @@ -771,6 +771,10 @@ static const struct ctl_table_class tables[] = { {{&sbrec_table_port_binding, &sbrec_port_binding_col_logical_port, NULL}, {NULL, NULL, NULL}}}, + {&sbrec_table_mac_binding, + {{&sbrec_table_mac_binding, &sbrec_mac_binding_col_logical_port, NULL}, + {NULL, NULL, NULL}}}, + {NULL, {{NULL, NULL, NULL}, {NULL, NULL, NULL}}} }; diff --git a/tests/ovn.at b/tests/ovn.at index 5e49767..e2dde69 100644 --- a/tests/ovn.at +++ b/tests/ovn.at @@ -510,6 +510,21 @@ ct_commit; => actions=ct(commit,zone=NXM_NX_REG5[0..15]), prereqs=ip # arp arp { eth.dst = ff:ff:ff:ff:ff:ff; output; }; => actions=controller(userdata=00.00.00.00.00.00.00.00.00.19.00.10.80.00.06.06.ff.ff.ff.ff.ff.ff.00.00.ff.ff.00.10.00.00.23.20.00.0e.ff.f8.40.00.00.00), prereqs=ip4 +# get_arp +get_arp(outport, ip4.dst); => actions=push:NXM_NX_REG0[],push:NXM_OF_IP_DST[],pop:NXM_NX_REG0[],set_field:00:00:00:00:00:00->eth_dst,resubmit(,65),pop:NXM_NX_REG0[], prereqs=eth.type == 0x800 +get_arp(inport, reg0); => actions=push:NXM_NX_REG7[],push:NXM_NX_REG0[],push:OXM_OF_PKT_REG0[32..63],push:NXM_NX_REG6[],pop:NXM_NX_REG7[],pop:NXM_NX_REG0[],set_field:00:00:00:00:00:00->eth_dst,resubmit(,65),pop:NXM_NX_REG0[],pop:NXM_NX_REG7[], prereqs=1 +get_arp; => Syntax error at `;' expecting `('. +get_arp(); => Syntax error at `)' expecting field name. +get_arp(inport); => Syntax error at `)' expecting `,'. 
+get_arp(inport ip4.dst); => Syntax error at `ip4.dst' expecting `,'. +get_arp(inport, ip4.dst; => Syntax error at `;' expecting `)'. +get_arp(inport, eth.dst); => Cannot use 48-bit field eth.dst[0..47] where 32-bit field is required. +get_arp(inport, outport); => Cannot use string field outport where numeric field is required. +get_arp(reg0, ip4.dst); => Cannot use numeric field reg0 where string field is required. + +# put_arp +put_arp(inport, arp.spa, arp.sha); => actions=push:NXM_NX_REG1[],push:NXM_OF_ETH_SRC[],push:NXM_NX_ARP_SHA[],push:NXM_OF_ARP_SPA[],pop:NXM_NX_REG1[],pop:NXM_OF_ETH_SRC[],push:NXM_NX_REG0[],set_field:0xbd9c9810->reg0,controller(reason=packet_out),pop:NXM_NX_REG0[],pop:NXM_OF_ETH_SRC[],pop:NXM_NX_REG1[], prereqs=eth.type == 0x806 && eth.type == 0x806 + # Contradictionary prerequisites (allowed but not useful): ip4.src = ip6.src[0..31]; => actions=move:NXM_NX_IPV6_SRC[0..31]->NXM_OF_IP_SRC[], prereqs=eth.type == 0x800 && eth.type == 0x86dd ip4.src <-> ip6.src[0..31]; => actions=push:NXM_NX_IPV6_SRC[0..31],push:NXM_OF_IP_SRC[],pop:NXM_NX_IPV6_SRC[0..31],pop:NXM_OF_IP_SRC[], prereqs=eth.type == 0x800 && eth.type == 0x86dd @@ -1078,9 +1093,13 @@ for i in 1 2 3; do ovn-nbctl lswitch-add ls$i for j in 1 2 3; do for k in 1 2 3; do + # Add "unknown" to MAC addresses for lp?11, so packets for + # MAC-IP bindings discovered via ARP later have somewhere to go. + if test $j$k = 11; then unknown=unknown; else unknown=; fi + ovn-nbctl \ -- lport-add ls$i lp$i$j$k \ - -- lport-set-addresses lp$i$j$k "f0:00:00:00:0$i:$j$k 192.168.$i$j.$k" + -- lport-set-addresses lp$i$j$k "f0:00:00:00:0$i:$j$k 192.168.$i$j.$k" $unknown done done done @@ -1170,7 +1189,7 @@ sleep 1 # content has Ethernet destination DST and source SRC (each exactly 12 hex # digits) and Ethernet type ETHTYPE (4 hex digits). The OUTPORTs (zero or # more) list the VIFs on which the packet should be received. INPORT and the -# OUTPORTs are specified as lport numbers, e.g. 11 for vif11. 
+# OUTPORTs are specified as lport numbers, e.g. 123 for vif123. trim_zeros() { sed 's/\(00\)\{1,\}$//' } @@ -1206,6 +1225,8 @@ test_ip() { } as hv1 ovs-vsctl --columns=name,ofport list interface +as hv1 ovn-sbctl list port_binding +as hv1 ovn-sbctl list datapath_binding as hv1 ovn-sbctl dump-flows as hv1 ovs-ofctl dump-flows br-int @@ -1244,6 +1265,48 @@ for is in 1 2 3; do done done +# 3. Send an IP packet from every logical port to every other subnet, +# to an IP address that does not have a static IP-MAC binding. +# This should generate a broadcast ARP request for the destination +# IP address in the destination subnet. +for is in 1 2 3; do + for js in 1 2 3; do + for ks in 1 2 3; do + s=$is$js$ks + smac=f00000000$s + sip=`ip_to_hex 192 168 $is$js $ks` + for id in 1 2 3; do + for jd in 1 2 3; do + if test $is$js = $id$jd; then + continue + fi + + # Send the packet. + dmac=00000000ff$is$js + # Calculate a 4th octet for the destination that is + # unique per $s, avoids the .1 .2 .3 and .254 IP addresses + # that have static MAC bindings, and fits in the range + # 0-255. + o4=`expr $is '*' 9 + $js '*' 3 + $ks + 10` + dip=`ip_to_hex 192 168 $id$jd $o4` + test_ip $s $smac $dmac $sip $dip + + # Every LP on the destination subnet's lswitch should + # receive the ARP request. + lrmac=00000000ff$id$jd + lrip=`ip_to_hex 192 168 $id$jd 254` + arp=ffffffffffff${lrmac}08060001080006040001${lrmac}${lrip}000000000000${dip} + for jd2 in 1 2 3; do + for kd in 1 2 3; do + echo $arp | trim_zeros >> $id$jd2$kd.expected + done + done + done + done + done + done +done + # test_arp INPORT SHA SPA TPA [REPLY_HA] # # Causes a packet to be received on INPORT. The packet is an ARP @@ -1267,7 +1330,7 @@ test_arp() { local j k for j in 1 2 3; do for k in 1 2 3; do - # 192.168.33.254 is configured to the lswtich patch port for lrp33, + # 192.168.33.254 is configured to the lswitch patch port for lrp33, # so no ARP flooding expected for it. 
if test $i$j$k != $inport && test $tpa != `ip_to_hex 192 168 33 254`; then echo $request >> $i$j$k.expected @@ -1285,14 +1348,14 @@ test_arp() { # Test router replies to ARP requests from all source ports: # -# 3. Router replies to query for its MAC address from port's own IP address. +# 4. Router replies to query for its MAC address from port's own IP address. # -# 4. Router replies to query for its MAC address from any random IP address +# 5. Router replies to query for its MAC address from any random IP address # in its subnet. # -# 5. Router replies to query for its MAC address from another subnet. +# 6. Router replies to query for its MAC address from another subnet. # -# 6. No reply to query for IP address other than router IP. +# 7. No reply to query for IP address other than router IP. for i in 1 2 3; do for j in 1 2 3; do for k in 1 2 3; do @@ -1301,17 +1364,114 @@ for i in 1 2 3; do rip=`ip_to_hex 192 168 $i$j 254` # Router IP rmac=00000000ff$i$j # Router MAC otherip=`ip_to_hex 192 168 $i$j 55` # Some other IP in subnet - test_arp $i$j$k $smac $sip $rip $rmac #3 - test_arp $i$j$k $smac $otherip $rip $rmac #4 - test_arp $i$j$k $smac 0a123456 $rip $rmac #5 - test_arp $i$j$k $smac $sip $otherip #6 + test_arp $i$j$k $smac $sip $rip $rmac #4 + test_arp $i$j$k $smac $otherip $rip $rmac #5 + test_arp $i$j$k $smac 0a123456 $rip $rmac #6 + test_arp $i$j$k $smac $sip $otherip #7 done done done + # Allow some time for packet forwarding. # XXX This can be improved. sleep 1 +# 8. Generate an ARP reply for each of the IP addresses ARPed for +# earlier as #3. +# +# Here, the $s is the VIF that originated the ARP request and $d is +# the VIF that sends the ARP reply, which is somewhat backward but +# it means that $s and $d are the same as #3. 
+: > mac_bindings.expected +for is in 1 2 3; do + for js in 1 2 3; do + for ks in 1 2 3; do + s=$is$js$ks + for id in 1 2 3; do + for jd in 1 2 3; do + if test $is$js = $id$jd; then + continue + fi + + kd=1 + d=$id$jd$kd + + o4=`expr $is '*' 9 + $js '*' 3 + $ks + 10` + host_ip=`ip_to_hex 192 168 $id$jd $o4` + host_mac=8000000000$o4 + + lrmac=00000000ff$id$jd + lrip=`ip_to_hex 192 168 $id$jd 254` + + arp=${lrmac}${host_mac}08060001080006040002${host_mac}${host_ip}${lrmac}${lrip} + + echo + echo + echo + hv=hv`vif_to_hv $d` + as $hv ovs-appctl netdev-dummy/receive vif$d $arp + #as $hv ovs-appctl ofproto/trace br-int in_port=$d $arp + #as $hv ovs-ofctl dump-flows br-int table=19 + + host_ip_pretty=192.168.$id$jd.$o4 + host_mac_pretty=80:00:00:00:00:$o4 + echo lrp$id$jd,$host_ip_pretty,$host_mac_pretty >> mac_bindings.expected + done + done + done + done +done + +# Allow some time for packet forwarding. +# XXX This can be improved. +sleep 1 + +# 9. Send an IP packet from every logical port to every other subnet. These +# are the same packets already sent as #3, but now the destinations' IP-MAC +# bindings have been discovered via ARP, so instead of provoking an ARP +# request, these packets now get routed to their destinations (which don't +# have static MAC bindings, so they go to the port we've designated as +# accepting "unknown" MACs.) +for is in 1 2 3; do + for js in 1 2 3; do + for ks in 1 2 3; do + s=$is$js$ks + smac=f00000000$s + sip=`ip_to_hex 192 168 $is$js $ks` + for id in 1 2 3; do + for jd in 1 2 3; do + if test $is$js = $id$jd; then + continue + fi + + # Send the packet. + dmac=00000000ff$is$js + # Calculate a 4th octet for the destination that is + # unique per $s, avoids the .1 .2 .3 and .254 IP addresses + # that have static MAC bindings, and fits in the range + # 0-255. + o4=`expr $is '*' 9 + $js '*' 3 + $ks + 10` + dip=`ip_to_hex 192 168 $id$jd $o4` + test_ip $s $smac $dmac $sip $dip + + # Expect the packet egress. 
+ host_mac=8000000000$o4 + outport=${id}11 + out_lrp=$id$jd + echo ${host_mac}00000000ff${out_lrp}08004500001c00000000"3f1101"00${sip}${dip}0035111100080000 | trim_zeros >> $outport.expected + done + done + done + done +done + +# Allow some time for packet forwarding. +# XXX This can be improved. +sleep 1 + +ovn-sbctl -f csv -d bare --no-heading \ + -- --columns=logical_port,ip,mac list mac_binding > mac_bindings + # Now check the packets actually received against the ones expected. for i in 1 2 3; do for j in 1 2 3; do @@ -1325,4 +1485,9 @@ for i in 1 2 3; do done done done + +# Check the MAC bindings against those expected. +AT_CHECK_UNQUOTED([sort < mac_bindings], [0], [`sort < mac_bindings.expected` +]) + AT_CLEANUP diff --git a/tests/test-ovn.c b/tests/test-ovn.c index 4942975..cb4c1d1 100644 --- a/tests/test-ovn.c +++ b/tests/test-ovn.c @@ -1249,6 +1249,7 @@ test_parse_actions(struct ovs_cmdl_context *ctx OVS_UNUSED) .first_ptable = 16, .cur_ltable = 10, .output_ptable = 64, + .arp_ptable = 65, }; error = actions_parse_string(ds_cstr(&input), &ap, &ofpacts, &prereqs); if (!error) { -- 2.1.3
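
Note on the gateway-port selection that this patch adds to join_logical_ports(): the rule is "among the router ports whose subnet contains the default gateway, pick the one with the longest netmask". The following is only a standalone sketch of that rule, not code from the patch. The names port_sketch, prefix_len, and choose_gateway_port are hypothetical, and addresses are assumed to be in host byte order for readability, whereas the patch works on ovs_be32 values with ip_count_cidr_bits().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of 'struct ovn_port' used by the
 * selection rule.  Host byte order is an assumption made for this sketch. */
struct port_sketch {
    uint32_t network;           /* Subnet address, e.g. 192.168.1.0. */
    uint32_t mask;              /* Netmask, e.g. 255.255.255.0. */
};

/* Returns the number of 1-bits in 'mask', which for a CIDR netmask is its
 * prefix length. */
int
prefix_len(uint32_t mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1) {
        n++;
    }
    return n;
}

/* Returns the port in 'ports' whose subnet contains 'gateway' and whose
 * netmask is longest, or NULL if 'gateway' is 0 or no port qualifies.
 *
 * As in the patch, a candidate replaces the current choice only if its
 * prefix is strictly longer, so a port with an all-zeros mask is never
 * chosen and ties go to the port seen first. */
const struct port_sketch *
choose_gateway_port(const struct port_sketch *ports, size_t n_ports,
                    uint32_t gateway)
{
    const struct port_sketch *best = NULL;
    size_t i;

    assert(ports != NULL || n_ports == 0);
    if (!gateway) {
        return NULL;
    }

    for (i = 0; i < n_ports; i++) {
        const struct port_sketch *op = &ports[i];
        int len;

        /* Skip ports whose subnet does not contain the gateway. */
        if ((op->network ^ gateway) & op->mask) {
            continue;
        }

        len = best ? prefix_len(best->mask) : 0;
        if (prefix_len(op->mask) > len) {
            best = op;
        }
    }
    return best;
}
```

With ports on 192.168.0.0/16 and 192.168.1.0/24, a gateway of 192.168.1.1 would select the /24 port under this sketch, which is the port whose reg1, eth.src, and outport values the default route then installs.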