On Tue, Mar 12, 2013 at 07:44:10PM -0700, Ethan Jackson wrote: > Traditionally, Open vSwitch has used a variant of 802.1ag "CFM" for > interface liveness detection. This has served us well until now, > but has several serious drawbacks which have steadily become more > inconvenient. First, the 802.1ag standard does not implement > several useful features forcing us to (optionally) break > compatibility. Second, 802.1.ag is not particularly popular > outside of carrier grade networking equipment. Third, 802.1ag is > simply quite awkward. > > In an effort to solve the aforementioned problems, this patch > implements BFD which is ubiquitous, well designed, straight > forward, and implements required features in a standard way. The > initial cut of the protocol focuses on getting the basics of the > specification correct, leaving performance optimizations, and > advanced features as future work. The protocol should be > considered experimental pending future testing. > > Signed-off-by: Ethan Jackson <et...@nicira.com>
bfd.c ----- > +#include <config.h> > +#include <bfd.h> We usually use double quotes ("bfd.h") for our own headers. > +/* XXX Finish Experimnetal BFD support. What part is experimental? (The protocol or the implementation?) > + * - CFM random VLAN option equivalent. Instead of vlan, randomly modulate > the > + * UDP source port to get the required entropy. Ugh. Do we really need random anything? I've never liked that "feature". > +#define BFD_VERSION 1 I think the following is microseconds, could you add a comment /* microseconds */, put _USEC in the name, etc.? > +#define OVS_BFD_MIN_RX 200000 I'd be half-inclined to write the states as 0 << 6, 1 << 6, 2 << 6, 3 << 6. But it doesn't really matter: > +enum state { > + STATE_ADMIN_DOWN = 0, > + STATE_DOWN = 1, > + STATE_INIT = 2, > + STATE_UP = 3 > +}; ... > +/* RFC 5880 Section 4.1 I think you left out the line that gives the first digits of the two-digit numbers here. It's harder to read without them: > + * 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 > + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + * |Vers | Diag |Sta|P|F|C|A|D|M| Detect Mult | Length | > + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + * | My Discriminator | > + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + * | Your Discriminator | > + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + * | Desired Min TX Interval | > + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + * | Required Min RX Interval | > + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + * | Required Min Echo RX Interval | > + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ */ The BFD timers are 32-bit counts of microseconds. The timers in struct bfd are 64-bit and appear to be milliseconds, but I'm not certain that all of them have the same unit. Comments on units would be welcome. Do they need to be 64-bit? ... > +/* Returns a 'smap' of key value pairs representing the status of 'bfd' > + * intended for the OVS database. */ > +void > +bfd_get_status(const struct bfd *bfd, struct smap *smap) > +{ > + smap_add(smap, "forwarding", bfd_forwarding(bfd) ? "true" : "false"); > + smap_add(smap, "state", bfd_state_str(bfd->state)); > + smap_add(smap, "diagnostic", bfd_diag_str(bfd->diag)); > + > + if (bfd->state != STATE_DOWN) { Spaces are OK in the OVSDB protocol but I don't think we use them much. Usually we use hyphen or underscores (probably should standardize that in fact): > + smap_add(smap, "remote state", bfd_state_str(bfd->rmt_state)); > + smap_add(smap, "remote diagnostic", bfd_diag_str(bfd->rmt_diag)); > + } > +} The interface for bfd_configure() is unusual. It does seem to be appropriate. bfd_configure() checks for a key "enable" in the configuration that is passed in, but I don't see any documentation for that key. This might be a personal problem, but I find the following: > + udp_src++; > + udp_src = udp_src < 49152 ? 49152 : udp_src; > + bfd->udp_src = udp_src; a little harder to verify as correct than: bfd->udp_src = (udp_src++ & 16383) + 49152; The following, and similar, bugs me a bit because it can call smap_get_int() twice due to the macro expansion: > + min_tx = MAX(smap_get_int(cfg, "min_tx", 100), 100); > +void > +bfd_run(struct bfd *bfd) > +{ > + if (bfd->state > STATE_DOWN && time_msec() > bfd->detect_time) { The test above should probably be >= bfd->detect_time or we should wait until detect_time + 1 in bfd_wait(), otherwise we could wake up 1 ms too early. > + bfd_set_state(bfd, STATE_DOWN, DIAG_EXPIRED); > + } > + > + if (bfd->min_tx != bfd->cfg_min_tx || bfd->min_rx != bfd->cfg_min_rx) { > + bfd_poll(bfd); > + } > +} ... In bfd_put_packet(), it might be wise to start out with ofpbuf_reserve(p, 2); to properly align what comes afterward. > + eth = ofpbuf_put_uninit(p, sizeof *eth); > + memcpy(eth->eth_dst, eth_addr_broadcast, sizeof eth->eth_dst); > + memcpy(eth->eth_src, eth_src, ETH_ADDR_LEN); It would be nice to use sizeof in both cases or neither, above. > + eth->eth_type = htons(ETH_TYPE_IP); > + > + ip = ofpbuf_put_zeros(p, sizeof *ip); > + ip->ip_ihl_ver = IP_IHL_VER(5, 4); > + ip->ip_tot_len = htons(sizeof *ip + sizeof *udp + sizeof(struct msg)); > + ip->ip_ttl = 1; I understand why we're using ip_ttl == 1, but I wonder whether you considered following RFC 5082 and therefore using (and checking for) ip_ttl == 255. > + ip->ip_proto = IPPROTO_UDP; > + ip->ip_src = htonl(0xA9FE0100); /* 169.254.1.0 Link Local. */ > + ip->ip_dst = htonl(0xA9FE0101); /* 169.254.1.1 Link Local. */ > + ip->ip_csum = csum(ip, sizeof *ip); > + > + udp = ofpbuf_put_zeros(p, sizeof *udp); > + udp->udp_src = htons(bfd->udp_src); > + udp->udp_dst = htons(BFD_DEST_PORT); > + udp->udp_len = htons(sizeof *udp + sizeof(struct msg)); I'd be inclined to write "sizeof *msg" instead of "sizeof(struct msg)" above. > + msg = ofpbuf_put_uninit(p, sizeof(struct msg)); Here too. It looks like bfd_process_packet() expects the caller to have pulled off everything except the L7 payload. It would be nice to put this in a comment. I don't see a check for this requirement from 6.8.6 (I think it only requires comparing p->size to msg->length): If the Length field is greater than the payload of the encapsulating protocol, the packet MUST be discarded. The pseudocode at the end of 6.8.6 only mentions updating bfd.LocalDiag in a couple of cases, but the implementation appears to update LocalDiag in every case where it changes the session state. I don't see anything in bfd_process_packet() that corresponds to: If the packet was not discarded, it has been received for purposes of the Detection Time expiration rules in section 6.8.4. but maybe I just missed it. Perhaps the log message in bfd_poll() should include bfd->name. In bfd_min_tx(), I'd be inclined to remove the outer () here: > + return (bfd->state == STATE_UP ? bfd->min_tx : MAX(bfd->min_tx, 1000)); In bfd_tx_interval(), I'm not wild about putting MAX around a function call here either. In bfd_flag_str(), please use ovs_strlcpy() instead of strncpy(). There is occasionally a good reason to use strncpy(). However: * Using strncpy() into a large buffer can be very inefficient. strncpy() always writes to every byte in the destination buffer, which can waste a lot of time if the destination buffer is much longer than the source string. * If the source string is longer than the size of the destination buffer, then strncpy() doesn't write a terminating null. So a call to strncpy() must be followed by explicitly writing a null terminator at the end of the destination buffer in most cases. bfd_flag_str() could return a const string. > +static void > +log_msg(enum vlog_level level, const struct msg *p, const char *message, > + const struct bfd *bfd) > +{ > + struct ds ds = DS_EMPTY_INITIALIZER; > + > + if (level != VLL_DBG && vlog_should_drop(THIS_MODULE, level, &rl)) { If we really don't want to rate-limit debug messages, then we should at least check for VLOG_IS_DBG_ENABLED() in that case, so that in the common case we don't spend a lot of time composing a message that will just be discarded later. It might be a good idea to rate-limit debug messages anyway, because these days we have an ovs-appctl command to disable rate limits on particular modules. Would there be any value in showing the full times in microseconds? Occasionally I find that every detail is valuable in debugging. The implementation of generate_discriminator() could be expensive if we have thousands of BFD instances (which I know we will have). Would it be worthwhile to use a global discriminator-based hmap instead of a global list to hold the BFD instances? > +static long long int > +msec_diff(long long int earlier, long long int later) { We usually put the { on a line by itself. > + later -= earlier; > + return later < 0 ? 0 : later; > +} The following loses a bit of information by truncating to 0. Did you consider a format like "now%+lld ms" so that you get output like "now+5 ms" or "now-10 ms"? > + ds_put_format(ds, "\tDetect Time: %lldms from now\n", > + msec_diff(time_msec(), bfd->detect_time)); > + ds_put_format(ds, "\tNext TX Time: %lldms from now\n", > + msec_diff(time_msec(), bfd->next_tx)); > + ds_put_format(ds, "\tLast TX Time: %lldms ago\n", > + msec_diff(bfd->last_tx, time_msec())); bfd.h ----- I think that the parameter to bfd_should_send_packet() could be 'const': > +bool bfd_should_send_packet(struct bfd *); odp-util.h ---------- I think this comment can now add SLOW_BFD: /* Mutually exclusive with SLOW_CFM, SLOW_LACP, SLOW_STP. * Could possibly appear with SLOW_IN_BAND. */ ofproto-dpif.c -------------- Why do we now need to call time_refresh() in run_fast()? vswitch.xml ----------- Should we mention RFC 5881 also? "three times the expected reception rate" is the way we configure BFD, I guess, but it's not protocol-specified. "aims comply" => "aims to comply": The Open vSwitch implementation of BFD aims comply faithfully with Many of the column specifications could be more specific. For example: <column name="bfd" key="min_rx" type='{"type": "integer", "minInteger": 1}'> and <column name="bfd_status" key="state"> type='{"type": "string", "enum": ["set", ["UP", "DOWN", "INIT", "ADMIN_DOWN"]]}'> We usually use quotes ``like this'' in vswitch.xml, here: > + omission is "Demand Mode", which we hope to include in future. > + Open vSwitch does not implement the optional Authentication or and here: > + "Echo Mode" features. > + </p> > + <column name="bfd" key="min_rx"> I might say "fastest" instead of "minimum" here: > + The minimum rate, in milliseconds, at which this BFD session is may "be" slower: > + willing to receive BFD control messages. The actual rate may > slower > + if the remote endpoint isn't willing to transmit as quickly as > + specified. Defaults to <code>1000</code>. > + </column> > + > + <column name="bfd" key="min_tx"> "fastest" here? > + The minimum rate, in milliseconds, at which this BFD session is > + willing to transmit BFD control messages. The actual rate may be > + slower if the remote endpoint isn't willing to receive as quickly > as > + specified. Defaults to <code>100</code>. > + </column> > + > + <column name="bfd" key="cpath_down"> The following description doesn't actually tell the user how to use cpath_down. Rephrase slightly? > + Concatenated path down may be used when the local system should not > + have traffic forwarded to it for some reason other than a > connectivty > + failure on the interface being monitored. The local BFD session > will > + notify the remote session of the connectivity problem, at which > time > + the remote session may choose not to forward traffic. Defaults to > + <code>false</code>. > + </column> > + > + <column name="bfd_status" key="state"> Any reason to make the state values all-uppercase? We tend to use lowercase elsewhere in the db. > + State of the BFD session. One of <code>ADMIN_DOWN</code>, > + <code>DOWN</code>, <code>INIT</code>, or <code>UP</code>. > + </column> > + > + <column name="bfd_status" key="forwarding"> > + True if the BFD session believes this <ref table="Interface"/> may > be > + used to forward traffic. Typically this means the local session is "signaling": > + up, and the remote system isn't signalling a problem such as > + concatenated path down. > + </column> Thanks, Ben. _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev