On Wed, Apr 04, 2012 at 11:38:42AM -0700, Mehak Mahajan wrote: > The changes display the cfm_health of an interface. The cfm_health > is an exponential weighted moving average of the health of all > remote_mpids. The value can vary from 0 to 100, 100 being very healthy > and 0 being unhealthy. > > Feature #10363 > Requested-by: Ethan Jackson <et...@nicira.com> > Signed-off-by: Mehak Mahajan <mmaha...@nicira.com>
Is this "health" percentage something that we invented, or is it an implementation of a standard, etc.? It's kind of funny that receiving one healthy heartbeat is 100% health, but receiving one healthy heartbeat and one marginally healthy one is "less healthy". I don't think that we can realistically expect exactly 7 CCM packets in 7 fault intervals. It's a matter of luck. It will be OK, if the CCM packets arrive in the middle of our fault intervals, like this: --------------------------------------------> time X X X fault intervals ^ ^ ^ ^ ^ ^ ^ ^ ^ CCM reception But if reception of CCM packets happens to be aligned closely to the fault intervals, like this: --------------------------------------------> time X X X fault intervals ^ ^ ^ ^ ^ ^ ^ ^ CCM reception then we could easily end up receiving 6 CCMs in some intervals and 8 CCMs in other intervals and get a lower "health" score even though the latter situation is exactly as "healthy" as the former. I think that the initialization for the algorithm is suboptimal. Suppose that after the first "health interval" we've received one CCM from a remote MP. I'd expect this to be poor health, 1 out of 7 (perhaps 14%), but my reading of the code is that we'd give it 57% because we average it with an initial value of 100%. Would it be better, the first time that the algorithm runs for a given remote MP, to use the new calculated value without averaging it with any initial value? The "weighted moving average" applies only to each individual remote MP. If no CCMs are received within a given interval, then the remote MP will be deleted. If a remote MP appears in the next interval and we receive all 7 CCMs from it, then the health will jump up to 100% instantly, even though that's pretty unrealistic (it couldn't have been more than 50% in the previous interval assuming there was only one remote MP). When a new remote MP appears in the middle of a "health interval", it will initially get an artificially low health score. For example, if a new remote MP appears just before the end of a health interval, it cannot receive an initial health better than 57%, which may be deceptive. I think this comment isn't right: > + int health_interval; /* Num of fault_intervals used to compute the > + * health. */ Some lines use tabs for indentation but should use spaces, e.g.: > + if (ccm_rdi) { > + fault = true; > + } Please add a space after HMAP_FOR_EACH: > + HMAP_FOR_EACH(rmp, node, &cfm->remote_mps) { I believe the comment below should also say to return -1 if CFM is not configured: > + /* Checks the health of CFM configured on 'ofport'. Returns an integer > + * to indicate the health percentage of the 'ofport' which is an average > of > + * the health of all the remote_mps. Returns an integer between 0 and > 100 > + * where 0 means that the 'ofport' is very unhealthy and 100 means the > + * 'ofport' is perfectly healthy. > + * > + * This function may be a null pointer if the ofproto implementation does > + * not support CFM. */ > + int (*get_cfm_health)(const struct ofport *ofport); > + > /* Configures spanning tree protocol (STP) on 'ofproto' using the > * settings defined in 's'. > * In iface_refresh_cfm_stats(), you can declare cfm_health in the inner block that uses it, and the cast to int64_t should not be necessary. vswitch.xml uses some jargon that users should not need to be comfortable with, or that we should at least define: "3.5/'fault_interval" isn't necessarily meaningful; we didn't define "fault interval". Perhaps "if not received at the expected rate"? "remote_mpids" isn't defined. Maybe "all remote MPID listed in <ref column="cfm_remote_mpids"/>s"? I don't think that the health is actually refreshed every 2 seconds in every case. That's approximately true for a 300 ms transmission interval (7 * 300 ms == 2.1 s). I don't think we need to say "The health of a interface can vary from 0 to 100" because the ovsdb-doc processor already should say that (and elsewhere we say it's a percentage). Thanks, Ben. _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev