Hi Ben On Wed, Apr 4, 2012 at 2:15 PM, Ben Pfaff <b...@nicira.com> wrote:
> On Wed, Apr 04, 2012 at 11:38:42AM -0700, Mehak Mahajan wrote: > > The changes display the cfm_health of an interface. The cfm_health > > is an exponential weighted moving average of the health of all > > remote_mpids. The value can vary from 0 to 100, 100 being very healthy > > and 0 being unhealthy. > > > > Feature #10363 > > Requested-by: Ethan Jackson <et...@nicira.com> > > Signed-off-by: Mehak Mahajan <mmaha...@nicira.com> > > Is this "health" percentage something that we invented, or is it an > implementation of a standard, etc.? > > I think this is really something that we are using. > It's kind of funny that receiving one healthy heartbeat is 100% > health, but receiving one healthy heartbeat and one marginally healthy > one is "less healthy". > > Actually if we received all the "healthy" heartbeats, only then are we healthy. If we receive even one unhealthy heartbeat we will not be 100% healthy. > I don't think that we can realistically expect exactly 7 CCM packets > in 7 fault intervals. It's a matter of luck. It will be OK, if the > CCM packets arrive in the middle of our fault intervals, like this: > > --------------------------------------------> time > X X X fault intervals > ^ ^ ^ ^ ^ ^ ^ ^ ^ CCM reception > > But if reception of CCM packets happens to be aligned closely to the > fault intervals, like this: > > --------------------------------------------> time > X X X fault intervals > ^ ^ ^ ^ ^ ^ ^ ^ CCM reception > > then we could easily end up receiving 6 CCMs in some intervals and 8 > CCMs in other intervals and get a lower "health" score even though the > latter situation is exactly as "healthy" as the former. > > We are expecting to receive 7 healthy heartbeats every 2 fault intervals. According to the code, a fault interval is defined as 3.5 * cfm_interval (which is configurable. Seems we generally set it to be 300 ms). Essentially what we are suggesting is that we should be receivng a heartbeat every cfm_interval. In order to get to an integral number of heartbeats, instead of considering only 1 fault interval, we are considering two (to get a whole number). I can think of a case where in a perfectly healthy link with 0 packet loss, we get 8 CCM frames. But I am not sure if we can receive less than 7 CCM frames in 2 fault intervals ? I am not sure I I understood the diagram clearly. > I think that the initialization for the algorithm is suboptimal. > Suppose that after the first "health interval" we've received one CCM > from a remote MP. I'd expect this to be poor health, 1 out of 7 > (perhaps 14%), but my reading of the code is that we'd give it 57% > because we average it with an initial value of 100%. Would it be > better, the first time that the algorithm runs for a given remote MP, > to use the new calculated value without averaging it with any initial > value? > Yes. I guess we should start the averaging with 0. I will change that to the health of a link being 0 at intialization. > > The "weighted moving average" applies only to each individual remote > MP. If no CCMs are received within a given interval, then the remote > MP will be deleted. If a remote MP appears in the next interval and > we receive all 7 CCMs from it, then the health will jump up to 100% > instantly, even though that's pretty unrealistic (it couldn't have > been more than 50% in the previous interval assuming there was only > one remote MP). > > When a new remote MP appears in the middle of a "health interval", it > will initially get an artificially low health score. For example, if > a new remote MP appears just before the end of a health interval, it > cannot receive an initial health better than 57%, which may be > deceptive. > > I think this comment isn't right: > > + int health_interval; /* Num of fault_intervals used to compute > the > > + * health. */ > > Some lines use tabs for indentation but should use spaces, e.g.: > > + if (ccm_rdi) { > > + fault = true; > > + } > > Will look into the indentation issues. > Please add a space after HMAP_FOR_EACH: > > + HMAP_FOR_EACH(rmp, node, &cfm->remote_mps) { > > Done. > I believe the comment below should also say to return -1 if CFM is not > configured: > > + /* Checks the health of CFM configured on 'ofport'. Returns an > integer > > + * to indicate the health percentage of the 'ofport' which is an > average of > > + * the health of all the remote_mps. Returns an integer between 0 > and 100 > > + * where 0 means that the 'ofport' is very unhealthy and 100 means > the > > + * 'ofport' is perfectly healthy. > > + * > > + * This function may be a null pointer if the ofproto > implementation does > > + * not support CFM. */ > > + int (*get_cfm_health)(const struct ofport *ofport); > > + > > /* Configures spanning tree protocol (STP) on 'ofproto' using the > > * settings defined in 's'. > > * > > Will modify it. > In iface_refresh_cfm_stats(), you can declare cfm_health in the inner > block that uses it, and the cast to int64_t should not be necessary. > > vswitch.xml uses some jargon that users should not need to be > comfortable with, or that we should at least define: > > "3.5/'fault_interval" isn't necessarily meaningful; we didn't > define "fault interval". Perhaps "if not received at the > expected rate"? > > "remote_mpids" isn't defined. Maybe "all remote MPID listed > in <ref column="cfm_remote_mpids"/>s"? > > I don't think that the health is actually refreshed every 2 seconds in > every case. That's approximately true for a 300 ms transmission > interval (7 * 300 ms == 2.1 s). > > Modified to read "refreshed approximately every 2 seconds". > I don't think we need to say "The health of a interface can vary from > 0 to 100" because the ovsdb-doc processor already should say that (and > elsewhere we say it's a percentage). > > I will cleanup the description a bit as you have suggested. > Thanks, > > Ben. >
_______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev