On Wed, Apr 04, 2012 at 11:38:42AM -0700, Mehak Mahajan wrote:
> The changes display the cfm_health of an interface.  The cfm_health
> is an exponential weighted moving average of the health of all
> remote_mpids.  The value can vary from 0 to 100, 100 being very healthy
> and 0 being unhealthy.
> 
> Feature #10363
> Requested-by: Ethan Jackson <et...@nicira.com>
> Signed-off-by: Mehak Mahajan <mmaha...@nicira.com>

Is this "health" percentage something that we invented, or is it an
implementation of a standard, etc.?

It's kind of funny that receiving one healthy heartbeat is 100%
health, but receiving one healthy heartbeat and one marginally healthy
one is "less healthy".

I don't think that we can realistically expect exactly 7 CCM packets
in 7 fault intervals.  It's a matter of luck.  It will be OK, if the
CCM packets arrive in the middle of our fault intervals, like this:

   -------------------------------------------->  time
   X                    X                    X    fault intervals
^     ^     ^     ^     ^     ^     ^     ^     ^ CCM reception

But if reception of CCM packets happens to be aligned closely to the
fault intervals, like this:

   -------------------------------------------->  time
   X                    X                    X    fault intervals
   ^     ^     ^     ^     ^     ^     ^     ^    CCM reception

then we could easily end up receiving 6 CCMs in some intervals and 8
CCMs in other intervals and get a lower "health" score even though the
latter situation is exactly as "healthy" as the former.

I think that the initialization for the algorithm is suboptimal.
Suppose that after the first "health interval" we've received one CCM
from a remote MP.  I'd expect this to be poor health, 1 out of 7
(perhaps 14%), but my reading of the code is that we'd give it 57%
because we average it with an initial value of 100%.  Would it be
better, the first time that the algorithm runs for a given remote MP,
to use the new calculated value without averaging it with any initial
value?

The "weighted moving average" applies only to each individual remote
MP.  If no CCMs are received within a given interval, then the remote
MP will be deleted.  If a remote MP appears in the next interval and
we receive all 7 CCMs from it, then the health will jump up to 100%
instantly, even though that's pretty unrealistic (it couldn't have
been more than 50% in the previous interval assuming there was only
one remote MP).

When a new remote MP appears in the middle of a "health interval", it
will initially get an artificially low health score.  For example, if
a new remote MP appears just before the end of a health interval, it
cannot receive an initial health better than 57%, which may be
deceptive.

I think this comment isn't right:
> +    int health_interval;      /* Num of fault_intervals used to compute the
> +                               * health. */

Some lines use tabs for indentation but should use spaces, e.g.:
> +        if (ccm_rdi) {
> +         fault = true;
> +        }

Please add a space after HMAP_FOR_EACH:
> +            HMAP_FOR_EACH(rmp, node, &cfm->remote_mps) {

I believe the comment below should also say to return -1 if CFM is not
configured:
> +    /* Checks the health of CFM configured on 'ofport'.  Returns an integer
> +     * to indicate the health percentage of the 'ofport' which is an average 
> of
> +     * the health of all the remote_mps.  Returns an integer between 0 and 
> 100
> +     * where 0 means that the 'ofport' is very unhealthy and 100 means the
> +     * 'ofport' is perfectly healthy.
> +     *
> +     * This function may be a null pointer if the ofproto implementation does
> +     * not support CFM. */
> +    int (*get_cfm_health)(const struct ofport *ofport);
> +
>      /* Configures spanning tree protocol (STP) on 'ofproto' using the
>       * settings defined in 's'.
>       *

In iface_refresh_cfm_stats(), you can declare cfm_health in the inner
block that uses it, and the cast to int64_t should not be necessary.

vswitch.xml uses some jargon that users should not need to be
comfortable with, or that we should at least define:

        "3.5/'fault_interval" isn't necessarily meaningful; we didn't
        define "fault interval".  Perhaps "if not received at the
        expected rate"?

        "remote_mpids" isn't defined.  Maybe "all remote MPID listed
        in <ref column="cfm_remote_mpids"/>s"?

I don't think that the health is actually refreshed every 2 seconds in
every case.  That's approximately true for a 300 ms transmission
interval (7 * 300 ms == 2.1 s).

I don't think we need to say "The health of a interface can vary from
0 to 100" because the ovsdb-doc processor already should say that (and
elsewhere we say it's a percentage).

Thanks,

Ben.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to