On Sun, Sep 25, 2011 at 12:52 PM, Yang <teddyyyy...@gmail.com> wrote: > Thanks Brandon. > > I suspected that, but I think that's precluded as a possibility since > I setup another background job to do > echo | nc other_box 7000 > in a loop, > this job seems to be working fine all the time, so network seems fine.
This isn't measuring latency, however. That is how the failure detector works, using probability to estimate the likelihood that a given host is alive, based on previous history. The situation on ec2 is something like the following: 99% of pings are 1ms, but sometimes there are brief periods of 100ms, and this is where the FD says "this is not realistic, I think the host is dead" but then receives the ping, and thus the flapping. I've seen it a million times, increasing the phi threshold always solves it. -Brandon