Thanks, Brandon. I suspected that too, but I think it is ruled out: I set up another background job that runs echo | nc other_box 7000 in a loop, and that job has been succeeding every time, so the network seems fine.
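
A minimal sketch of that kind of probe loop, for anyone who wants to repeat the check. The function names, the 2-second nc timeout (-w 2), the one-second interval, and the PROBE_INTERVAL variable are illustrative assumptions, not details of the job described above:

```shell
#!/bin/sh
# probe HOST PORT: succeed if nc can open a TCP connection to HOST:PORT.
# The -w 2 connect/idle timeout is an assumed value.
probe() {
    echo | nc -w 2 "$1" "$2" >/dev/null 2>&1
}

# log_probe HOST PORT: print one timestamped result line per attempt.
log_probe() {
    if probe "$1" "$2"; then
        echo "$(date '+%F %T') ok $1:$2"
    else
        echo "$(date '+%F %T') FAIL $1:$2"
    fi
}

# probe_loop HOST PORT [COUNT]: probe repeatedly; COUNT=0 (default) means forever.
probe_loop() {
    host=$1 port=$2 count=${3:-0} i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        log_probe "$host" "$port"
        i=$((i + 1))
        sleep "${PROBE_INTERVAL:-1}"
    done
}

# Example (runs in the background until killed):
# probe_loop other_box 7000 >> probe.log &
```

The timestamps make it possible to line up any FAIL entries against the Up/Dead events in the Cassandra log. Note that this only shows that a TCP connection to port 7000 can be opened; it says nothing about how regularly gossip messages arrive.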

Yang

On Sun, Sep 25, 2011 at 10:39 AM, Brandon Williams <dri...@gmail.com> wrote:
> On Sat, Sep 24, 2011 at 4:54 PM, Yang <teddyyyy...@gmail.com> wrote:
>> I'm using 1.0.0
>>
>>
>> there seems to be too many node Up/Dead events detected by the failure
>> detector.
>> I'm using a 2 node cluster on EC2, in the same region, same security
>> group, so I assume the message drop
>> rate should be fairly low.
>> but in about every 5 minutes, I'm seeing some node detected as down,
>> and then Up again quickly
>
> This is fairly common on ec2 due to wild variance in the network.
> Increase your phi_convict_threshold to 10 or higher (but I wouldn't go
> over 12, this is roughly an exponential increase)
>
> -Brandon