It's a good idea to increase phi_convict_threshold (in cassandra.yaml) to at least 12 on EC2. Using placement groups and single-tenant (dedicated) instances will certainly help as well.
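
For reference, a minimal sketch of that change in cassandra.yaml. The stock default is 8; 12 is only the starting point suggested above, so treat the exact value as an assumption to tune for your cluster:

```yaml
# cassandra.yaml
# Failure detector sensitivity. Higher values make a node slower to
# mark a peer as down, which tolerates the network jitter seen on EC2.
# Default is 8; 12 is the suggested starting point, not a tuned value.
phi_convict_threshold: 12
```

The file is read at startup, so plan on a rolling restart for the nodes to pick up the new value.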
Another optimization would be dedicating an Elastic Network Interface
(http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html)
specifically for gossip traffic.

On Mon, May 19, 2014 at 1:36 PM, Phil Burress <philburress...@gmail.com> wrote:

> Has anyone experienced network i/o issues with ec2? We are seeing a lot of
> these in our logs:
>
> HintedHandOffManager.java (line 477) Timed out replaying hints to
> /10.0.x.xxx; aborting (15 delivered)
>
> and these...
>
> Cannot handshake version with /10.0.x.xxx
>
> and these...
>
> java.io.IOException: Cannot proceed on repair because a neighbor
> (/10.0.x.xxx) is dead: session failed
>
> Occurs on all of our nodes. Even though in all cases, the host that is
> being reported as down or unavailable is up and readily 'pingable'.
>
> We are using shared tenancy on all our nodes (instance type m1.xlarge)
> with cassandra 2.0.7. Any suggestions on how to debug these errors?
>
> Is there a recommendation to move to Placement Groups for Cassandra?
>
> Thanks!
>
> Phil

--
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com