Everyone, thank you for the responses.

Jon, to answer your question: we’re using General Purpose SSD volumes with IOPS of 1500/3000, so based on your definition I guess we’re using the awful ones, since they aren’t Provisioned IOPS. We’re also trying G1 garbage collection.

I also just looked at our application setting overrides and it appears we are using CL=ONE with RF=2 on both of the DCs. We’ve also disabled durable writes, as shown in the keyspace creation statement below:

    CREATE KEYSPACE reporting
      WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east_dc1': '2', 'us-east_dc2': '2'}
      AND durable_writes = false;

The main table we’re interacting with has these settings for compaction (these are Akka persistence journal tables):

    compaction = {
      'bucket_high': '1.5',
      'bucket_low': '0.5',
      'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
      'enabled': 'true',
      'max_threshold': '32',
      'min_sstable_size': '50',
      'min_threshold': '4',
      'tombstone_compaction_interval': '86400',
      'tombstone_threshold': '0.2',
      'unchecked_tombstone_compaction': 'false'
    }

We’re also planning to set a TTL of about 3 hours on the table, since we’re using these tables for business continuity and don’t need the data to persist for long periods.

RICHARD NEY
TECHNICAL DIRECTOR, RESEARCH & DEVELOPMENT
+1 (978) 848.6640 WORK
+1 (916) 846.2353 MOBILE
UNITED STATES
richard....@aspect.com
aspect.com <http://www.aspect.com/>

From: Jonathan Haddad <j...@jonhaddad.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, December 26, 2016 at 2:02 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Has anyone deployed a production cluster with less than 6 nodes per DC?

There's nothing wrong with running a 3 node DC. A million writes an hour is averaging less than 300 writes a second, which is pretty trivial. Are you running provisioned SSD EBS volumes or the traditional, awful ones?

RF=2 with Quorum is kind of pointless, that's the same as CL=ALL. Not recommended. I don't know why your timeouts are happening, but when they do, RF=2 w/ QUORUM is going to make the problem worse. Either use RF=3 or use CL=ONE.

Your management is correct here.
Throwing more hardware at this problem is the wrong solution given that your current hardware should be able to handle over 100x what it's doing right now.

Jon
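
For anyone following the thread, the arithmetic behind the quoted reply can be sketched roughly as below. This is only an illustration: the function names are made up for this example and are not part of any Cassandra driver or API. It assumes the standard quorum rule of floor(RF / 2) + 1 applied to the replica count of a single DC (as with LOCAL_QUORUM); a plain QUORUM spanning both DCs would be computed over the summed replication factors instead.

```python
# Illustrative helpers only; these names are hypothetical, not a Cassandra API.

def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a quorum request: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int, required_acks: int) -> int:
    """Replicas that can be unavailable while requests still succeed."""
    return replication_factor - required_acks


def writes_per_second(writes_per_hour: float) -> float:
    """Average write rate per second for a given hourly volume."""
    return writes_per_hour / 3600


if __name__ == "__main__":
    for rf in (2, 3):
        q = quorum(rf)
        print(
            f"RF={rf}: quorum needs {q} of {rf} replicas, "
            f"tolerates {tolerated_failures(rf, q)} down; "
            f"CL=ONE tolerates {tolerated_failures(rf, 1)} down"
        )
    # One million writes per hour, as mentioned in the thread.
    print(f"{writes_per_second(1_000_000):.1f} writes/s")  # 277.8 writes/s
```

With RF=2 the quorum is 2 of 2 replicas, which is why it behaves like CL=ALL and tolerates no replica being down, while RF=3 still needs 2 but can lose one node.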