Everyone, thank you for the responses.
Jon, to answer your question, we’re using General Purpose SSD volumes with IOPS
of 1500/3000, so by your definition I guess we’re using the awful ones, since
they aren’t provisioned IOPS. We’re also trying G1 garbage collection.
I also just looked at our application setting overrides, and it appears we are
using CL=ONE with RF=2 in both of the DCs. We’ve also disabled durable writes,
as shown in the keyspace creation statement below:
CREATE KEYSPACE reporting WITH replication = {'class':
'NetworkTopologyStrategy', 'us-east_dc1': '2', 'us-east_dc2': '2'} AND
durable_writes = false;
The main table we’re interacting with has these compaction settings (these
are Akka persistence journal tables):
compaction = {'bucket_high': '1.5', 'bucket_low': '0.5', 'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'enabled':
'true', 'max_threshold': '32', 'min_sstable_size': '50', 'min_threshold': '4',
'tombstone_compaction_interval': '86400', 'tombstone_threshold': '0.2',
'unchecked_tombstone_compaction': 'false'}
We’re also planning to set a TTL of about 3 hours on the table, since we’re
using these tables for business continuity and don’t need the data to persist
for long periods.
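For reference, a minimal sketch of what that TTL change might look like, written in Python so the seconds conversion is explicit. The table name `reporting.journal` is a placeholder (the actual journal table name is not given above); `default_time_to_live` is the standard CQL table option and takes a value in seconds. This only prints the statement, it does not run it.

```python
# Build a CQL statement that sets a table-level default TTL of about 3 hours.
TTL_HOURS = 3
SECONDS_PER_HOUR = 60 * 60

# Assumption: placeholder table name, not taken from the thread.
TABLE = "reporting.journal"


def default_ttl_statement(table: str, ttl_hours: int) -> str:
    """Return an ALTER TABLE statement setting default_time_to_live in seconds."""
    ttl_seconds = ttl_hours * SECONDS_PER_HOUR
    return f"ALTER TABLE {table} WITH default_time_to_live = {ttl_seconds};"


statement = default_ttl_statement(TABLE, TTL_HOURS)
print(statement)
# ALTER TABLE reporting.journal WITH default_time_to_live = 10800;
```

A table-level default TTL applies to data written after the change; rows already on disk keep whatever TTL they were written with.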
RICHARD NEY
TECHNICAL DIRECTOR, RESEARCH & DEVELOPMENT
+1 (978) 848.6640 WORK
+1 (916) 846.2353 MOBILE
UNITED STATES
[email protected]
aspect.com <http://www.aspect.com/>
From: Jonathan Haddad <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, December 26, 2016 at 2:02 PM
To: "[email protected]" <[email protected]>
Subject: Re: Has anyone deployed a production cluster with less than 6 nodes
per DC?
There's nothing wrong with running a 3 node DC. A million writes an hour is
averaging less than 300 writes a second, which is pretty trivial.
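As a quick check of that arithmetic, here is a small Python sketch converting the hourly volume to an average per-second rate. It assumes writes are spread evenly over the hour, so it says nothing about bursts.

```python
# Convert an hourly write volume into an average per-second rate.
WRITES_PER_HOUR = 1_000_000
SECONDS_PER_HOUR = 60 * 60


def average_writes_per_second(writes_per_hour: int) -> float:
    """Average rate, assuming writes are spread evenly across the hour."""
    return writes_per_hour / SECONDS_PER_HOUR


rate = average_writes_per_second(WRITES_PER_HOUR)
print(f"{rate:.1f} writes/s")  # 277.8 writes/s
```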
Are you running provisioned SSD EBS volumes or the traditional, awful ones?
RF=2 with QUORUM is kind of pointless; that's the same as CL=ALL. Not
recommended. I don't know why your timeouts are happening, but when they do,
RF=2 w/ QUORUM is going to make the problem worse. Either use RF=3 or use
CL=ONE.
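The quorum point can be sketched in a few lines of Python. Cassandra computes a quorum as floor(replicas / 2) + 1. The sketch assumes the quorum is evaluated against RF replicas, which is the case for a single DC or for LOCAL_QUORUM within one DC; the function names are illustrative only.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerable_replica_failures(replication_factor: int) -> int:
    """How many replicas can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


for rf in (2, 3):
    print(f"RF={rf}: quorum={quorum(rf)}, "
          f"can lose {tolerable_replica_failures(rf)} replica(s)")
# RF=2: quorum=2, can lose 0 replica(s)
# RF=3: quorum=2, can lose 1 replica(s)
```

With RF=2 the quorum equals the full replica set, so a single slow or unavailable replica fails the request, just as it would with CL=ALL. RF=3 keeps the same quorum size of 2 but tolerates one replica being down.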
Your management is correct here. Throwing more hardware at this problem is the
wrong solution given that your current hardware should be able to handle over
100x what it's doing right now.
Jon