If you are updating columns quite rapidly, you will scatter those columns
across many SSTables as you update them over time. This means that a read
of a specific column will have to look at more SSTables to find the
data. Performing a compaction (using nodetool) will merge the SSTables
into one, making your reads more performant. Of course, the more columns
and the more scattering around, the more I/O.
To your point about "sharing the data around": adding more machines is
always a good way to spread the load, since you add RAM, CPU, and
persistent storage to the cluster. There is probably some point at which
enough machines generate a lot of network traffic, but 10 or 20 machines
shouldn't be an issue. Don't worry about trying to hit a node that has
the data unless your machines are connected across slow network links.
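As an illustration of that last point: most clients just take a list of
nodes and let whichever node they reach coordinate the request. A minimal
sketch, assuming a pycassa-style Python client (the keyspace and column
family names are made up, and the connection API differs a bit between
client versions):

    import pycassa

    # Any node can coordinate a request, so the client only needs *some*
    # of the cluster's nodes; it does not need to know which node
    # actually stores a given key.
    pool = pycassa.ConnectionPool('MyKeyspace',
                                  server_list=['10.0.0.1:9160',
                                               '10.0.0.2:9160',
                                               '10.0.0.3:9160'])
    cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')
    row = cf.get('some_key')   # the contacted node proxies to the replicas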
On 10/07/2010 12:48 AM, Dave Gardner wrote:
Hi all
We're rolling out a Cassandra cluster on EC2 and I've got a couple of
questions about settings. I'm interested to hear what other people
have experienced with different values, and generally to seek advice.
*GCGraceSeconds*
Currently we configure one setting for all CFs. We experimented with
this a bit during testing, including changing it from the default (10
days) to 3 hours. Our use case involves a lot of rewriting of the columns
for any given key; we probably rewrite around 5 million per day.
We are thinking of setting this to around 3 days for production so
that we don't have old copies of data hanging around. Is there anything
obviously wrong with this? Out of curiosity, would there be any
performance issues if we had this set to 30 days? My understanding is
that it would only affect the amount of disk space used.
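For reference, this is the setting I mean; a 0.6-style storage-conf.xml
snippet with the 3-day value (259200 seconds) we're considering:

    <!-- storage-conf.xml (0.6-style). Tombstones older than this can be
         collected during compaction, so it should stay longer than the
         longest outage you expect a node to recover from. -->
    <GCGraceSeconds>259200</GCGraceSeconds>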
However, Ben Black suggests here that the cleanup will actually only
affect data deleted through the API:
http://comments.gmane.org/gmane.comp.db.cassandra.user/4437
In that case, I guess we need not worry too much about the
setting, since we are only ever updating, never deleting. Is this the
case?
*Replication factor*
Our use case is many more writes than reads, but when we do have reads
they're random (we're not currently using Hadoop to read entire CFs).
I'm wondering what sort of RF level to have for a cluster like this. We
currently have 12 nodes and RF=4.
To improve read performance I'm thinking of upping the number of nodes
and keeping RF at 4. My understanding is that this means we're sharing
the data around more. However, it also means a client read to a random
node has less chance of actually connecting to one of the nodes that
holds the data. I'm assuming this is fine. What sort of RFs do others
use? With a huge cluster like the recently mentioned 400-node US govt
cluster, what sort of RF is sane?
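For reference, our setup corresponds to something like this in a
0.6-style storage-conf.xml (keyspace name illustrative):

    <Keyspace Name="MyKeyspace">
      <!-- Each row is stored on 4 replicas, however many nodes are in the ring. -->
      <ReplicationFactor>4</ReplicationFactor>
      <!-- ReplicaPlacementStrategy, EndPointSnitch, etc. omitted -->
    </Keyspace>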
On a similar note (read perf), I'm guessing that reading at a weaker
consistency level will bring gains. I gleaned this from this slide,
amongst other places:
http://www.slideshare.net/mobile/benjaminblack/introduction-to-cassandra-replication-and-consistency#13
Is this true, or will read repair still hammer the disks on all the
machines that hold the data? Again, I guess it's better to have a low RF
so there are fewer copies of the data to inspect when doing read repair.
Will this result in better read performance?
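For illustration, this is roughly what I mean by reading at a weaker
consistency level; a sketch assuming a pycassa-style client (names are
made up, and the import path for ConsistencyLevel varies between client
versions):

    import pycassa
    from pycassa.cassandra.ttypes import ConsistencyLevel  # path may differ by version

    pool = pycassa.ConnectionPool('MyKeyspace', server_list=['10.0.0.1:9160'])
    cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')

    # CL.ONE: return as soon as a single replica answers (fastest, but
    # may return stale data until read repair catches up).
    row = cf.get('some_key', read_consistency_level=ConsistencyLevel.ONE)

    # CL.QUORUM: with RF=4, wait for 3 replicas to answer before
    # returning, trading read latency for stronger consistency.
    row = cf.get('some_key', read_consistency_level=ConsistencyLevel.QUORUM)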
Thanks
dave