Advice on settings

Dave Gardner Thu, 07 Oct 2010 00:49:05 -0700

Hi all

We're rolling out a Cassandra cluster on EC2 and I've got a couple of
questions about settings. I'm interested to hear what other people
have experienced with different values, and I'd welcome any advice.

*gcgraceseconds*

Currently we configure one setting for all CFs. We experimented with
this a bit during testing, including changing from the default (10
days) to 3 hours. Our use case involves a lot of rewriting of the
columns for any given key. We probably rewrite around 5 million per day.

We are thinking of setting this to around 3 days for production so
that we don't have old copies of data hanging around. Is there anything
obviously wrong with this? Out of curiosity, would there be any
performance issues if we set this to 30 days? My understanding is
that it would only affect the amount of disk space used.

However, Ben Black suggests here that the cleanup will actually only
affect data deleted through the API:

http://comments.gmane.org/gmane.comp.db.cassandra.user/4437

In this case, I guess that we need not worry too much about the
setting since we are actually updating, never deleting. Is this the
case?
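
To make the arithmetic concrete, here is a rough sketch of how I
understand the setting. It assumes the grace period only controls how
long deletion markers (tombstones) are kept before compaction may drop
them, as the linked thread suggests. The names below are made up for
illustration and are not part of any Cassandra API.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

# Values mentioned in this mail, converted to seconds.
CANDIDATES = {
    "3 hours (tested)": 3 * SECONDS_PER_HOUR,
    "3 days (proposed)": 3 * SECONDS_PER_DAY,
    "10 days (default)": 10 * SECONDS_PER_DAY,
    "30 days": 30 * SECONDS_PER_DAY,
}


def is_purgeable(deleted_at: int, now: int, gc_grace_seconds: int) -> bool:
    """Assumption: a tombstone may be dropped once it is at least
    gc_grace_seconds old. Timestamps are in seconds."""
    return now - deleted_at >= gc_grace_seconds


if __name__ == "__main__":
    for label, seconds in CANDIDATES.items():
        print(f"{label}: {seconds} seconds")
```

If that assumption holds, columns that are overwritten rather than
deleted never produce a tombstone, so the value would not apply to them.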

*Replication factor*

Our use case is many more writes than reads, but when we do have reads
they're random (we're not currently using Hadoop to read entire CFs).
I'm wondering what sort of level of RF to have for a cluster. We
currently have 12 nodes and RF=4.

To improve read performance I'm thinking of upping the number of nodes
and keeping RF at 4. My understanding is that this means we're sharing
the data around more. However, it also means a client read to a random
node has less chance of actually connecting to one of the nodes
holding the data. I'm assuming this is fine. What sort of RFs do others
use? With a huge cluster like the recently mentioned 400 node US govt
cluster, what sort of RF is sane?
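
A quick sketch of the "less chance" point above. It assumes the client
picks a node uniformly at random and that replicas are spread evenly, so
any one node holds a given key with probability RF / N. The helper name
is hypothetical.

```python
from fractions import Fraction


def replica_hit_probability(nodes: int, rf: int) -> Fraction:
    """Chance that a randomly chosen node is a replica for a given key,
    assuming evenly spread replicas and a uniformly random client pick."""
    if nodes <= 0 or rf <= 0:
        raise ValueError("nodes and rf must be positive")
    # A cluster cannot hold more distinct replicas than it has nodes.
    return Fraction(min(rf, nodes), nodes)


if __name__ == "__main__":
    for nodes in (12, 24, 400):
        print(nodes, replica_hit_probability(nodes, 4))
```

Under those assumptions our current 12 nodes with RF=4 give 1/3, doubling
to 24 nodes gives 1/6, and a 400 node cluster with RF=4 gives 1/100.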

On a similar note (read perf), I'm guessing that reading at a weak
consistency level will bring gains. Gleaned from this slide amongst
other places:

http://www.slideshare.net/mobile/benjaminblack/introduction-to-cassandra-replication-and-consistency#13

Is this true, or will read repair still hammer disks in all the
machines holding the data? Again, I guess it's better to have a low RF
so there are fewer copies of the data to inspect when doing read repair.
Will this result in better read performance?
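
For the consistency level part, the number of replica responses a read
has to wait for can be sketched as below. This uses the usual quorum
formula (RF // 2 + 1). It only covers what the read blocks on and says
nothing about any read repair work done in the background, which is the
part I'm unsure about. The function is illustrative, not a Cassandra API.

```python
def replicas_blocked_on(rf: int, level: str) -> int:
    """Replica responses a read waits for at a given consistency level."""
    if rf <= 0:
        raise ValueError("rf must be positive")
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return rf // 2 + 1  # strict majority of the replicas
    if level == "ALL":
        return rf
    raise ValueError(f"unsupported consistency level: {level}")


if __name__ == "__main__":
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, replicas_blocked_on(4, level))
```

With RF=4 that is 1 response at ONE, 3 at QUORUM and 4 at ALL.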

Thanks

dave

--
*Dave Gardner*
Technical Architect

*Imagini Europe Limited*
7 Moor Street, London W1D 5NB

Phone: +44 20 7734 7033
Skype: daveg79
Email: dave.gard...@imagini.net
Web: http://www.visualdna.com

Imagini Europe Limited, Company number 5565112 (England
and Wales), Registered address: c/o Bird & Bird,
90 Fetter Lane, London, EC4A 1EQ, United Kingdom
