Hi all,

We're rolling out a Cassandra cluster on EC2 and I've got a couple of questions about settings. I'm interested to hear what other people have experienced with different values, and I'd generally welcome advice.
*gc_grace_seconds*

Currently we configure one value for all CFs. We experimented with this a bit during testing, including changing it from the default (10 days) down to 3 hours. Our use case involves lots of rewriting of the columns for any given key; we probably rewrite around 5 million per day. We are thinking of setting this to around 3 days for production so that we don't have old copies of data hanging around. Is there anything obviously wrong with that?

Out of curiosity, would there be any performance issue if we set it to 30 days? My understanding is that it would only affect the amount of disk space used. However, Ben Black suggests here that the cleanup actually only affects data deleted through the API:

http://comments.gmane.org/gmane.comp.db.cassandra.user/4437

If that's the case, I guess we need not worry too much about this setting, since we only ever update and never delete. Is that right?

*Replication factor*

Our use case is many more writes than reads, but when we do read, the reads are random (we're not currently using Hadoop to read entire CFs). I'm wondering what level of RF to use for a cluster. We currently have 12 nodes and RF=4. To improve read performance I'm thinking of adding more nodes while keeping RF at 4. My understanding is that this spreads the data around more. However, it also means a client read sent to a random node has a lower chance of actually landing on one of the nodes that holds the data (4 out of 12 today, and less as we add nodes). I'm assuming this is fine. What sort of RFs do others use? With a huge cluster like the recently mentioned 400-node US government cluster, what sort of RF is sane?

On a similar note (read performance), I'm guessing that reading at a weaker consistency level will bring gains. Gleaned from this slide, amongst other places:

http://www.slideshare.net/mobile/benjaminblack/introduction-to-cassandra-replication-and-consistency#13

Is this true, or will read repair still hammer the disks on all the machines that hold the data?
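For reference, here's roughly how we'd make the per-CF change, assuming the cassandra-cli `update column family` syntax (the CF name `UserEvents` is just an example, and 3 days works out to 259200 seconds):

```
update column family UserEvents with gc_grace = 259200;
```

Happy to be corrected if there's a better way to roll this out across all CFs at once.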
Again, I guess it's better to have a low RF so there are fewer copies of the data to inspect when doing read repair. Will this result in better read performance?

Thanks,

Dave

--
*Dave Gardner*
Technical Architect

*Imagini Europe Limited*
7 Moor Street, London W1D 5NB
+44 20 7734 7033
skype: daveg79
dave.gard...@imagini.net
http://www.visualdna.com

Imagini Europe Limited, Company number 5565112 (England and Wales), Registered address: c/o Bird & Bird, 90 Fetter Lane, London, EC4A 1EQ, United Kingdom