Advice on settings
Hi all,

We're rolling out a Cassandra cluster on EC2 and I've got a couple of questions about settings. I'm interested to hear what other people have experienced with different values, and generally seek advice.

*gcgraceseconds*

Currently we configure one setting for all CFs. We experimented with this a bit during testing, including changing from the default (10 days) to 3 hours. Our use case involves lots of rewriting of the columns for any given key; we probably rewrite around 5 million columns per day.

We are thinking of setting this to around 3 days for production so that we don't have old copies of data hanging around. Is there anything obviously wrong with this? Out of curiosity, would there be any performance issues if we had this set to 30 days? My understanding is that it would only affect the amount of disk space used.

However, Ben Black suggests here that the cleanup will actually only impact data deleted through the API:

http://comments.gmane.org/gmane.comp.db.cassandra.user/4437

In this case, I guess that we need not worry too much about the setting, since we are actually updating, never deleting. Is this the case?

*Replication factor*

Our use case is many more writes than reads, but when we do have reads they're random (we're not currently using Hadoop to read entire CFs). I'm wondering what level of RF to have for a cluster. We currently have 12 nodes and RF=4.

To improve read performance I'm thinking of upping the number of nodes and keeping RF at 4. My understanding is that this means we're sharing the data around more. However, it also means a client read to a random node has less chance of actually connecting to one of the nodes with the data on it. I'm assuming this is fine. What sort of RFs do others use? With a huge cluster like the recently mentioned 400-node US govt cluster, what sort of RF is sane?

On a similar note (read perf), I'm guessing that reading at a weak consistency level will bring gains. Gleaned from this slide amongst other places:

http://www.slideshare.net/mobile/benjaminblack/introduction-to-cassandra-replication-and-consistency#13

Is this true, or will read repair still hammer the disks on all the machines with the data on them? Again, I guess it's better to have a low RF so there are fewer copies of the data to inspect when doing read repair. Will this result in better read performance?

Thanks

dave

--
Dave Gardner
Technical Architect, Imagini Europe Limited
http://www.visualdna.com
Creating and using indices
I'm currently trying to get started on secondary indices in Cassandra 0.7.0svn, but without any luck so far. I have the following code that should create an index on ColA:

    KsDef ksDef = client.describe_keyspace("MyKeyspace");
    List<CfDef> cfs = ksDef.cf_defs;
    String columnFamily = "MyCF";
    for (CfDef cf : ksDef.cf_defs) {
        if (cf.getName().equals(columnFamily)) {
            ColumnDef cd1 = new ColumnDef("ColA".getBytes(),
                    "org.apache.cassandra.db.marshal.UTF8Type");
            cd1.index_type = IndexType.KEYS;
            cf.column_metadata.add(cd1);
            // Write changes back to DB
            client.system_update_column_family(cf);
        }
    }

This seems to work nicely: when I turn up Cassandra's logging level, it appears to apply some migrations. But when I then try to use a pycassa client to read an indexed slice, I only get an InvalidRequestException(why='No indexed columns present in index clause with operator EQ'):

    cf = pycassa.ColumnFamily(client, "MyCF")
    ex = pycassa.index.create_index_expression('ColA', '50',
            pycassa.index.IndexOperator.LTE)
    clause = pycassa.index.create_index_clause([ex])
    cf.get_indexed_slices(clause)

Am I missing something?

Regards,
Chris
Re: Creating and using indices
If I remember correctly, the only operator supported for secondary indexes right now is EQ, not LTE (or the others).

On Thu, Oct 7, 2010 at 6:13 AM, Christian Decker wrote:
> I'm currently trying to get started on secondary indices in Cassandra
> 0.7.0svn, but without any luck so far. [...]

--
Riptano
Software and Support for Apache Cassandra
http://www.riptano.com/
mden...@riptano.com
m: 512.587.0900 f: 866.583.2068
Re: Creating secondary indices after startup
This is not in beta2 but will be in 0.7.0
(https://issues.apache.org/jira/browse/CASSANDRA-1532)

On Thu, Oct 7, 2010 at 7:30 AM, wrote:
> Hello,
>
> I am trying to work out the new secondary index code on my own, as there
> is no documentation. I've seen the 'Cassandra explained' presentation and
> the tests related to index queries. What I'd like to know is
>
> 1) the basics of the secondary index mechanism (a short human-readable
> description, as the source is confusing me), and
>
> 2) how I can go about implementing support for creating indices on a live
> cluster (if it can be done)
>
> Alexander Altanis

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
Re: Creating and using indices
From what I've tested, you must include at least one expression with the EQ operator.

On Thu, Oct 7, 2010 at 3:45 PM, Matthew Dennis wrote:
> If I remember correctly, the only operator supported for secondary
> indexes right now is EQ, not LTE (or the others). [...]
Re: Creating and using indices
So basically my indices should work? Is there a simple way to check that, so that we can rule it out?

Is LTE working (or on the roadmap for the 0.7.0 release)? Or does the EQ operator have to match anything, or can I just add an EQ on a non-existent value to get LTE working too?

Regards,
Chris

On Thu, Oct 7, 2010 at 4:57 PM, Petr Odut wrote:
> From what I've tested, you must include at least one expression with the
> EQ operator. [...]
Re: Creating and using indices
Actually, you're trying to add an index to an already existing column family here, right? That's not yet supported, but should be soon.

- Tyler

On Thu, Oct 7, 2010 at 10:13 AM, Christian Decker wrote:
> So basically my indices should work? Is there a simple way to check that,
> so that we can rule it out? [...]
Re: Creating and using indices
On Thu, Oct 7, 2010 at 10:13 AM, Christian Decker wrote:
> So basically my indices should work? Is there a simple way to check that,
> so that we can rule it out?
>
> Is LTE working (or on the roadmap for the 0.7.0 release)?

No, LT[E] is not on the roadmap for primary index clauses (GT[E] is, for 0.7.1). So you would want to create an index with an inverted comparator, to turn LTE into GTE.

> Or does the EQ operator have to match anything, or can I just add an EQ
> on a non-existent value to get LTE working too?

If you ask for EQ not-existing-value you will get no results back, of course.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
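For reference, a minimal sketch of a clause the 0.7 index code will accept, using the same pycassa calls as Christian's snippet above (the 'ColA'/'50' names and the 'client' connection are just his placeholders; EQ is the one operator the replies above identify as working):

    import pycassa
    from pycassa.index import (create_index_expression,
                               create_index_clause, IndexOperator)

    cf = pycassa.ColumnFamily(client, "MyCF")  # 'client' as set up earlier
    # At least one EQ expression on an indexed column is required;
    # LTE on its own is rejected with the InvalidRequestException seen above.
    ex = create_index_expression('ColA', '50', IndexOperator.EQ)
    clause = create_index_clause([ex])
    rows = cf.get_indexed_slices(clause)  # the matching rows, if any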
Re: Advice on settings
> However Ben Black suggests here that the cleanup will actually only
> impact data deleted through the API:
>
> http://comments.gmane.org/gmane.comp.db.cassandra.user/4437
>
> In this case, I guess that we need not worry too much about the setting,
> since we are actually updating, never deleting. Is this the case?

Yes, that's correct. GCGraceSeconds affects the lifetime of tombstones, which are needed only when deleting data. Simple overwrites do not involve tombstones, and GCGraceSeconds is not in play. Overwritten columns are eliminated when their sstables are compacted.

> *Replication factor*
>
> Our use case is many more writes than reads, but when we do have reads
> they're random (we're not currently using Hadoop to read entire CFs).
> I'm wondering what level of RF to have for a cluster. We currently have
> 12 nodes and RF=4.
>
> To improve read performance I'm thinking of upping the number of nodes
> and keeping RF at 4.

In the absence of other bottlenecks, this makes sense, yes.

Another thing to consider is whether to turn off (if on 0.6) or adjust the frequency of (in 0.7) read repair. If read repair is turned on (0.6) or at 100% (0.7), each read will hit RF nodes (even if you are reading at a low consistency level; with read repair, the other nodes are still asked to read the data and send back a checksum). If you expect to be I/O bound due to low locality of access (the random reads), turning it down could potentially yield up to a factor-of-RF (in your case, 4x) improvement in expected read throughput. Whether or not turning off or decreasing read repair is acceptable is of course up to your situation; in particular, if you read at e.g. QUORUM you will still read from 3 nodes (in the case of RF=4) regardless of read repair settings.

> My understanding is that this means we're sharing the data around more.

Not sure what you mean. Given a constant RF of 4, you will still have 4 copies, but they will be distributed across additional machines, meaning each machine has less data and presumably gets fewer requests.

> However, it also means a client read to a random node has less chance of
> actually connecting to one of the nodes with the data on it.

Keep in mind that hitting the right node is somewhat of a special case, and the overhead is limited to whatever the cost of RPC is. If you are expecting to bottleneck on disk seeks (judging again by your random-read comment), I would say you can completely ignore this. When I say it's a special case, I mean that you're adding between 0 and 1 units of RPC overhead (on average); no matter how large your cluster is, your RPC overhead won't exceed 1, with 1 being whatever the cost is to forward a request+response.

> On a similar note (read perf), I'm guessing that reading at a weak
> consistency level will bring gains. Gleaned from this slide amongst
> other places:
>
> http://www.slideshare.net/mobile/benjaminblack/introduction-to-cassandra-replication-and-consistency#13
>
> Is this true, or will read repair still hammer the disks on all the
> machines with the data on them? Again, I guess it's better to have a low
> RF so there are fewer copies of the data to inspect when doing read
> repair. Will this result in better read performance?

Sorry, I did the impolite thing and began responding before having read your entire e-mail ;) So yes, a low RF would increase read performance, but assuming you care about data redundancy, the better way to achieve that effect is probably to decrease or disable read repair.

--
/ Peter Schuller
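(For concreteness, the "chance of hitting the right node" above is just RF/N: at 12 nodes and RF=4, a read sent to a random node lands on a replica 4/12 = 1/3 of the time; at 24 nodes, only 4/24 = 1/6 of the time. The miss case still costs at most that single forwarded request+response, so it stays negligible next to disk seeks.)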
Re: Advice on settings
If you are updating columns quite rapidly, you will scatter the columns over many sstables as you update them over time. This means that a read of a specific column will require looking at more sstables to find the data. Performing a compaction (using nodetool) will merge the sstables into one, making your reads more performant. Of course, the more columns, the more scattering around, the more I/O.

To your point about "sharing the data around": adding more machines is always a good thing to spread the load - you add RAM, CPU, and persistent storage to the cluster. There is probably some point where enough machines creates a lot of network traffic, but 10 or 20 machines shouldn't be an issue. Don't worry about trying to hit a node that has the data unless your machines are connected across slow network links.

On 10/07/2010 12:48 AM, Dave Gardner wrote:
> Hi all,
>
> We're rolling out a Cassandra cluster on EC2 and I've got a couple of
> questions about settings. [...]
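For reference, the manual major compaction mentioned above is triggered per node via nodetool; a sketch with a placeholder host (the exact flag spelling varies slightly between 0.6 and 0.7, and you may also need to pass the JMX port):

    bin/nodetool -h <node-address> compact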
Re: Advice on settings
Also, as a note related to EC2: choose whether you want to be in multiple availability zones. The highest performance possible is to be in a single AZ, as all those machines will have *very* high-speed interconnects. But individual AZs can also suffer outages. You can distribute your instances across, say, 2 AZs, and then use a RackAwareStrategy to force replication to put at least 1 copy of the data into the other AZ.

Also, it's easiest to stay within a single Region (in EC2-speak). This allows you to use the internal IP addresses for Gossip and Thrift connections - which means you do not pay inbound/outbound fees for the data xfer.

HTH,
Dave Viner

On Thu, Oct 7, 2010 at 10:26 AM, B. Todd Burruss wrote:
> If you are updating columns quite rapidly, you will scatter the columns
> over many sstables as you update them over time. [...]
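As a sketch of what Dave describes, assuming a 0.6-style storage-conf.xml (abbreviated; keyspace name and RF are placeholders, and what counts as a "rack" or "data center" - here, an AZ - depends on the snitch you configure; in 0.7 the placement strategy is set per keyspace in the schema instead):

    <Keyspace Name="MyKeyspace">
      <!-- Puts one replica in a different "data center" (the other AZ) -->
      <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackAwareStrategy</ReplicaPlacementStrategy>
      <ReplicationFactor>4</ReplicationFactor>
    </Keyspace>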
Heap Settings suggestions
From the Cassandra documentation @ riptano I see the following recommendation for the heap size setting:

    MemtableThroughputInMB * 3 * (number of ColumnFamilies) + 1G + (size of internal caches)

What if there is more than one keyspace in the system? Assuming each keyspace has the same number of column families, can I linearly scale the above recommendation to the number of keyspaces in the system? I.e., if "X" is the heap size for a single keyspace and there are "Y" keyspaces, is it recommended to allocate "X*Y" as the max heap size? Please let me know.

Thanks
Kannan

PS: Thanks a lot for the documentation and recommendations.
Re: Heap Settings suggestions
> What if there is more than one keyspace in the system? Assuming each
> keyspace has the same number of column families, can I linearly scale the
> above recommendation to the number of keyspaces in the system? I.e., if
> "X" is the heap size for a single keyspace and there are "Y" keyspaces,
> is it recommended to allocate "X*Y" as the max heap size?

Yes. Each column family will have a memtable subject to the configured memory constraints; whether or not they are in different keyspaces does not matter.

--
/ Peter Schuller
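As a worked example of the formula (assuming the default MemtableThroughputInMB of 64 and leaving out cache sizes): two keyspaces with four column families each gives 8 memtables total, so 64 MB * 3 * 8 + 1024 MB = 2560 MB, i.e. roughly a 2.5 GB max heap before accounting for the row and key caches.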
Re: Tuning cassandra to use less memory
> The nodes are still swapping, even though the swappiness is set to zero
> right now. After swapping comes the OOM.

In addition to what's already been said, consider just flat-out disabling swap completely, unless you have other things on the machine that cause swap to be significantly useful (i.e., lots of truly unused stuff that is good to keep swapped out).

--
/ Peter Schuller
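A minimal sketch of doing that on Linux (assumes you want no swap on the box at all):

    sudo swapoff -a    # disable all active swap immediately
    # then comment out the swap line(s) in /etc/fstab so it stays
    # disabled across reboots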
Re: Dazed and confused with Cassandra on EC2 ...
> There's some words on the 'Net - the recent pages on Riptano's site in
> fact - that strongly encourage scaling left and right, rather than
> beefing up the boxes - and certainly we're seeing far less bother from
> GC using a much smaller heap - previously we'd been going up to 16GB, or
> even higher. This is based on my previous positive experiences of
> getting better performance from memory-hog apps (e.g. Java) by giving
> them more memory. In any case, it seems that using large amounts of
> memory on EC2 is just asking for trouble.

Keep in mind that while GC tends to be more efficient with larger heap sizes, that does not always translate into better overall performance when other things have to be considered. In particular, in the case of Cassandra, if you "waste" 10-15 gigs of RAM on the JVM heap for a Cassandra instance which could live with e.g. 1 GB, you're actively taking away those 10-15 gigs of RAM from the operating system to use for the buffer cache. Particularly if you're I/O bound on reads, this could have very detrimental effects (assuming the data set is sufficiently small and locality is such that 15 GB of extra buffer cache makes a difference; usually, but not always, this is the case).

So with Cassandra, in the general case, you definitely want to keep your heap size reasonable in relation to the actual live set (the amount of actually reachable data), rather than just cranking it up as much as possible.

(The main issue here is keeping it high enough to not OOM, given that exact memory demands are hard to predict; it would be absolutely great if the JVM were better at maintaining a reasonable heap-size-to-live-set ratio so that much less tweaking of heap sizes was necessary, but this is not the case.)

--
/ Peter Schuller
Retrieving dead node's token from system keyspace
Hey all,

I had a node go down that I'm not able to get a token for from nodetool ring.

The wiki says:

"You can obtain the dead node's token by running nodetool ring on any live node, unless there was some kind of outage, and the others came up but not the down one -- in that case, you can retrieve the token from the live nodes' system tables."

But I can't for the life of me figure out how to get the system keyspace to give up the secret. All attempts end up in:

    ERROR [pool-1-thread-2] 2010-10-07 21:20:44,865 Cassandra.java (line 1280) Internal error processing get_slice
    java.lang.RuntimeException: No replica strategy configured for system

Can someone point me at a good way to get the token?

Thanks
-Allan
Re: Retrieving dead node's token from system keyspace
I was able to figure out how to use the sstable2json tool to get the values out of the system keyspace.

Unfortunately, the node that went down took all of its data with it, and I only have access to the system keyspace of the remaining live node. There were only two nodes, and the one left should have a whole DB copy.

Running removetoken on any of the values that appeared to be tokens in the LocationInfo cf hasn't done any good. Perhaps I'm missing which value is the token of the dead node? Or, is there a way to take down the last node and bring back up a new cluster using the sstables that I have on the remaining node?

-Allan

On Oct 7, 2010, at 3:22 PM, Allan Carroll wrote:
> Hey all,
>
> I had a node go down that I'm not able to get a token for from nodetool
> ring. [...]
Cassandra and EC2 performance testing
I recently posted a blog article about Cassandra and EC2 performance testing for small vs. large instances and EBS vs. ephemeral storage, compared to real HW with and without an SSD. Hope people find it interesting.

http://www.coreyhulen.org/?p=326

Highlights:

- The variance in test results from run to run on EC2's virtual hardware fluctuates A LOT.
- EC2 is a finicky beast, but we like it.
- Not all EC2 instances (of the same size, e.g. small) are created equal.
- Large instances are not 4x as fast as small instances (even though they are 4x the price).
- Kind of obvious, but real hardware is better... and yea, SSDs kick butt.
- Automated scripts included. Please have at it and reproduce the results with different configurations.

Thanks,
-Corey
Re: Heap Settings suggestions
Keep in mind that .7 and on will have per-CF settings for most things, so there will be even more control over the tuning...

On Oct 7, 2010 3:10 PM, "Peter Schuller" wrote:
> Yes. Each column family will have a memtable subject to the configured
> memory constraints; whether or not they are in different keyspaces does
> not matter. [...]
Re: Tuning cassandra to use less memory
+1 on disabling swap

On Oct 7, 2010 3:27 PM, "Peter Schuller" wrote:
> In addition to what's already been said, consider just flat-out disabling
> swap completely, unless you have other things on the machine that cause
> swap to be significantly useful. [...]
RE: Newbie Question about restarting Cassandra
Are there any data loss concerns if you have the commit log sync set to periodic and are writing with CL ONE or ANY?

From: Matthew Dennis [mailto:mden...@riptano.com]
Sent: Wednesday, October 06, 2010 8:53 PM
To: user@cassandra.apache.org
Subject: Re: Newbie Question about restarting Cassandra

Rob is correct.

drain is really only there for when you need the commit log to be empty (some upgrades, or a complete backup of a shutdown cluster). There really is no point in using it to shut down C* normally; just kill it...

On Wed, Oct 6, 2010 at 4:18 PM, Rob Coli wrote:
> On 10/6/10 1:13 PM, Aaron Morton wrote:
>> To shut down cleanly, say in a production system, use nodetool drain
>> first. This will flush the memtables and put the node into a read-only
>> mode. AFAIK this also gives the other nodes a faster way of detecting
>> that the node is down, via the drained node gossiping its new status.
>> Then kill.
>
> FWIW, the gossiper-related code for "drain" (trunk) looks like it just
> stops the gossip service, which is almost certainly the same thing that
> happens if you kill Cassandra.
>
> ./src/java/org/apache/cassandra/service/StorageService.java
>
>     public synchronized void drain() throws IOException,
>             InterruptedException, ExecutionException
>     ...
>         setMode("Starting drain process", true);
>         Gossiper.instance.stop();
>
> ./src/java/org/apache/cassandra/gms/Gossiper.java
>
>     public void stop()
>     {
>         scheduledGossipTask.cancel(false);
>     }
>
> =Rob

--
Riptano
Software and Support for Apache Cassandra
http://www.riptano.com/
mden...@riptano.com
m: 512.587.0900 f: 866.583.2068
Re: Retrieving dead node's token from system keyspace
Allan,

I'm a bit confused about what you are trying to do here. You have 2 nodes with RF = ?, you lost one node completely, and now you want to...

Just get a cluster running again, don't worry about the data.
OR
Restore the data from the dead node.
OR
Create a cluster with the data from the remaining node and a new node.

Aaron

On 08 Oct, 2010, at 11:15 AM, Allan Carroll wrote:
> I was able to figure out how to use the sstable2json tool to get the
> values out of the system keyspace. [...]
Re: Newbie Question about restarting Cassandra
Yes. You probably shouldn't ever be using CL.ANY (though I'm certain there are others that disagree with me; I wish them the best of luck with that).

CL.ONE + periodic sync can potentially lose recently written data, but if you care about that then you'd better care enough about your data to use something greater than CL.ONE. With CL.ONE + periodic: if your disk dies you lose data. If your OS crashes on a node you lose data. If the processor melts you lose data. If your memory goes bad you lose data. If your UPS is interrupted you lose data. If X (for many values of X) you lose data. Note that for some situations (e.g. disk failure) it doesn't matter what the commit log sync (batch v. periodic) is set to; you lose data.

If your C* process dies and/or is killed you should not lose data. It's written to the commit log before the client is acked; however, that entry may not have made it to disk yet in the case of commitlog sync = periodic. So if you kill the C* process you're fine. If you nicely restart the OS, you should be fine (assuming your boxen/raid controllers/disks/etc do the sane thing). If you nuke your OS, then see above about losing data on CL.ONE.

On Thu, Oct 7, 2010 at 7:11 PM, David McIntosh wrote:
> Are there any data loss concerns if you have the commit log sync set to
> periodic and are writing with CL ONE or ANY? [...]

--
Riptano
Software and Support for Apache Cassandra
http://www.riptano.com/
mden...@riptano.com
m: 512.587.0900 f: 866.583.2068
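For reference, the setting being discussed lives in cassandra.yaml in 0.7; a sketch with illustrative values (check your version's defaults):

    commitlog_sync: periodic           # ack writes before the log is fsynced
    commitlog_sync_period_in_ms: 10000
    # the safer-but-slower alternative:
    # commitlog_sync: batch
    # commitlog_sync_batch_window_in_ms: 1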
Re: Retrieving dead node's token from system keyspace
Allan,

I'm confused about why removetoken doesn't do anything and would be interested in finding out why, but to answer your question:

You can shut down your last node, nuke the system directory (make a backup just in case), restart the node, load the schema (export it first if need be), and be on your way. You should end up with a node that is the only one in the ring.

Again, make a backup of the system directory (actually, might as well just back up the entire data and commitlog directories) before you start nuking stuff.

On Thu, Oct 7, 2010 at 7:12 PM, Aaron Morton wrote:
> Allan,
> I'm a bit confused about what you are trying to do here. [...]

--
Riptano
Software and Support for Apache Cassandra
http://www.riptano.com/
mden...@riptano.com
m: 512.587.0900 f: 866.583.2068
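A sketch of that procedure, assuming the default data layout under /var/lib/cassandra (paths and the schema re-load step will vary with your config):

    # with cassandra stopped; back everything up first
    cp -a /var/lib/cassandra/data /var/lib/cassandra/commitlog /some/backup/
    rm -rf /var/lib/cassandra/data/system
    # restart cassandra, then re-create/re-import your keyspace definitions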
Re: Heap Settings suggestions
Good point. Thanks to both of you for the replies.

Kannan

From: Matthew Dennis
To: user@cassandra.apache.org
Sent: Thu, October 7, 2010 4:59:28 PM
Subject: Re: Heap Settings suggestions

> Keep in mind that .7 and on will have per-CF settings for most things,
> so there will be even more control over the tuning... [...]
Re: Dazed and confused with Cassandra on EC2 ...
Also, in general, you probably want to set Xms = Xmx (regardless of the value you eventually decide on). If you set them equal, the JVM will just go ahead and allocate that amount on startup. If they're different, then when you grow above Xms it has to allocate more and move a bunch of stuff around. It may have to do this multiple times. Note that it does this at the worst time possible (i.e. under heavy load, which is likely what caused you to grow past Xms in the first place).

On Thu, Oct 7, 2010 at 2:49 PM, Peter Schuller wrote:
> Keep in mind that while GC tends to be more efficient with larger heap
> sizes, that does not always translate into better overall performance
> when other things have to be considered. [...]
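A sketch of pinning the heap as suggested: in 0.7 this is typically done via conf/cassandra-env.sh, which feeds the same value to both -Xms and -Xmx (4G is just an illustrative value; size it per the live-set discussion above):

    MAX_HEAP_SIZE="4G"
    # equivalent to starting the JVM with: -Xms4G -Xmx4G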
Out of Memory Issues - SERIOUS
There seems to have been a fair amount of discussion on memory-related issues, so I apologize if this exact situation has come up before.

I am currently in the process of load testing a metrics platform I have written which uses Cassandra, and I have run into some very troubling issues. The application is writing quite heavily, about 1000-2000 updates (columns) per second using batch mutates of 20 columns each. This is divided between creating new rows and adding columns to a fairly limited number of existing index rows (<30). Nearly all of these updates are read within 10 seconds, and none contain any significant amount of data (generally much less than 100 bytes of data which I specify). Initially, the test hums along nicely, but after some amount of time (1-2 hours) Cassandra crashes with an out of memory error. Unfortunately I have not had the opportunity to watch the test as it crashes, but it has happened in 2/2 tests.

This is quite annoying, but the absolutely TERRIFYING behaviour is that when I restart Cassandra, it starts replaying the commit logs, then crashes with an out of memory error again. Restart a second time, crash with OOM; it seems to get through about 3/4 of the commit logs. Just to be absolutely explicit, I am not trying to insert or read at this point, just recover the previous updates. Unless somebody can suggest a way to recover the commit logs, I have effectively lost my data. The only way I have found to recover is to wipe the data directories. It does not matter right now given that it is only a test, but this behaviour is completely unacceptable for a production system.

Here is information about the system which is probably relevant. Let me know if any additional details about my application would help sort out this issue:

- Cassandra 0.7 Beta2
- DB machine: EC2 m1.large with the commit log directory on an EBS volume and the data directory on ephemeral storage.
- OS: Ubuntu Server 10.04
- With the exception of changing JMX settings, no memory or JVM changes were made to the options in cassandra-env.sh
- In cassandra.yaml, I reduced binary_memtable_throughput_in_mb to 100 in my second test to try to follow the heap memory calculation formula; I have 8 column families.
- I am using the Sun JVM, specifically "build 1.6.0_20-b02"
- The app is written in Java and I am using the latest Pelops library. I am sending updates at consistency level ONE and reading them at level ALL.

I have been fairly impressed with Cassandra overall, and given that I am using a beta version, I don't expect fully polished behaviour. What is unacceptable, and quite frankly nearly unbelievable, is the fact that Cassandra can't seem to recover from the error and I am losing data.

Dan Hendry
Re: Out of Memory Issues - SERIOUS
If you don't want to lose data, don't wipe your commit logs. That part seems pretty obvious to me. :)

Cassandra aggressively logs its state when it is running out of memory so you can troubleshoot; look for the GCInspector lines in the log.

But in this case it sounds pretty simple: you will be able to finish replaying the commit logs if you lower your memtable thresholds or, alternatively, increase the amount of memory given to the JVM. (See http://wiki.apache.org/cassandra/MemtableSSTable.)

The _binary_ memtable setting has no effect on commitlog replay (it has no effect on anything but binary writes through the StorageProxy API, which you are not using); you need to adjust memtable_throughput_in_mb and memtable_operations_in_millions. If you haven't explicitly set these, then Cassandra will guess based on your heap size; here, it is guessing too high. Start by uncommenting the settings in the .yaml and reduce by 50% until it works.

Alternatively, apply the patch at https://issues.apache.org/jira/browse/CASSANDRA-1595 to see what Cassandra is guessing, and start at half of that.

On Thu, Oct 7, 2010 at 10:32 PM, Dan Hendry wrote:
> There seems to have been a fair amount of discussion on memory-related
> issues, so I apologize if this exact situation has come up before. [...]

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
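A sketch of the kind of reduction Jonathan describes, with illustrative numbers (these are the settings to uncomment in beta2's cassandra.yaml; halve again if replay still OOMs):

    memtable_throughput_in_mb: 32          # flush after this much data
    memtable_operations_in_millions: 0.15  # or after this many columns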