Re: Cassandra compression not working?
I forgot to mention we are running Cassandra 1.1.2. Thanks, -Mike On Sep 24, 2012, at 5:00 PM, Michael Theroux wrote: > Hello, > > We are running into an unusual situation that I'm wondering if anyone has any > insight on. We've been running a Cassandra cluster for some time, with > compression enabled on one column family in which text documents are stored. > We enabled compression on the column family, utilizing the SnappyCompressor > and a 64k chunk length. > > It was recently discovered that Cassandra was reporting a compression ratio > of 0. I took a snapshot of the data and started a Cassandra node in > isolation to investigate. > > Running nodetool scrub or nodetool upgradesstables had little impact on the > amount of data that was being stored. > > I then disabled compression and ran nodetool upgradesstables on the column > family. Again, no impact on the data size stored. > > I then re-enabled compression and ran nodetool upgradesstables on the column > family. This resulted in a 60% reduction in the data size stored, and > Cassandra reporting a compression ratio of about .38. > > Any idea what is going on here? Obviously I can go through this process in > production to enable compression; however, any idea what is currently > happening and why new data does not appear to be compressed? > > Any insights are appreciated, > Thanks, > -Mike
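For anyone who hits the same thing, the sequence that finally got the existing data compressed was essentially the following (cassandra-cli syntax from memory, and the keyspace/column family names here are placeholders -- double-check against `help update column family;` on your version):

    update column family documents with compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};
    nodetool upgradesstables open documents    # rewrite existing SSTables so they actually get compressed
    # then re-check the compression ratio wherever you saw the 0 reading (JMX / cfstats)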
Row caching + Wide row column family == almost crashed?
Hello, We recently hit an issue within our Cassandra based application. We have a relatively new Column Family with some very wide rows (tens of thousands of columns, or more in some cases). During a periodic activity, we query ranges of columns to retrieve various pieces of information, a segment at a time. We perform these same queries frequently at various stages of the process, and I thought the application could see a performance benefit from row caching. We have a small row cache (100MB per node) already enabled, and I enabled row caching on the new column family. The results were very negative. When performing range queries with a limit of 200 results, for a small minority of the rows in the new column family, performance plummeted. CPU utilization on the Cassandra node went through the roof, and it started chewing up memory. Some queries to this column family hung completely. According to the logs, we started getting frequent GCInspector messages. Cassandra started flushing the largest memtables due to hitting the "flush_largest_memtables_at" threshold of 75%, and scaling back the key/row caches. However, to Cassandra's credit, it did not die with an OutOfMemory error. Its emergency measures to conserve memory worked, and the cluster stayed up and running. No real errors showed in the logs, except for messages getting dropped, which I believe was caused by what was going on with CPU and memory. Disabling row caching on this new column family has resolved the issue for now, but is there something fundamental about row caching that I am missing? We are running Cassandra 1.1.2 with a 6 node cluster, with a replication factor of 3. Thanks, -Mike
Re: Row caching + Wide row column family == almost crashed?
Thanks for all the responses! On 12/3/2012 6:55 PM, Bill de hÓra wrote: A Cassandra JVM will generally not function well with caches and wide rows. Probably the most important thing to understand is Ed's point, that the row cache caches the entire row, not just the slice that was read out. What you've seen is almost exactly the observed behaviour I'd expect with enabling either cache provider over wide rows. - the on-heap cache will result in evictions that crush the JVM trying to manage garbage. This is also the case if the rows have an uneven size distribution (small rows can push out a single large row, large rows push out many small ones, etc.). - the off-heap cache will spend a lot of time serializing and deserializing wide rows, such that it can increase latency relative to just reading from disk and leveraging the filesystem's cache directly. The cache resizing behaviour does exist to preserve the server's memory, but it can also cause a death spiral in the on-heap case, because a relatively smaller cache may result in data being evicted more frequently. I've seen cases where sizing up the cache can stabilise a server's memory. This isn't just a Cassandra thing, it simply happens to be very evident with that system - generally, to get an effective benefit from a cache, the data should be consistently sized and not too large, to allow effective cache 'lining'. Bill On 02/12/12 21:36, Mike wrote: Hello, We recently hit an issue within our Cassandra based application. We have a relatively new Column Family with some very wide rows (tens of thousands of columns, or more in some cases). During a periodic activity, we query ranges of columns to retrieve various pieces of information, a segment at a time. We perform these same queries frequently at various stages of the process, and I thought the application could see a performance benefit from row caching. We have a small row cache (100MB per node) already enabled, and I enabled row caching on the new column family. The results were very negative. When performing range queries with a limit of 200 results, for a small minority of the rows in the new column family, performance plummeted. CPU utilization on the Cassandra node went through the roof, and it started chewing up memory. Some queries to this column family hung completely. According to the logs, we started getting frequent GCInspector messages. Cassandra started flushing the largest memtables due to hitting the "flush_largest_memtables_at" threshold of 75%, and scaling back the key/row caches. However, to Cassandra's credit, it did not die with an OutOfMemory error. Its emergency measures to conserve memory worked, and the cluster stayed up and running. No real errors showed in the logs, except for messages getting dropped, which I believe was caused by what was going on with CPU and memory. Disabling row caching on this new column family has resolved the issue for now, but is there something fundamental about row caching that I am missing? We are running Cassandra 1.1.2 with a 6 node cluster, with a replication factor of 3. Thanks, -Mike
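For reference, scoping the row cache change to just the wide-row column family (rather than the global cache) looks roughly like this on 1.1 -- cassandra-cli syntax from memory, and the CF name is a placeholder:

    update column family wide_rows with caching = 'keys_only';   # or 'none' to drop both caches for this CF
    nodetool info    # global key/row cache sizes and hit rates, to confirm the row cache drains out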
Diagnosing memory issues
StatusLogger.java (line 116) system.LocationInfo 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,705 StatusLogger.java (line 116) system.Versions 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,705 StatusLogger.java (line 116) system.schema_keyspaces 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,705 StatusLogger.java (line 116) system.Migrations 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,706 StatusLogger.java (line 116) system.schema_columnfamilies 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,706 StatusLogger.java (line 116) system.schema_columns 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,706 StatusLogger.java (line 116) system.HintsColumnFamily 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,706 StatusLogger.java (line 116) system.Schema 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java (line 116) open.comp 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java (line 116) open.bp 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java (line 116) open.bn 312832,47184787
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java (line 116) open.p 711,193201
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java (line 116) open.bid 273064,46316018
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,708 StatusLogger.java (line 116) open.rel 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,708 StatusLogger.java (line 116) open.images 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,708 StatusLogger.java (line 116) open.users 62287,86665510
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,708 StatusLogger.java (line 116) open.sessions 4710,13153051
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,709 StatusLogger.java (line 116) open.userIndices 4,1960
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,709 StatusLogger.java (line 116) open.caches 50,4813457
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,709 StatusLogger.java (line 116) open.content 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,710 StatusLogger.java (line 116) open.enrich 30,20793
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,744 StatusLogger.java (line 116) open.bt 1133,776831
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,863 StatusLogger.java (line 116) open.alias 253,163933
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,864 StatusLogger.java (line 116) open.bymsgid 249610,73075517
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,864 StatusLogger.java (line 116) open.rank 319956,70898417
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,865 StatusLogger.java (line 116) open.cmap 448,406193
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,865 StatusLogger.java (line 116) open.pmap 659,566220
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,865 StatusLogger.java (line 116) open.pict 50944,58659596
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,878 StatusLogger.java (line 116) open.w 0,0
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,879 StatusLogger.java (line 116) open.s 92395,46160381
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,879 StatusLogger.java (line 116) open.bymrel 136607,57780555
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,879 StatusLogger.java (line 116) open.m 26720,51150067
It's appreciated, Thanks, -Mike
Re: Diagnosing memory issues
Thank you for the response. Since the time of this question, we've identified a number of areas that needed improving and have helped things along quite a bit. To answer your question, we were seeing both ParNew and CMS. There were no errors in the log, and all the nodes have been up. However, we are seeing one interesting issue. We are running a 6 node cluster with a Replication Factor of 3. The nodes are pretty evenly balanced. All reads and writes to Cassandra use LOCAL_QUORUM consistency. We are seeing a very interesting problem from the JMX statistics. We discovered we had one column family with an extremely high and unexpected write count. The writes to this column family are done in conjunction with other writes to other column families such that their numbers should be roughly equivalent, but they are off by a factor of 10. We have yet to find anything in our code that could cause this discrepancy in numbers. What is really interesting is that we see this behavior on only 5 of the 6 nodes in our cluster. On 5 of the 6 nodes, we see statistics indicating we are writing too fast and this specific memtable is exceeding its 128MB limit, while this one other node seems to be handling the load OK (memtables stay within their limits). Given our replication factor, I'm not sure how this is possible. Any hints on what might be causing this additional load? Are there other activities in Cassandra that might account for this increased load on a single column family? Any insights would be appreciated, -Mike On 12/4/2012 3:33 PM, aaron morton wrote: For background, a discussion on estimating working set: http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html . You can also just look at the size of tenured heap after a CMS. Are you seeing lots of ParNew or CMS? GC activity is a result of configuration *and* workload. Look in your data model for wide rows, or long lived rows that get a lot of deletes, and look in your code for large reads / writes (e.g. sometimes we read 100,000 columns from a row). The number that really jumps out at me below is the number of Pending requests for the Messaging Service. 24,000+ pending requests. INFO [ScheduledTasks:1] 2012-12-04 09:00:37,702 StatusLogger.java (line 89) MessagingService n/a 24,229 Technically speaking that ain't right. The whole server looks unhappy. Are there any errors in the logs? Are all the nodes up? A very blunt approach is to reduce the in_memory_compaction_limit and the concurrent_compactors or compaction_throughput_mb_per_sec. This reduces the impact compaction and repair have on the system and may give you breathing space to look at other causes. Once you have a feel for what's going on you can turn them up. Hope that helps. A - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 5/12/2012, at 7:04 AM, Mike <mthero...@yahoo.com> wrote: Hello, Our Cassandra cluster has, relatively recently, started experiencing memory pressure that I am in the midst of diagnosing. Our system has uneven levels of traffic, relatively light during the day, but extremely heavy during some overnight processing. We have started getting a message: WARN [ScheduledTasks:1] 2012-12-04 09:08:58,579 GCInspector.java (line 145) Heap is 0.7520105072262254 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. 
Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically. I've started implementing some instrumentation to gather stats from JMX to determine what is happening. However, last night, the GCInspector was kind enough to log the information below. A couple of things jumped out at me. The maximum heap for Cassandra is 4GB. We are running Cassandra 1.1.2, on a 6 node cluster, with a replication factor of 3. All our queries use LOCAL_QUORUM consistency. Adding up the caches plus the memtable "data" in the trace below comes to under 600MB. The number that really jumps out at me below is the number of Pending requests for the Messaging Service: 24,000+ pending requests. Does this number represent the number of outstanding client requests that this node is processing? If so, does this mean we potentially have 24,000 responses being pulled into memory, thereby causing this memory issue? What else should I look at? INFO [ScheduledTasks:1] 2012-12-04 09:00:37,585 StatusLogger.java (line 57) Pool Name Active Pending Blocked INFO [ScheduledTasks:1] 2012-12-04 09:00:37,695 StatusLogger.java (line 72) ReadStage3266 0 INFO [ScheduledTasks:1] 2012-12-04 09:00:37,696 StatusLogger.java (line 72) RequestResponseStage 0
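(For reference, the same numbers can also be pulled on demand with nodetool rather than waiting for the StatusLogger; these subcommands exist in 1.1, while the MessagingService pending counts come from its own JMX MBean under org.apache.cassandra.net rather than tpstats:)

    nodetool -h localhost tpstats     # per-stage Active / Pending / Blocked counts
    nodetool -h localhost cfstats     # per-CF memtable ops/data and read/write latencies
    nodetool -h localhost info        # heap used/total plus key and row cache sizes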
Re: Read operations resulting in a write?
Thank you Aaron, this was very helpful. Could it be an issue that this optimization does not really take effect until the memtable with the hoisted data is flushed? In my simple example below, the same row is updated, and multiple selects of the same row will result in multiple writes to the memtable. It seems it may be possible (although unlikely) that, if you go from a write-mostly to a read-mostly scenario, you could get into a state where you are stuck rewriting to the same memtable, and the memtable is not flushed because it absorbs the over-writes. I can foresee this especially if you are reading the same rows repeatedly. I also noticed from the code paths that if row caching is enabled, this optimization will not occur. We made some changes this weekend to make this column family more suitable to row-caching and enabled row-caching with a small cache. Our initial results are that it seems to have corrected the write counts, and has increased performance quite a bit. However, are there any hidden gotchas there because this optimization is not occurring? https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions a "compaction is behind" problem. Any history on that? I couldn't find too much information on it. Thanks, -Mike On 12/16/2012 8:41 PM, aaron morton wrote: 1) Am I reading things correctly? Yes. If you do a read/slice by name and more than min-compaction-threshold SSTables were read, the data is re-written so that the next read uses fewer SSTables. 2) What is really happening here? Essentially minor compactions can occur between 4 and 32 memtable flushes. Looking through the code, this seems to only affect a couple of types of select statements (selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement will perform a write. Yup, only for reading a row where the column names are specified. Remember minor compaction when using Size Tiered Compaction (the default) works on buckets of the same size. Imagine a row that had been around for a while and had fragments in more than Min Compaction Threshold sstables. Say it is in 3 SSTables in the 2nd tier and 2 sstables in the 1st. So it takes (potentially) 5 SSTable reads. If this row is read it will get hoisted back up. But if the row is in only 1 SSTable in the 2nd tier and 2 in the 1st tier, it will not be hoisted. There are a few short circuits in the SliceByName read path. One of them is to end the search when we know that no other SSTables contain columns that should be considered. So if the 4 columns you read frequently are hoisted into the 1st bucket, your reads will get handled by that one bucket. It's not every select, just those that touched more than min compaction sstables. 3) Is this desired behavior? Is there something else I should be looking at that could be causing this behavior? Yes. https://issues.apache.org/jira/browse/CASSANDRA-2503 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/12/2012, at 12:58 PM, Michael Theroux <mthero...@yahoo.com> wrote: Hello, We have an unusual situation that I believe I've reproduced, at least temporarily, in a test environment. I also think I see where this issue is occurring in the code. We have a specific column family that is under heavy read and write load on a nightly basis. For the purposes of this description, I'll refer to this column family as "Bob". 
During this nightly processing, sometimes Bob is under very heavy write load, other times it is under very heavy read load. The application is such that when something is written to Bob, a write is made to one of two other tables. We've witnessed a situation where the write count on Bob far outstrips the write count on either of the other tables, by a factor of 3 to 10. This is based on the WriteCount available on the column family JMX MBean. We have not been able to find where in our code this is happening, and we have gone as far as tracing our CQL calls to determine that the relationship between Bob and the other tables is what we expect. I brought up a test node to experiment, and I see a situation where, when a "select" statement is executed, a write will occur. In my test, I perform the following (switching between nodetool and cqlsh):
update bob set 'about'='coworker' where key='';
nodetool flush
update bob set 'about'='coworker' where key='';
nodetool flush
update bob set 'about'='coworker' where key='';
nodetool flush
update bob set 'about'='coworker' where key='';
nodetool flush
update bob set 'about'='coworker' where key='';
nodetool flush
Then, for a period of
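One way to watch the hoisting writes without touching the application is to compare the Write Count before and after a pure read (cfstats field names as in 1.1; the CF name "bob" and the grep offsets are just what worked for our layout):

    nodetool cfstats | grep -A 15 "Column Family: bob"   # note the Write Count
    # run the same select again in cqlsh, with no application writes in flight,
    # then re-run the command; if Write Count climbs on a pure read, it's the
    # hoisting path described above rather than application traffic
    nodetool cfstats | grep -A 15 "Column Family: bob"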
Column Family migration/tombstones
Hello, We are undergoing a change to our internal data model that will result in the eventual deletion of over a hundred million rows from a Cassandra column family. From what I understand, this will result in the generation of tombstones, which will be cleaned up during compaction, after the gc_grace_period time (default: 10 days). A couple of questions: 1) As one can imagine, the index and bloom filter for this column family are large. Am I correct to assume that bloom filter and index space will not be reduced until after gc_grace_period? 2) If I manually run repair across the cluster, is there a process I can use to safely remove these tombstones before gc_grace period to free this memory sooner? 3) Any words of warning when undergoing this? We are running Cassandra 1.1.2 on a 6 node cluster and a Replication Factor of 3. We use LOCAL_QUORUM consistency for all operations. Thanks! -Mike
Re: Column Family migration/tombstones
A couple more questions. When these rows are deleted, tombstones will be created and stored in more recent sstables. Upon compaction of sstables, and after gc_grace_period, I presume Cassandra will have removed all traces of that row from disk. However, after deleting such a large amount of information, there is no guarantee that Cassandra will compact these two tables together, causing the data to be deleted (right?). Therefore, even after gc_grace_period, a large amount of space may still be used. Is there a way, other than a major compaction, to clean up all this old data? I assume a nodetool scrub will clean up old tombstones only if that row is not in another sstable? Do tombstones take up bloom filter space after gc_grace_period? -Mike On 1/2/2013 6:41 PM, aaron morton wrote: 1) As one can imagine, the index and bloom filter for this column family are large. Am I correct to assume that bloom filter and index space will not be reduced until after gc_grace_period? Yes. 2) If I manually run repair across the cluster, is there a process I can use to safely remove these tombstones before gc_grace period to free this memory sooner? There is nothing to specifically purge tombstones. You can temporarily reduce the gc_grace_seconds and then trigger compaction, either by reducing the min_compaction_threshold to 2 and doing a flush, or by kicking off a user defined compaction using the JMX interface. 3) Any words of warning when undergoing this? Make sure you have a good breakfast. (It's more general advice than Cassandra specific.) Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 30/12/2012, at 8:51 AM, Mike wrote: Hello, We are undergoing a change to our internal data model that will result in the eventual deletion of over a hundred million rows from a Cassandra column family. From what I understand, this will result in the generation of tombstones, which will be cleaned up during compaction, after the gc_grace_period time (default: 10 days). A couple of questions: 1) As one can imagine, the index and bloom filter for this column family are large. Am I correct to assume that bloom filter and index space will not be reduced until after gc_grace_period? 2) If I manually run repair across the cluster, is there a process I can use to safely remove these tombstones before gc_grace period to free this memory sooner? 3) Any words of warning when undergoing this? We are running Cassandra 1.1.2 on a 6 node cluster and a Replication Factor of 3. We use LOCAL_QUORUM consistency for all operations. Thanks! -Mike
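For what it's worth, a rough sketch of both approaches Aaron describes (cassandra-cli/nodetool syntax from memory; the keyspace, CF, and SSTable file names are placeholders, and the JMX operation signature varies between versions, so confirm it in jconsole first):

    update column family msgs with gc_grace = 3600 and min_compaction_threshold = 2;
    nodetool flush open msgs
    # wait for the resulting minor compactions, then restore the original values

    # or, a user defined compaction over specific SSTables via JMX (jmxterm is a third-party CLI):
    java -jar jmxterm.jar -l localhost:7199
    bean org.apache.cassandra.db:type=CompactionManager
    run forceUserDefinedCompaction open msgs-hf-1234-Data.db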
Re: Column Family migration/tombstones
Thanks Aaron, I appreciate it. It is my understanding that major compactions are not recommended because they will essentially create one massive SSTable that will not compact with any new SSTables for some time. I can see how this might be a performance concern in the general case, because any read operation would always require multiple disk reads across multiple SSTables. In addition, information in the new table will not be purged by subsequent tombstones until that table can be compacted. This might then require regular major compactions to be able to clear that data. Are there other performance considerations that I need to keep in mind? However, this might not be as much of an issue in our usecase. It just so happens that the data in this column family is changed very infrequently, except for deletes (as of recently, and these will now occur over time). In this case, I don't believe having data spread across the SSTables will be an issue, as either the data will have a tombstone (which causes Cassandra to stop looking at other SSTables), or that data will be in one SSTable. So I do not believe I/O will end up being an issue here. What may be an issue is cleaning out old data in the SSTable that will exist after a major compaction. However, this might not require major compactions to happen nearly as frequently as I've seen recommended (once every gc_grace period), or at all. With the new design, data will be deleted from this table after a number of days. Deletes against the remaining data after a major compaction might not get processed until the next major compaction, but any deletes against new data should be handled normally through minor compactions. In addition, the remaining data after we complete the migration should be fairly small (about 500,000 skinny rows per node, including replicas). Any other thoughts on this? -Mike On 1/6/2013 3:49 PM, aaron morton wrote: When these rows are deleted, tombstones will be created and stored in more recent sstables. Upon compaction of sstables, and after gc_grace_period, I presume cassandra will have removed all traces of that row from disk. Yes. When using Size Tiered compaction (the default) tombstones are purged when all fragments of a row are included in a compaction. So if you have rows which are written to for A Very Long Time(™) it can take a while for everything to get purged. In the normal case though it's not a concern. However, after deleting such a large amount of information, there is no guarantee that Cassandra will compact these two tables together, causing the data to be deleted (right?). Therefore, even after gc_grace_period, a large amount of space may still be used. In the normal case this is not really an issue. In your case things sound a little non normal. If you will have only a few hundred MB's, or a few GB's, of data left in the CF I would consider running a major compaction on it. Major compaction will work on all SSTables and create one big SSTable; this will ensure all deleted data is purged. We normally caution against this as the one new file is often very big and will not get compacted for a while. However if you are deleting lots-o-data it may work. (There is also an anti compaction script around that may be of use.) Another alternative is to compact some of the older sstables with newer ones via User Defined Compaction with JMX. Is there a way, other than a major compaction, to clean up all this old data? I assume a nodetool scrub will clean up old tombstones only if that row is not in another sstable? 
I don't think scrub (or upgradesstables) remove tombstones. Do tombstones take up bloomfilter space after gc_grace_period? Any row, regardless of the liveness of the columns, takes up bloom filter space (in -Filter.db). Once the row is removed it will no longer take up space. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 6/01/2013, at 6:44 AM, Mike wrote: A couple more questions. When these rows are deleted, tombstones will be created and stored in more recent sstables. Upon compaction of sstables, and after gc_grace_period, I presume cassandra will have removed all traces of that row from disk. However, after deleting such a large amount of information, there is no guarantee that Cassandra will compact these two tables together, causing the data to be deleted (right?). Therefore, even after gc_grace_period, a large amount of space may still be used. Is there a way, other than a major compaction, to clean up all this old data? I assume a nodetool scrub will cleanup old tombstones only if that row is not in another sstable? Do tombstones take up bloomfilter space after gc_grace_period? -Mike On 1/2/2013 6:41 PM, aaron morton wrote: 1) As one can imagine, the index and bloom filter for this column family is large. Am
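For what it's worth, the per-CF major compaction I'm considering is just the following (keyspace/CF names are placeholders for ours):

    nodetool compact open msgs
    nodetool cfstats | grep -A 3 "Column Family: msgs"   # SSTable count and live space afterwards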
Re: Column Family migration/tombstones
Thanks, Another related question. In the situation described below, where we have a row and a tombstone across more than one SSTable, and it would take a very long time for these SSTables to be compacted, are there two rows being tracked by bloomfilters (since there is a bloom filter per SSTable), or does Cassandra possibly do something more efficient? To extend the example, if I delete a 1,000,000 rows, and that SSTable containing 1,000,000 tombstones is not compacted with the other SSTables containing those rows, are bloomfilters accounting for 2,000,000 rows, or 1,000,000? This is more related to the current activities of deletion, as opposed to a major compaction (although the question is applicable to both). As we delete rows, will our bloomfilters grow? -Mike On 1/6/2013 3:49 PM, aaron morton wrote: When these rows are deleted, tombstones will be created and stored in more recent sstables. Upon compaction of sstables, and after gc_grace_period, I presume cassandra will have removed all traces of that row from disk. Yes. When using Size Tiered compaction (the default) tombstones are purged when all fragments of a row are included in a compaction. So if you have rows which are written to for A Very Long Time(™) it can take a while for everything to get purged. In the normal case though it's not a concern. However, after deleting such a large amount of information, there is no guarantee that Cassandra will compact these two tables together, causing the data to be deleted (right?). Therefore, even after gc_grace_period, a large amount of space may still be used. In the normal case this is not really an issue. In your case things sound a little non normal. If you will have only a few hundred MB's, or a few GB's, of data level in the CF I would consider running a major compaction on it. Major compaction will work on all SSTables and create one big SSTable, this will ensure all deleted data is deleted. We normally caution agains this as the one new file is often very big and will not get compacted for a while. However if you are deleting lots-o-data it may work. (There is also an anti compaction script around that may be of use.) Another alternative is to compact some of the older sstables with newer ones via User Defined Compaction with JMX. Is there a way, other than a major compaction, to clean up all this old data? I assume a nodetool scrub will cleanup old tombstones only if that row is not in another sstable? I don't think scrub (or upgradesstables) remove tombstones. Do tombstones take up bloomfilter space after gc_grace_period? Any row, regardless of the liveness of the columns, takes up bloom filter space (in -Filter.db). Once the row is removed it will no longer take up space. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 6/01/2013, at 6:44 AM, Mike wrote: A couple more questions. When these rows are deleted, tombstones will be created and stored in more recent sstables. Upon compaction of sstables, and after gc_grace_period, I presume cassandra will have removed all traces of that row from disk. However, after deleting such a large amount of information, there is no guarantee that Cassandra will compact these two tables together, causing the data to be deleted (right?). Therefore, even after gc_grace_period, a large amount of space may still be used. Is there a way, other than a major compaction, to clean up all this old data? 
I assume a nodetool scrub will cleanup old tombstones only if that row is not in another sstable? Do tombstones take up bloomfilter space after gc_grace_period? -Mike On 1/2/2013 6:41 PM, aaron morton wrote: 1) As one can imagine, the index and bloom filter for this column family is large. Am I correct to assume that bloom filter and index space will not be reduced until after gc_grace_period? Yes. 2) If I would manually run repair across a cluster, is there a process I can use to safely remove these tombstones before gc_grace period to free this memory sooner? There is nothing to specifically purge tombstones. You can temporarily reduce the gc_grace_seconds and then trigger compaction. Either by reducing the min_compaction_threshold to 2 and doing a flush. Or by kicking of a user defined compaction using the JMX interface. 3) Any words of warning when undergoing this? Make sure you have a good breakfast. (It's more general advice than Cassandra specific.) Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 30/12/2012, at 8:51 AM, Mike wrote: Hello, We are undergoing a change to our internal datamodel that will result in the eventual deletion of over a hundred million rows from a Cassandra column family. From what I understand, this will result in the generation of tombstones, which
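In the meantime I'm planning to just watch the bloom filter footprint directly as the deletes flow through (cfstats field names as in 1.1; the path assumes the default data layout and our keyspace/CF names):

    nodetool cfstats | grep -A 20 "Column Family: msgs" | grep "Bloom Filter"
    ls -lh /var/lib/cassandra/data/open/msgs/*-Filter.db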
Cassandra 1.1.2 -> 1.1.8 upgrade
Hello, We are looking to upgrade our Cassandra cluster from 1.1.2 -> 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra are supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone hit any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, are the JAR files the only thing that needs to change? Can everything else remain as-is? Thanks, -Mike
Re: Cassandra 1.1.2 -> 1.1.8 upgrade
Thanks for pointing that out. Given upgradesstables can only be run on a live node, does anyone know if there is a danger in having this node in the cluster while this is being performed? Also, can anyone confirm this only needs to be done on counter column families, or all column families (the former makes sense, I'm just making sure). -Mike On 1/16/2013 11:08 AM, Jason Wee wrote: Always check NEWS.txt. For instance, for Cassandra 1.1.3 you need to run nodetool upgradesstables if your CF has counters. On Wed, Jan 16, 2013 at 11:58 PM, Mike <mthero...@yahoo.com> wrote: Hello, We are looking to upgrade our Cassandra cluster from 1.1.2 -> 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra are supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone hit any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, are the JAR files the only thing that needs to change? Can everything else remain as-is? Thanks, -Mike
Cassandra flush spin?
Hello, We just hit a very odd issue in our Cassandra cluster. We are running Cassandra 1.1.2 in a 6 node cluster. We use a replication factor of 3, and all operations utilize LOCAL_QUORUM consistency. We noticed a large performance hit in our application's maintenance activities and I've been investigating. I discovered a node in the cluster that was flushing a memtable like crazy. It was flushing every 2-3 minutes, and had apparently been doing this for days. Typically, during this time of day, a flush would happen every 30 minutes or so.
alldb.sh "cat /var/log/cassandra/system.log | grep \"flushing high-traffic column family CFS(Keyspace='open', ColumnFamily='msgs')\" | grep 02-08 | wc -l"
[1] 18:41:04 [SUCCESS] db-1c-1 59
[2] 18:41:05 [SUCCESS] db-1c-2 48
[3] 18:41:05 [SUCCESS] db-1a-1 1206
[4] 18:41:05 [SUCCESS] db-1d-2 54
[5] 18:41:05 [SUCCESS] db-1a-2 56
[6] 18:41:05 [SUCCESS] db-1d-1 52
I restarted the database node, and, at least for now, the problem appears to have stopped. There are a number of things that don't make sense here. We use a replication factor of 3, so if this was being caused by our application, I would have expected 3 nodes in the cluster to have issues. Also, I would have expected the issue to continue once the node restarted. Another point of interest, and I'm wondering if it's exposed a bug, is that this node was recently converted to use ephemeral storage on EC2, and was restored from a snapshot. After the restore, a nodetool repair was run. However, the repair was going to run into some heavy activity for our application, and we canceled that validation compaction (2 of the 3 anti-entropy sessions had completed). The spin appears to have started at the start of the second session. Any hints? -Mike
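A few of the other things I checked while the spin was happening, in case anyone wants to compare (the grep pattern matches the 1.1 log format; the keyspace/CF are ours):

    grep "flushing high-traffic column family" /var/log/cassandra/system.log | tail
    nodetool cfstats | grep -A 8 "Column Family: msgs"   # memtable ops/data at that moment
    nodetool compactionstats                             # backlog created by the constant flushes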
Re: Cassandra 1.1.2 -> 1.1.8 upgrade
Thank you, Another question on this topic. Upgrading from 1.1.2->1.1.9 requires running upgradesstables, which will take many hours on our dataset (about 12 hours). For this upgrade, is it recommended that I: 1) Upgrade all the DB nodes to 1.1.9 first, then go around the ring and run a staggered upgrade of the sstables over a number of days. 2) Upgrade one node at a time, running the cluster in a mixed 1.1.2->1.1.9 configuration for a number of days. I would prefer #1, as with #2, streaming will not work until all the nodes are upgraded. I appreciate your thoughts, -Mike On 1/16/2013 11:08 AM, Jason Wee wrote: Always check NEWS.txt. For instance, for Cassandra 1.1.3 you need to run nodetool upgradesstables if your CF has counters. On Wed, Jan 16, 2013 at 11:58 PM, Mike <mthero...@yahoo.com> wrote: Hello, We are looking to upgrade our Cassandra cluster from 1.1.2 -> 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra are supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone hit any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, are the JAR files the only thing that needs to change? Can everything else remain as-is? Thanks, -Mike
Re: Cassandra 1.1.2 -> 1.1.8 upgrade
So upgradesstables is recommended as part of the upgrade to 1.1.3 if you are using counter columns. Also, there was a general recommendation (in another response to my question) to run upgradesstables because of: "upgradesstables always needs to be done between majors. While 1.1.2 -> 1.1.8 is not a major, due to an unforeseen bug in the conversion to microseconds you'll need to run upgradesstables." Is this referring to: https://issues.apache.org/jira/browse/CASSANDRA-4432 Does anyone know the impact of not running upgradesstables? Or possibly of not running it for several days? Thanks, -Mike On 2/10/2013 3:27 PM, aaron morton wrote: I would do #1. You can play with nodetool setcompactionthroughput to speed things up, but beware nothing comes for free. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 10/02/2013, at 6:40 AM, Mike <mthero...@yahoo.com> wrote: Thank you, Another question on this topic. Upgrading from 1.1.2->1.1.9 requires running upgradesstables, which will take many hours on our dataset (about 12 hours). For this upgrade, is it recommended that I: 1) Upgrade all the DB nodes to 1.1.9 first, then go around the ring and run a staggered upgrade of the sstables over a number of days. 2) Upgrade one node at a time, running the cluster in a mixed 1.1.2->1.1.9 configuration for a number of days. I would prefer #1, as with #2, streaming will not work until all the nodes are upgraded. I appreciate your thoughts, -Mike On 1/16/2013 11:08 AM, Jason Wee wrote: Always check NEWS.txt. For instance, for Cassandra 1.1.3 you need to run nodetool upgradesstables if your CF has counters. On Wed, Jan 16, 2013 at 11:58 PM, Mike <mthero...@yahoo.com> wrote: Hello, We are looking to upgrade our Cassandra cluster from 1.1.2 -> 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra are supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone hit any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, are the JAR files the only thing that needs to change? Can everything else remain as-is? Thanks, -Mike
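For reference, the per-node sequence I'm planning for option #1 (the package/service names are placeholders for however you install Cassandra):

    nodetool drain                     # flush memtables and stop accepting writes
    sudo service cassandra stop
    # install the 1.1.9 jars/packages; cassandra.yaml stays as-is
    sudo service cassandra start
    nodetool version                   # confirm the node came back on 1.1.9
    # later, staggered around the ring one node at a time:
    nodetool upgradesstables open      # many hours per node on our data set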
Size Tiered -> Leveled Compaction
Hello, I'm investigating the transition of some of our column families from Size Tiered -> Leveled Compaction. I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB node to investigate the transition. I successfully altered the column family, and I immediately noticed a large number (1000+) of pending compaction tasks appear, but no compactions get executed. I tried running "nodetool upgradesstables" on the column family, and the compaction tasks don't move. I also noticed no changes to the size and distribution of the existing SSTables. I then ran a major compaction on the column family. All pending compaction tasks got run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). Couple of questions: 1) Is a major compaction required to transition from size-tiered to leveled compaction? 2) Are major compactions as much of a concern for LeveledCompaction as they are for Size Tiered? All the documentation I found concerning transitioning from Size Tiered to Leveled compaction discusses the ALTER TABLE CQL command, but I haven't found too much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9. Thanks, -Mike
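For reference, the change amounts to something like the following (shown here in cassandra-cli syntax from memory rather than the CQL ALTER TABLE form; the CF name is a placeholder -- check `help update column family;` against 1.1.9):

    update column family msgs with compaction_strategy = 'LeveledCompactionStrategy'
        and compaction_strategy_options = {sstable_size_in_mb: 10};
    nodetool compactionstats   # this is where I watched the 1000+ pending tasks sit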
Unbalanced ring after upgrade!
Hello, We just upgraded from 1.1.2->1.1.9. We utilize the byte ordered partitioner (we generate our own hashes). We have not yet upgraded sstables. Before the upgrade, we had a balanced ring. After the upgrade, we see:
10.0.4.22 us-east 1a Up Normal 77.66 GB 0.04% Token(bytes[0001])
10.0.10.23 us-east 1d Up Normal 82.74 GB 0.04% Token(bytes[1555])
10.0.8.20 us-east 1c Up Normal 81.79 GB 0.04% Token(bytes[2aaa])
10.0.4.23 us-east 1a Up Normal 82.66 GB 33.84% Token(bytes[4000])
10.0.10.20 us-east 1d Up Normal 80.21 GB 67.51% Token(bytes[5554])
10.0.8.23 us-east 1c Up Normal 77.12 GB 99.89% Token(bytes[6aac])
10.0.4.21 us-east 1a Up Normal 81.38 GB 66.09% Token(bytes[8000])
10.0.10.24 us-east 1d Up Normal 83.43 GB 32.41% Token(bytes[9558])
10.0.8.21 us-east 1c Up Normal 84.42 GB 0.04% Token(bytes[aaa8])
10.0.4.25 us-east 1a Up Normal 80.06 GB 0.04% Token(bytes[c000])
10.0.10.21 us-east 1d Up Normal 83.57 GB 0.04% Token(bytes[d558])
10.0.8.24 us-east 1c Up Normal 90.74 GB 0.04% Token(bytes[eaa8])
Restarting a node essentially changes who owns 99% of the ring. Given we use an RF of 3, and LOCAL_QUORUM consistency for everything, and we are not seeing errors, something seems to be working correctly. Any idea what is going on above? Should I be alarmed? -Mike
Re: Unbalanced ring after upgrade!
Actually, doing a nodetool ring is always showing the current node as owning 99% of the ring. From db-1a-1:
Address DC Rack Status State Load Effective-Ownership Token
Token(bytes[eaa8])
10.0.4.22 us-east 1a Up Normal 77.72 GB 99.89% Token(bytes[0001])
10.0.10.23 us-east 1d Up Normal 82.74 GB 64.13% Token(bytes[1555])
10.0.8.20 us-east 1c Up Normal 81.79 GB 30.55% Token(bytes[2aaa])
10.0.4.23 us-east 1a Up Normal 82.66 GB 0.04% Token(bytes[4000])
10.0.10.20 us-east 1d Up Normal 80.21 GB 0.04% Token(bytes[5554])
10.0.8.23 us-east 1c Up Normal 77.07 GB 0.04% Token(bytes[6aac])
10.0.4.21 us-east 1a Up Normal 81.38 GB 0.04% Token(bytes[8000])
10.0.10.24 us-east 1d Up Normal 83.43 GB 0.04% Token(bytes[9558])
10.0.8.21 us-east 1c Up Normal 84.42 GB 0.04% Token(bytes[aaa8])
10.0.4.25 us-east 1a Up Normal 80.06 GB 0.04% Token(bytes[c000])
10.0.10.21 us-east 1d Up Normal 83.49 GB 35.80% Token(bytes[d558])
10.0.8.24 us-east 1c Up Normal 90.72 GB 69.37% Token(bytes[eaa8])
From db-1c-3:
Address DC Rack Status State Load Effective-Ownership Token
Token(bytes[eaa8])
10.0.4.22 us-east 1a Up Normal 77.72 GB 0.04% Token(bytes[0001])
10.0.10.23 us-east 1d Up Normal 82.78 GB 0.04% Token(bytes[1555])
10.0.8.20 us-east 1c Up Normal 81.79 GB 0.04% Token(bytes[2aaa])
10.0.4.23 us-east 1a Up Normal 82.66 GB 33.84% Token(bytes[4000])
10.0.10.20 us-east 1d Up Normal 80.21 GB 67.51% Token(bytes[5554])
10.0.8.23 us-east 1c Up Normal 77.07 GB 99.89% Token(bytes[6aac])
10.0.4.21 us-east 1a Up Normal 81.38 GB 66.09% Token(bytes[8000])
10.0.10.24 us-east 1d Up Normal 83.43 GB 32.41% Token(bytes[9558])
10.0.8.21 us-east 1c Up Normal 84.42 GB 0.04% Token(bytes[aaa8])
10.0.4.25 us-east 1a Up Normal 80.06 GB 0.04% Token(bytes[c000])
10.0.10.21 us-east 1d Up Normal 83.49 GB 0.04% Token(bytes[d558])
10.0.8.24 us-east 1c Up Normal 90.72 GB 0.04% Token(bytes[eaa8])
Any help would be appreciated, as if something is going drastically wrong we need to go back to backups and revert back to 1.1.2. Thanks, -Mike On 2/14/2013 8:32 AM, Mike wrote: Hello, We just upgraded from 1.1.2->1.1.9. We utilize the byte ordered partitioner (we generate our own hashes). We have not yet upgraded sstables. Before the upgrade, we had a balanced ring. After the upgrade, we see:
10.0.4.22 us-east 1a Up Normal 77.66 GB 0.04% Token(bytes[0001])
10.0.10.23 us-east 1d Up Normal 82.74 GB 0.04% Token(bytes[1555])
10.0.8.20 us-east 1c Up Normal 81.79 GB 0.04% Token(bytes[2aaa])
10.0.4.23 us-east 1a Up Normal 82.66 GB 33.84% Token(bytes[4000])
10.0.10.20 us-east 1d Up Normal 80.21 GB 67.51% Token(bytes[5554])
10.0.8.23 us-east 1c Up Normal 77.12 GB 99.89% Token(bytes[6aac])
10.0.4.21 us-east 1a Up Normal 81.38 GB 66.09% Token(bytes[8000])
10.0.10.24 us-east 1d Up Normal 83.43 GB 32.41% Token(bytes[9558])
10.0.8.21 us-east 1c Up Normal 84.42 GB 0.04% Token(bytes[aaa8])
10.0.4.25 us-e
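For completeness, these are the checks I'm running from each node while comparing views (all standard 1.1 nodetool commands; substitute your own host names):

    nodetool -h <node> version      # make sure every node really is on 1.1.9
    nodetool -h <node> ring         # the per-node view shown above
    nodetool -h <node> gossipinfo   # what gossip thinks each endpoint's state and tokens are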
Re: Deletion consistency
If you increase the number of nodes to 3, with an RF of 3, then you should be able to read/delete utilizing a quorum consistency level, which I believe will help here. Also, make sure the clocks on your servers are in sync, utilizing NTP, as drifting time between your client and server could cause updates to be mistakenly dropped for being old. Also, make sure you are running with a gc_grace period that is high enough. The default is 10 days. Hope this helps, -Mike On 2/15/2013 1:13 PM, Víctor Hugo Oliveira Molinar wrote: Hello everyone! I have a column family filled with event objects which need to be processed by query threads. Once each thread queries for those objects (spread among columns below a row), it performs a delete operation for each object in Cassandra. This is done in order to ensure that these events won't be processed again. Some tests have shown me that it works, but sometimes I'm not getting those events deleted. I checked it through cassandra-cli, etc. So, reading http://wiki.apache.org/cassandra/DistributedDeletes, I came to the conclusion that I may be reading old data. My cluster is currently configured as: 2 nodes, RF 1, CL 1. In that case, what should I do? - Increase the consistency level for the write operations (in this case, the deletions), in order to ensure that those deletions are stored on all nodes. or - Increase the consistency level for the read operations, in order to ensure that I'm not reading back events that were already processed (deleted)? Thanks in advance
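For reference, the arithmetic behind the quorum suggestion:

    RF = 3, QUORUM = floor(RF/2) + 1 = 2
    deletes at QUORUM (W = 2) + reads at QUORUM (R = 2)  =>  R + W = 4 > RF = 3
    so every quorum read overlaps at least one replica that has already applied the delete.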
Re: Size Tiered -> Leveled Compaction
Another piece of information that would be useful is advice on how to properly set the SSTable size for your usecase. I understand the default is 5MB, a lot of examples show the use of 10MB, and I've seen cases where people have set is as high as 200MB. Any information is appreciated, -Mike On 2/14/2013 4:10 PM, Michael Theroux wrote: BTW, when I say "major compaction", I mean running the "nodetool compact" command (which does a major compaction for Sized Tiered Compaction). I didn't see the distribution of SSTables I expected until I ran that command, in the steps I described below. -Mike On Feb 14, 2013, at 3:51 PM, Wei Zhu wrote: I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 w/seconds for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30 G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared LCS is really slow in 1.1.X. System performance degrades during that time since reads could go to more SSTable, we see 20 SSTable lookup for one read.. (We tried everything we can and couldn't speed it up. I think it's single threaded and it's not recommended to turn on multithread compaction. We even tried that, it didn't help )There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works:) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write intensive, only 100 w/seconds. I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable, default is 5M which is kind of small for 200G (all in one CF) data set, and we are on SSD. It more than 150K files in one directory. (200G/5M = 40K SSTable and each SSTable creates 4 files on disk) You might want to watch that and decide the SSTable size. By the way, there is no concept of Major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory and it tells you the SSTable distribution among different levels. -Wei *From:* Charles Brophy mailto:cbro...@zulily.com>> *To:* user@cassandra.apache.org <mailto:user@cassandra.apache.org> *Sent:* Thursday, February 14, 2013 8:29 AM *Subject:* Re: Size Tiered -> Leveled Compaction I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well. If anybody here has the wisdom to answer them it would be of wonderful help. Thanks Charles On Wed, Feb 13, 2013 at 7:50 AM, Mike <mailto:mthero...@yahoo.com>> wrote: Hello, I'm investigating the transition of some of our column families from Size Tiered -> Leveled Compaction. I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB Node to investigate the transition. I successfully alter the column family, and I immediately noticed a large number (1000+) pending compaction tasks become available, but no compaction get executed. I tried running "nodetool sstableupgrade" on the column family, and the compaction tasks don't move. I also notice no changes to the size and distribution of the existing SSTables. I then run a major compaction on the column family. All pending compaction tasks get run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). Couple of questions: 1) Is a major compaction required to transition from size-tiered to leveled compaction? 
2) Are major compactions as much of a concern for LeveledCompaction as their are for Size Tiered? All the documentation I found concerning transitioning from Size Tiered to Level compaction discuss the alter table cql command, but I haven't found too much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9. Thanks, -Mike
Re: Size Tiered -> Leveled Compaction
Hello Wei, First thanks for this response. Out of curiosity, what SSTable size did you choose for your usecase, and what made you decide on that number? Thanks, -Mike On 2/14/2013 3:51 PM, Wei Zhu wrote: I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 w/seconds for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30 G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared LCS is really slow in 1.1.X. System performance degrades during that time since reads could go to more SSTable, we see 20 SSTable lookup for one read.. (We tried everything we can and couldn't speed it up. I think it's single threaded and it's not recommended to turn on multithread compaction. We even tried that, it didn't help )There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works:) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write intensive, only 100 w/seconds. I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable, default is 5M which is kind of small for 200G (all in one CF) data set, and we are on SSD. It more than 150K files in one directory. (200G/5M = 40K SSTable and each SSTable creates 4 files on disk) You might want to watch that and decide the SSTable size. By the way, there is no concept of Major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory and it tells you the SSTable distribution among different levels. -Wei *From:* Charles Brophy *To:* user@cassandra.apache.org *Sent:* Thursday, February 14, 2013 8:29 AM *Subject:* Re: Size Tiered -> Leveled Compaction I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well. If anybody here has the wisdom to answer them it would be of wonderful help. Thanks Charles On Wed, Feb 13, 2013 at 7:50 AM, Mike <mailto:mthero...@yahoo.com>> wrote: Hello, I'm investigating the transition of some of our column families from Size Tiered -> Leveled Compaction. I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB Node to investigate the transition. I successfully alter the column family, and I immediately noticed a large number (1000+) pending compaction tasks become available, but no compaction get executed. I tried running "nodetool sstableupgrade" on the column family, and the compaction tasks don't move. I also notice no changes to the size and distribution of the existing SSTables. I then run a major compaction on the column family. All pending compaction tasks get run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). Couple of questions: 1) Is a major compaction required to transition from size-tiered to leveled compaction? 2) Are major compactions as much of a concern for LeveledCompaction as their are for Size Tiered? All the documentation I found concerning transitioning from Size Tiered to Level compaction discuss the alter table cql command, but I haven't found too much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9. Thanks, -Mike
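In case it's useful to others, this is how I've been peeking at the level distribution Wei mentioned (the path assumes the default 1.1 data layout and our keyspace/CF names; adjust for your install):

    cat /var/lib/cassandra/data/open/msgs/msgs.json | python -mjson.tool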
Re: Size Tiered -> Leveled Compaction
Hello, Still doing research before we potentially move one of our column families from Size Tiered->Leveled compaction this weekend. I was doing some research around some of the bugs that were filed against leveled compaction in Cassandra and I found this: https://issues.apache.org/jira/browse/CASSANDRA-4644 The bug mentions: "You need to run the offline scrub (bin/sstablescrub) to fix the sstable overlapping problem from early 1.1 releases. (Running with -m to just check for overlaps between sstables should be fine, since you already scrubbed online which will catch out-of-order within an sstable.)" We recently upgraded from 1.1.2 to 1.1.9. Does anyone know if an offline scrub is recommended to be performed when switching from STCS->LCS after upgrading from 1.1.2? Any insight would be appreciated, Thanks, -Mike On 2/17/2013 8:57 PM, Wei Zhu wrote: We doubled the SStable size to 10M. It still generates a lot of SSTable and we don't see much difference of the read latency. We are able to finish the compactions after repair within serveral hours. We will increase the SSTable size again if we feel the number of SSTable hurts the performance. - Original Message - From: "Mike" To: user@cassandra.apache.org Sent: Sunday, February 17, 2013 4:50:40 AM Subject: Re: Size Tiered -> Leveled Compaction Hello Wei, First thanks for this response. Out of curiosity, what SSTable size did you choose for your usecase, and what made you decide on that number? Thanks, -Mike On 2/14/2013 3:51 PM, Wei Zhu wrote: I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 w/seconds for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30 G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared LCS is really slow in 1.1.X. System performance degrades during that time since reads could go to more SSTable, we see 20 SSTable lookup for one read.. (We tried everything we can and couldn't speed it up. I think it's single threaded and it's not recommended to turn on multithread compaction. We even tried that, it didn't help )There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works:) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write intensive, only 100 w/seconds. I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable, default is 5M which is kind of small for 200G (all in one CF) data set, and we are on SSD. It more than 150K files in one directory. (200G/5M = 40K SSTable and each SSTable creates 4 files on disk) You might want to watch that and decide the SSTable size. By the way, there is no concept of Major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory and it tells you the SSTable distribution among different levels. -Wei From: Charles Brophy To: user@cassandra.apache.org Sent: Thursday, February 14, 2013 8:29 AM Subject: Re: Size Tiered -> Leveled Compaction I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well. If anybody here has the wisdom to answer them it would be of wonderful help. Thanks Charles On Wed, Feb 13, 2013 at 7:50 AM, Mike < mthero...@yahoo.com > wrote: Hello, I'm investigating the transition of some of our column families from Size Tiered -> Leveled Compaction. 
I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB node to investigate the transition. I successfully altered the column family, and I immediately noticed a large number (1000+) of pending compaction tasks appear, but no compactions get executed. I tried running "nodetool upgradesstables" on the column family, and the compaction tasks don't move. I also noticed no changes to the size and distribution of the existing SSTables. I then ran a major compaction on the column family. All pending compaction tasks get run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). A couple of questions: 1) Is a major compaction required to transition from size-tiered to leveled compaction? 2) Are major compactions as much of a concern for LeveledCompaction as they are for Size Tiered? All the documentation I found concerning transitioning from Size Tiered to Leveled compaction discusses the ALTER TABLE CQL command, but I haven't found too much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9. Thanks, -Mike
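For reference, a minimal sketch of the change under discussion; the keyspace/CF names are placeholders. The overlap check from CASSANDRA-4644 is the offline tool quoted above, run with the node stopped (`bin/sstablescrub -m MyKeyspace MyCF`). The ALTER statement below uses the CQL3 map form from 1.2+; on 1.1.x the equivalent is expressed through the older compaction_strategy_class / compaction_strategy_options settings.

```
-- switch the CF to leveled compaction with a 10MB sstable target
-- (keyspace/table names are placeholders)
ALTER TABLE MyKeyspace.MyCF WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': '10'
};
```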
data type is object when metric instrument using Gauge?
Dear All, We are trying to monitor Cassandra using JMX. The monitoring tool we are using works fine for meters. However, if the metrics are collected using a gauge, the data type is Object, so our tool treats it as a string instead of a double. For example, for org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Capacity the type of the attribute (Value) is java.lang.Object. Is it possible to implement the data type of gauges as a numeric type instead of Object, or to work around it some other way, for example using a metrics reporter, etc.? Thanks a lot for any suggestions! Best Regards, Mike
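For what it's worth, a client can coerce the gauge value itself even though the attribute is declared as java.lang.Object. A minimal, hedged Java sketch follows; the host, port, and the assumption that this particular gauge's runtime value is numeric are mine, not taken from the thread.

```
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class KeyCacheCapacityProbe {
    public static void main(String[] args) throws Exception {
        // 7199 is Cassandra's usual JMX port; adjust host/port for your cluster.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName gauge = new ObjectName(
                    "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Capacity");
            // The attribute is declared as java.lang.Object; assuming the runtime
            // value is numeric, cast to Number and read it as a double.
            Object raw = mbs.getAttribute(gauge, "Value");
            double capacity = ((Number) raw).doubleValue();
            System.out.println("KeyCache capacity: " + capacity);
        } finally {
            connector.close();
        }
    }
}
```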
Re: Issue with leveled compaction and data migration
Thanks for the response Rob, And yes, the relevel helped the bloom filter issue quite a bit, although it took a couple of days for the relevel to complete on a single node (so if anyone tries this, be prepared). -Mike Sent from my iPhone On Sep 23, 2013, at 6:34 PM, Robert Coli wrote: > On Fri, Sep 13, 2013 at 4:27 AM, Michael Theroux wrote: >> Another question on [the topic of row fragmentation when old rows get a >> large append to their "end" resulting in larger-than-expected bloom filters]. >> >> Would forcing the table to relevel help this situation? I believe the >> process to do this on 1.1.X would be to stop cassandra, remove the .json file, >> and restart cassandra. Is this true? > > I believe forcing a re-level would help, because each row would appear in > fewer sstables and therefore fewer bloom filters. > > Yes, that is the process to re-level on Cassandra 1.1.x. > > =Rob
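For anyone finding this in the archives, a rough sketch of the 1.1.x re-level procedure described above; paths and keyspace/CF names are placeholders, and taking a snapshot first is assumed.

```
# drain and stop the node
nodetool drain
sudo service cassandra stop

# remove the LCS manifest for the column family; on restart Cassandra treats
# the sstables as level 0 and re-levels them (this can take days, as noted above)
rm /var/lib/cassandra/data/MyKeyspace/MyCF/MyCF.json

sudo service cassandra start
```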
high latency on one node after replacement
Hi There - I have noticed an issue where I consistently see high p999 read latency on a node for a few hours after replacing the node. Before replacing the node, the p999 read latency is ~30ms, but after it increases to 1-5s. I am running C* 3.11.2 in EC2. I am testing out using EBS snapshots of the /data disk as a backup, so that I can replace nodes without having to fully bootstrap the replacement. This seems to work ok, except for the latency issue. Some things I have noticed: - `nodetool netstats` doesn't show any 'Completed' Large Messages, only 'Dropped', while this is going on. There are only a few of these. - the logs show warnings like this: WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-03-27 18:57:15,655 NoSpamLogger.java:94 - Out of 84 commit log syncs over the past 297.28s with average duration of 235.88ms, 86 have exceeded the configured commit interval by an average of 113.66ms and I can see some slow queries in debug.log, but I can't figure out what is causing it - gc seems normal Could this have something to do with starting the node with the EBS snapshot of the /data directory? My first thought was that this is related to the EBS volumes, but it seems too consistent to be actually caused by that. The problem is consistent across multiple replacements, and multiple EC2 regions. I appreciate any suggestions! - Mike
Re: high latency on one node after replacement
thanks for pointing that out, i just found it too :) i overlooked this On Tue, Mar 27, 2018 at 3:44 PM, Voytek Jarnot wrote: > Have you ruled out EBS snapshot initialization issues ( > https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html)? > > On Tue, Mar 27, 2018 at 2:24 PM, Mike Torra wrote: > >> Hi There - >> >> I have noticed an issue where I consistently see high p999 read latency >> on a node for a few hours after replacing the node. Before replacing the >> node, the p999 read latency is ~30ms, but after it increases to 1-5s. I am >> running C* 3.11.2 in EC2. >> >> I am testing out using EBS snapshots of the /data disk as a backup, so >> that I can replace nodes without having to fully bootstrap the replacement. >> This seems to work ok, except for the latency issue. Some things I have >> noticed: >> >> - `nodetool netstats` doesn't show any 'Completed' Large Messages, only >> 'Dropped', while this is going on. There are only a few of these. >> - the logs show warnings like this: >> >> WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-03-27 18:57:15,655 >> NoSpamLogger.java:94 - Out of 84 commit log syncs over the past 297.28s >> with average duration of 235.88ms, 86 have exceeded the configured commit >> interval by an average of 113.66ms >> and I can see some slow queries in debug.log, but I can't figure out >> what is causing it >> - gc seems normal >> >> Could this have something to do with starting the node with the EBS >> snapshot of the /data directory? My first thought was that this is related >> to the EBS volumes, but it seems too consistent to be actually caused by >> that. The problem is consistent across multiple replacements, and multiple >> EC2 regions. >> >> I appreciate any suggestions! >> >> - Mike >> > >
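For reference, the initialization step from that AWS page can be scripted before starting Cassandra on the restored volume; a hedged sketch, with the device name as a placeholder.

```
# read every block of the snapshot-restored EBS volume once, so the
# first-touch penalty is paid up front rather than during live reads
sudo fio --filename=/dev/xvdf --rw=read --bs=1M --iodepth=32 \
    --ioengine=libaio --direct=1 --name=volume-initialize
```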
nodejs client can't connect to two nodes with different private ip addresses in different dcs
Hi Guys - I recently ran into a problem (for the 2nd time) where my nodejs app for some reason refuses to connect to one node in my C* cluster. I noticed that in both cases, the node that was not receiving any client connections had the same private ip as another node in the cluster, but in a different datacenter. That prompted me to poke around the client code a bit, and I think I found the problem: https://github.com/datastax/nodejs-driver/blob/master/lib/control-connection.js#L647 Since `endpoint` is the `rpc_address` of the node, if I'm reading this right, the client will silently ignore other nodes that happen to have the same private ip. The first time I had this problem, I simply removed the node from the cluster and added a new one, with a different private ip. Now that I suspect I have found the problem, I'm wondering if there is a simpler solution. I realize this is specific to the nodejs client, but I thought I'd see if anyone else here has run into this. It would be great if I could get the nodejs client to ignore nodes in the remote data centers. I've already tried adding this to the client config, but it doesn't resolve the problem: ``` pooling: { coreConnectionsPerHost: { [distance.local]: 2, [distance.remote]: 0 } } ``` Any suggestions? - Mike
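One possible server-side workaround, rather than patching the driver: advertise a unique client-facing address per node so that peer entries keyed on rpc_address no longer collide across DCs. A hedged cassandra.yaml sketch (the addresses are placeholders, and whether a public or elastic IP is reachable from your clients is an assumption):

```
# private address used for intra-cluster traffic (may be duplicated across DCs)
listen_address: 172.31.0.10
# bind client ports on all interfaces...
rpc_address: 0.0.0.0
# ...but advertise a globally unique address to drivers; this setting is
# required whenever rpc_address is 0.0.0.0
broadcast_rpc_address: 203.0.113.10
```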
TWCS sstables not dropping even though all data is expired
Hello - I have a 48 node C* cluster spread across 4 AWS regions with RF=3. A few months ago I started noticing disk usage on some nodes increasing consistently. At first I solved the problem by destroying the nodes and rebuilding them, but the problem returns. I did some more investigation recently, and this is what I found: - I narrowed the problem down to a CF that uses TWCS, by simply looking at disk space usage - in each region, 3 nodes have this problem of growing disk space (matches replication factor) - on each node, I tracked down the problem to a particular SSTable using `sstableexpiredblockers` - in the SSTable, using `sstabledump`, I found a row that does not have a ttl like the other rows, and appears to be from someone else on the team testing something and forgetting to include a ttl - all other rows show "expired: true" except this one, hence my suspicion - when I query for that particular partition key, I get no results - I tried deleting the row anyways, but that didn't seem to change anything - I also tried `nodetool scrub`, but that didn't help either Would this rogue row without a ttl explain the problem? If so, why? If not, does anyone have any other ideas? Why does the row show in `sstabledump` but not when I query for it? I appreciate any help or suggestions! - Mike
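For anyone who wants to retrace the investigation, a hedged sketch of the tools mentioned above; keyspace/table names and the sstable path are placeholders.

```
# find which sstable(s) are blocking fully expired sstables from being dropped
sstableexpiredblockers my_ks my_cf

# dump the suspect sstable and look for rows without a ttl / "expired" marker
sstabledump /var/lib/cassandra/data/my_ks/my_cf-<table_id>/mc-1234-big-Data.db > dump.json
```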
Re: TWCS sstables not dropping even though all data is expired
I'm pretty stumped by this, so here is some more detail if it helps. Here is what the suspicious partition looks like in the `sstabledump` output (some pii etc redacted): ``` { "partition" : { "key" : [ "some_user_id_value", "user_id", "demo-test" ], "position" : 210 }, "rows" : [ { "type" : "row", "position" : 1132, "clustering" : [ "2019-01-22 15:27:45.000Z" ], "liveness_info" : { "tstamp" : "2019-01-22T15:31:12.415081Z" }, "cells" : [ { "some": "data" } ] } ] } ``` And here is what every other partition looks like: ``` { "partition" : { "key" : [ "some_other_user_id", "user_id", "some_site_id" ], "position" : 1133 }, "rows" : [ { "type" : "row", "position" : 1234, "clustering" : [ "2019-01-22 17:59:35.547Z" ], "liveness_info" : { "tstamp" : "2019-01-22T17:59:35.708Z", "ttl" : 86400, "expires_at" : "2019-01-23T17:59:35Z", "expired" : true }, "cells" : [ { "name" : "activity_data", "deletion_info" : { "local_delete_time" : "2019-01-22T17:59:35Z" } } ] } ] } ``` As expected, almost all of the data except this one suspicious partition has a ttl and is already expired. But if a partition isn't expired and I see it in the sstable, why wouldn't I see it executing a CQL query against the CF? Why would this sstable be preventing so many other sstable's from getting cleaned up? On Tue, Apr 30, 2019 at 12:34 PM Mike Torra wrote: > Hello - > > I have a 48 node C* cluster spread across 4 AWS regions with RF=3. A few > months ago I started noticing disk usage on some nodes increasing > consistently. At first I solved the problem by destroying the nodes and > rebuilding them, but the problem returns. > > I did some more investigation recently, and this is what I found: > - I narrowed the problem down to a CF that uses TWCS, by simply looking at > disk space usage > - in each region, 3 nodes have this problem of growing disk space (matches > replication factor) > - on each node, I tracked down the problem to a particular SSTable using > `sstableexpiredblockers` > - in the SSTable, using `sstabledump`, I found a row that does not have a > ttl like the other rows, and appears to be from someone else on the team > testing something and forgetting to include a ttl > - all other rows show "expired: true" except this one, hence my suspicion > - when I query for that particular partition key, I get no results > - I tried deleting the row anyways, but that didn't seem to change anything > - I also tried `nodetool scrub`, but that didn't help either > > Would this rogue row without a ttl explain the problem? If so, why? If > not, does anyone have any other ideas? Why does the row show in > `sstabledump` but not when I query for it? > > I appreciate any help or suggestions! > > - Mike >
Re: TWCS sstables not dropping even though all data is expired
This does indeed seem to be a problem of overlapping sstables, but I don't understand why the data (and number of sstables) just continues to grow indefinitely. I also don't understand why this problem is only appearing on some nodes. Is it just a coincidence that the one rogue test row without a ttl is at the 'root' sstable causing the problem (ie, from the output of `sstableexpiredblockers`)? Running a full compaction via `nodetool compact` reclaims the disk space, but I'd like to figure out why this happened and prevent it. Understanding why this problem would be isolated the way it is (ie only one CF even though I have a few others that share a very similar schema, and only some nodes) seems like it will help me prevent it. On Thu, May 2, 2019 at 1:00 PM Paul Chandler wrote: > Hi Mike, > > It sounds like that record may have been deleted, if that is the case then > it would still be shown in this sstable, but the deleted tombstone record > would be in a later sstable. You can use nodetool getsstables to work out > which sstables contain the data. > > I recommend reading The Last Pickle post on this: > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html the sections > towards the bottom of this post may well explain why the sstable is not > being deleted. > > Thanks > > Paul > www.redshots.com > > On 2 May 2019, at 16:08, Mike Torra wrote: > > I'm pretty stumped by this, so here is some more detail if it helps. > > Here is what the suspicious partition looks like in the `sstabledump` > output (some pii etc redacted): > ``` > { > "partition" : { > "key" : [ "some_user_id_value", "user_id", "demo-test" ], > "position" : 210 > }, > "rows" : [ > { > "type" : "row", > "position" : 1132, > "clustering" : [ "2019-01-22 15:27:45.000Z" ], > "liveness_info" : { "tstamp" : "2019-01-22T15:31:12.415081Z" }, > "cells" : [ > { "some": "data" } > ] > } > ] > } > ``` > > And here is what every other partition looks like: > ``` > { > "partition" : { > "key" : [ "some_other_user_id", "user_id", "some_site_id" ], > "position" : 1133 > }, > "rows" : [ > { > "type" : "row", > "position" : 1234, > "clustering" : [ "2019-01-22 17:59:35.547Z" ], > "liveness_info" : { "tstamp" : "2019-01-22T17:59:35.708Z", "ttl" : > 86400, "expires_at" : "2019-01-23T17:59:35Z", "expired" : true }, > "cells" : [ > { "name" : "activity_data", "deletion_info" : { > "local_delete_time" : "2019-01-22T17:59:35Z" } > } > ] > } > ] > } > ``` > > As expected, almost all of the data except this one suspicious partition > has a ttl and is already expired. But if a partition isn't expired and I > see it in the sstable, why wouldn't I see it executing a CQL query against > the CF? Why would this sstable be preventing so many other sstable's from > getting cleaned up? > > On Tue, Apr 30, 2019 at 12:34 PM Mike Torra wrote: > >> Hello - >> >> I have a 48 node C* cluster spread across 4 AWS regions with RF=3. A few >> months ago I started noticing disk usage on some nodes increasing >> consistently. At first I solved the problem by destroying the nodes and >> rebuilding them, but the problem returns. 
>> >> I did some more investigation recently, and this is what I found: >> - I narrowed the problem down to a CF that uses TWCS, by simply looking >> at disk space usage >> - in each region, 3 nodes have this problem of growing disk space >> (matches replication factor) >> - on each node, I tracked down the problem to a particular SSTable using >> `sstableexpiredblockers` >> - in the SSTable, using `sstabledump`, I found a row that does not have a >> ttl like the other rows, and appears to be from someone else on the team >> testing something and forgetting to include a ttl >> - all other rows show "expired: true" except this one, hence my suspicion >> - when I query for that particular partition key, I get no results >> - I tried deleting the row anyways, but that didn't seem to change >> anything >> - I also tried `nodetool scrub`, but that didn't help either >> >> Would this rogue row without a ttl explain the problem? If so, why? If >> not, does anyone have any other ideas? Why does the row show in >> `sstabledump` but not when I query for it? >> >> I appreciate any help or suggestions! >> >> - Mike >> > >
Re: TWCS sstables not dropping even though all data is expired
Thx for the help Paul - there are definitely some details here I still don't fully understand, but this helped me resolve the problem and know what to look for in the future :) On Fri, May 3, 2019 at 12:44 PM Paul Chandler wrote: > Hi Mike, > > For TWCS the sstable can only be deleted when all the data has expired in > that sstable, but you had a record without a ttl in it, so that sstable > could never be deleted. > > That bit is straight forward, the next bit I remember reading somewhere > but can’t find it at the moment to confirm my thinking. > > An sstable can only be deleted if it is the earliest sstable. I think this > is due to the fact that deleting later sstables may expose old versions of > the data stored in the stuck sstable which had been superseded. For > example, if there was a tombstone in a later sstable for the non TTLed > record causing the problem in this instance. Then deleting that sstable > would cause that deleted data to reappear. (Someone please correct me if I > have this wrong) > > Because sstables in different time buckets are never compacted together, > this problem only goes away when you did the major compaction. > > This would happen on all replicas of the data, hence the reason you this > problem on 3 nodes. > > Thanks > > Paul > www.redshots.com > > On 3 May 2019, at 15:35, Mike Torra wrote: > > This does indeed seem to be a problem of overlapping sstables, but I don't > understand why the data (and number of sstables) just continues to grow > indefinitely. I also don't understand why this problem is only appearing on > some nodes. Is it just a coincidence that the one rogue test row without a > ttl is at the 'root' sstable causing the problem (ie, from the output of > `sstableexpiredblockers`)? > > Running a full compaction via `nodetool compact` reclaims the disk space, > but I'd like to figure out why this happened and prevent it. Understanding > why this problem would be isolated the way it is (ie only one CF even > though I have a few others that share a very similar schema, and only some > nodes) seems like it will help me prevent it. > > > On Thu, May 2, 2019 at 1:00 PM Paul Chandler wrote: > >> Hi Mike, >> >> It sounds like that record may have been deleted, if that is the case >> then it would still be shown in this sstable, but the deleted tombstone >> record would be in a later sstable. You can use nodetool getsstables to >> work out which sstables contain the data. >> >> I recommend reading The Last Pickle post on this: >> http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html the sections >> towards the bottom of this post may well explain why the sstable is not >> being deleted. >> >> Thanks >> >> Paul >> www.redshots.com >> >> On 2 May 2019, at 16:08, Mike Torra >> wrote: >> >> I'm pretty stumped by this, so here is some more detail if it helps. 
>> >> Here is what the suspicious partition looks like in the `sstabledump` >> output (some pii etc redacted): >> ``` >> { >> "partition" : { >> "key" : [ "some_user_id_value", "user_id", "demo-test" ], >> "position" : 210 >> }, >> "rows" : [ >> { >> "type" : "row", >> "position" : 1132, >> "clustering" : [ "2019-01-22 15:27:45.000Z" ], >> "liveness_info" : { "tstamp" : "2019-01-22T15:31:12.415081Z" }, >> "cells" : [ >> { "some": "data" } >> ] >> } >> ] >> } >> ``` >> >> And here is what every other partition looks like: >> ``` >> { >> "partition" : { >> "key" : [ "some_other_user_id", "user_id", "some_site_id" ], >> "position" : 1133 >> }, >> "rows" : [ >> { >> "type" : "row", >> "position" : 1234, >> "clustering" : [ "2019-01-22 17:59:35.547Z" ], >> "liveness_info" : { "tstamp" : "2019-01-22T17:59:35.708Z", "ttl" >> : 86400, "expires_at" : "2019-01-23T17:59:35Z", "expired" : true }, >> "cells" : [ >> { "name" : "activity_data", "deletion_info" : { >> "local_delete_time" : "2019-01
Re: TWCS sstables not dropping even though all data is expired
Compaction settings: ``` compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', 'max_threshold': '32', 'min_threshold': '4'} ``` read_repair_chance is 0, and I don't do any repairs because (normally) everything has a ttl. It does seem like Jeff is right that a manual insert/update without a ttl is what caused this, so I know how to resolve it and prevent it from happening again. Thx again for all the help guys, I appreciate it! On Fri, May 3, 2019 at 11:21 PM Jeff Jirsa wrote: > Repairs work fine with TWCS, but having a non-expiring row will prevent > tombstones in newer sstables from being purged > > I suspect someone did a manual insert/update without a ttl and that > effectively blocks all other expiring cells from being purged. > > -- > Jeff Jirsa > > > On May 3, 2019, at 7:57 PM, Nick Hatfield > wrote: > > Hi Mike, > > > > If you will, share your compaction settings. More than likely, your issue > is from 1 of 2 reasons: > 1. You have read repair chance set to anything other than 0 > > 2. You’re running repairs on the TWCS CF > > > > Or both…. > > > > *From:* Mike Torra [mailto:mto...@salesforce.com.INVALID > ] > *Sent:* Friday, May 03, 2019 3:00 PM > *To:* user@cassandra.apache.org > *Subject:* Re: TWCS sstables not dropping even though all data is expired > > > > Thx for the help Paul - there are definitely some details here I still > don't fully understand, but this helped me resolve the problem and know > what to look for in the future :) > > > > On Fri, May 3, 2019 at 12:44 PM Paul Chandler wrote: > > Hi Mike, > > > > For TWCS the sstable can only be deleted when all the data has expired in > that sstable, but you had a record without a ttl in it, so that sstable > could never be deleted. > > > > That bit is straight forward, the next bit I remember reading somewhere > but can’t find it at the moment to confirm my thinking. > > > > An sstable can only be deleted if it is the earliest sstable. I think this > is due to the fact that deleting later sstables may expose old versions of > the data stored in the stuck sstable which had been superseded. For > example, if there was a tombstone in a later sstable for the non TTLed > record causing the problem in this instance. Then deleting that sstable > would cause that deleted data to reappear. (Someone please correct me if I > have this wrong) > > > > Because sstables in different time buckets are never compacted together, > this problem only goes away when you did the major compaction. > > > > This would happen on all replicas of the data, hence the reason you this > problem on 3 nodes. > > > > Thanks > > > > Paul > > www.redshots.com > > > > On 3 May 2019, at 15:35, Mike Torra wrote: > > > > This does indeed seem to be a problem of overlapping sstables, but I don't > understand why the data (and number of sstables) just continues to grow > indefinitely. I also don't understand why this problem is only appearing on > some nodes. Is it just a coincidence that the one rogue test row without a > ttl is at the 'root' sstable causing the problem (ie, from the output of > `sstableexpiredblockers`)? > > > > Running a full compaction via `nodetool compact` reclaims the disk space, > but I'd like to figure out why this happened and prevent it. 
Understanding > why this problem would be isolated the way it is (ie only one CF even > though I have a few others that share a very similar schema, and only some > nodes) seems like it will help me prevent it. > > > > > > On Thu, May 2, 2019 at 1:00 PM Paul Chandler wrote: > > Hi Mike, > > > > It sounds like that record may have been deleted, if that is the case then > it would still be shown in this sstable, but the deleted tombstone record > would be in a later sstable. You can use nodetool getsstables to work out > which sstables contain the data. > > > > I recommend reading The Last Pickle post on this: > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html the sections > towards the bottom of this post may well explain why the sstable is not > being deleted. > > > > Thanks > > > > Paul > > www.redshots.com > > > > On 2 May 2019, at 16:08, Mike Torra wrote: > > > > I'm pretty stumped by this, so here is some more detail if it helps. > > > > Here is what the suspicious partition looks like in the `sstabledump` > ou
Re: TWCS sstables not dropping even though all data is expired
Thx for the tips Jeff, I'm definitely going to start using table level TTLs (not sure why I didn't before), and I'll take a look at the tombstone compaction subproperties On Mon, May 6, 2019 at 10:43 AM Jeff Jirsa wrote: > Fwiw if you enable the tombstone compaction subproperties, you’ll compact > away most of the other data in those old sstables (but not the partition > that’s been manually updated) > > Also table level TTLs help catch this type of manual manipulation - > consider adding it if appropriate. > > -- > Jeff Jirsa > > > On May 6, 2019, at 7:29 AM, Mike Torra > wrote: > > Compaction settings: > ``` > compaction = {'class': > 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', > 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', > 'max_threshold': '32', 'min_threshold': '4'} > ``` > read_repair_chance is 0, and I don't do any repairs because (normally) > everything has a ttl. It does seem like Jeff is right that a manual > insert/update without a ttl is what caused this, so I know how to resolve > it and prevent it from happening again. > > Thx again for all the help guys, I appreciate it! > > > On Fri, May 3, 2019 at 11:21 PM Jeff Jirsa wrote: > >> Repairs work fine with TWCS, but having a non-expiring row will prevent >> tombstones in newer sstables from being purged >> >> I suspect someone did a manual insert/update without a ttl and that >> effectively blocks all other expiring cells from being purged. >> >> -- >> Jeff Jirsa >> >> >> On May 3, 2019, at 7:57 PM, Nick Hatfield >> wrote: >> >> Hi Mike, >> >> >> >> If you will, share your compaction settings. More than likely, your issue >> is from 1 of 2 reasons: >> 1. You have read repair chance set to anything other than 0 >> >> 2. You’re running repairs on the TWCS CF >> >> >> >> Or both…. >> >> >> >> *From:* Mike Torra [mailto:mto...@salesforce.com.INVALID >> ] >> *Sent:* Friday, May 03, 2019 3:00 PM >> *To:* user@cassandra.apache.org >> *Subject:* Re: TWCS sstables not dropping even though all data is expired >> >> >> >> Thx for the help Paul - there are definitely some details here I still >> don't fully understand, but this helped me resolve the problem and know >> what to look for in the future :) >> >> >> >> On Fri, May 3, 2019 at 12:44 PM Paul Chandler wrote: >> >> Hi Mike, >> >> >> >> For TWCS the sstable can only be deleted when all the data has expired in >> that sstable, but you had a record without a ttl in it, so that sstable >> could never be deleted. >> >> >> >> That bit is straight forward, the next bit I remember reading somewhere >> but can’t find it at the moment to confirm my thinking. >> >> >> >> An sstable can only be deleted if it is the earliest sstable. I think >> this is due to the fact that deleting later sstables may expose old >> versions of the data stored in the stuck sstable which had been superseded. >> For example, if there was a tombstone in a later sstable for the non TTLed >> record causing the problem in this instance. Then deleting that sstable >> would cause that deleted data to reappear. (Someone please correct me if I >> have this wrong) >> >> >> >> Because sstables in different time buckets are never compacted together, >> this problem only goes away when you did the major compaction. >> >> >> >> This would happen on all replicas of the data, hence the reason you this >> problem on 3 nodes. 
>> >> >> >> Thanks >> >> >> >> Paul >> >> www.redshots.com >> >> >> >> On 3 May 2019, at 15:35, Mike Torra >> wrote: >> >> >> >> This does indeed seem to be a problem of overlapping sstables, but I >> don't understand why the data (and number of sstables) just continues to >> grow indefinitely. I also don't understand why this problem is only >> appearing on some nodes. Is it just a coincidence that the one rogue test >> row without a ttl is at the 'root' sstable causing the problem (ie, from >> the output of `sstableexpiredblockers`)? >> >> >> >> Running a full compaction via `nodetool compact` reclaims the disk space, >> but I'd like to figure out why this happened and prevent it. Understanding >> why
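For reference, a minimal CQL sketch of the two mitigations Jeff mentions (a table-level default TTL plus the tombstone compaction subproperties); the keyspace/table names and threshold values are illustrative, not taken from this thread.

```
ALTER TABLE my_ks.my_cf WITH default_time_to_live = 86400;

ALTER TABLE my_ks.my_cf WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_size': '6',
  'compaction_window_unit': 'HOURS',
  'unchecked_tombstone_compaction': 'true',
  'tombstone_threshold': '0.2'
};
```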
Recovery for deleted SSTables files for one column family.
Hi all, I would like to know: is there any way to rebuild a particular column family when all the SSTable files for that column family are missing? Say we do not have any backup of it. Thank you. Regards, Mike Yeap
Re: Recovery for deleted SSTables files for one column family.
Hi Ben, the scenario that I was trying to test was all sstables deleted from one node. So I did what you suggested (rebuild the sstables from the other replicas in the cluster) and it rebuilt the sstables successfully. I think the reason I didn't see the sstables rebuilt earlier on was because I didn't use the -full option of "nodetool repair". Thanks! Regards, Mike Yeap On Thu, May 19, 2016 at 4:03 PM, Ben Slater wrote: > Use nodetool listsnapshots to check if you have a snapshot - in default > configuration, Cassandra takes snapshots for operations like truncate. > > Failing that, is it all sstables from all nodes? In this case, your data > has gone I'm afraid. If it's just all sstables from one node then running > repair will rebuild the sstables from the other replicas in the cluster. > > Cheers > Ben > > On Thu, 19 May 2016 at 17:57 Mike Yeap wrote: > >> Hi all, I would like to know: is there any way to rebuild a particular >> column family when all the SSTable files for that column family are >> missing? Say we do not have any backup of it. >> >> Thank you. >> >> Regards, >> Mike Yeap >> > -- > > Ben Slater > Chief Product Officer, Instaclustr > +61 437 929 798 >
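For the archives, the recovery path that worked here, as a hedged sketch (keyspace/table names are placeholders):

```
# check whether an auto-snapshot of the column family still exists
nodetool listsnapshots

# with the sstables missing from this node only, stream the data back from
# the other replicas using a full (non-incremental) repair
nodetool repair -full my_keyspace my_cf
```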
Cassandra and Kubernetes and scaling
I saw a thread from April 2016 talking about Cassandra and Kubernetes, and have a few follow up questions. It seems that, especially after v1.2 of Kubernetes and the upcoming 1.3 features, this would be a very viable platform to run Cassandra on. My questions pertain to HostIds and scaling up/down, and are related: 1. If a container's host dies and is then brought up on another host, can you start up with the same PersistentVolume as the original container had? Which begs the question: would the new container get a new HostId, implying it would need to bootstrap into the environment? If it's a bootstrap, does the old one get deco'd/assassinated? 2. Scaling up/down. Scaling up would be relatively easy, as it should just kick off bootstrapping the node into the cluster, but what if you need to scale down? Would the container get deco'd by the scaling-down process, or just terminated, leaving you with potentially missing replicas? 3. Scaling up and increasing the RF of a particular keyspace: would there be a clean way to do this with the Kubernetes tooling? In the end I'm wondering how much of Kubernetes + Cassandra involves nodetool, and how much is just a Docker image where you need to manage all of that yourself (painfully). -- --mike
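As one concrete example of where nodetool still comes in, a hedged Kubernetes sketch (a fragment of a pod template, not from any official chart; image and names are placeholders): a preStop hook so that scaling down decommissions the node instead of just killing the container.

```
containers:
  - name: cassandra
    image: cassandra   # placeholder image/tag
    lifecycle:
      preStop:
        exec:
          # stream this node's data to the remaining replicas before termination
          command: ["/bin/sh", "-c", "nodetool decommission"]
```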
Re: Increasing replication factor and repair doesn't seem to work
Hi Luke, I've encountered similar problem before, could you please advise on following? 1) when you add 10.128.0.20, what are the seeds defined in cassandra.yaml? 2) when you add 10.128.0.20, were the data and cache directories in 10.128.0.20 empty? - /var/lib/cassandra/data - /var/lib/cassandra/saved_caches 3) if you do a compact in 10.128.0.3, what is the size shown in "Load" column in "nodetool status "? 4) when you do the full repair, did you use "nodetool repair" or "nodetool repair -full"? I'm asking this because Incremental Repair is the default for Cassandra 2.2 and later. Regards, Mike Yeap On Wed, May 25, 2016 at 8:01 AM, Bryan Cheng wrote: > Hi Luke, > > I've never found nodetool status' load to be useful beyond a general > indicator. > > You should expect some small skew, as this will depend on your current > compaction status, tombstones, etc. IIRC repair will not provide > consistency of intermediate states nor will it remove tombstones, it only > guarantees consistency in the final state. This means, in the case of > dropped hints or mutations, you will see differences in intermediate > states, and therefore storage footrpint, even in fully repaired nodes. This > includes intermediate UPDATE operations as well. > > Your one node with sub 1GB sticks out like a sore thumb, though. Where did > you originate the nodetool repair from? Remember that repair will only > ensure consistency for ranges held by the node you're running it on. While > I am not sure if missing ranges are included in this, if you ran nodetool > repair only on a machine with partial ownership, you will need to complete > repairs across the ring before data will return to full consistency. > > I would query some older data using consistency = ONE on the affected > machine to determine if you are actually missing data. There are a few > outstanding bugs in the 2.1.x and older release families that may result > in tombstone creation even without deletes, for example CASSANDRA-10547, > which impacts updates on collections in pre-2.1.13 Cassandra. > > You can also try examining the output of nodetool ring, which will give > you a breakdown of tokens and their associations within your cluster. > > --Bryan > > On Tue, May 24, 2016 at 3:49 PM, kurt Greaves > wrote: > >> Not necessarily considering RF is 2 so both nodes should have all >> partitions. Luke, are you sure the repair is succeeding? You don't have >> other keyspaces/duplicate data/extra data in your cassandra data directory? >> Also, you could try querying on the node with less data to confirm if it >> has the same dataset. >> >> On 24 May 2016 at 22:03, Bhuvan Rawal wrote: >> >>> For the other DC, it can be acceptable because partition reside on one >>> node, so say if you have a large partition, it may skew things a bit. >>> On May 25, 2016 2:41 AM, "Luke Jolly" wrote: >>> >>>> So I guess the problem may have been with the initial addition of the >>>> 10.128.0.20 node because when I added it in it never synced data I >>>> guess? It was at around 50 MB when it first came up and transitioned to >>>> "UN". After it was in I did the 1->2 replication change and tried repair >>>> but it didn't fix it. From what I can tell all the data on it is stuff >>>> that has been written since it came up. We never delete data ever so we >>>> should have zero tombstones. >>>> >>>> If I am not mistaken, only two of my nodes actually have all the data, >>>> 10.128.0.3 and 10.142.0.14 since they agree on the data amount. 
10.142.0.13 >>>> is almost a GB lower and then of course 10.128.0.20 which is missing >>>> over 5 GB of data. I tried running nodetool -local on both DCs and it >>>> didn't fix either one. >>>> >>>> Am I running into a bug of some kind? >>>> >>>> On Tue, May 24, 2016 at 4:06 PM Bhuvan Rawal >>>> wrote: >>>> >>>>> Hi Luke, >>>>> >>>>> You mentioned that replication factor was increased from 1 to 2. In >>>>> that case was the node bearing ip 10.128.0.20 carried around 3GB data >>>>> earlier? >>>>> >>>>> You can run nodetool repair with option -local to initiate repair >>>>> local datacenter for gce-us-central1. >>>>> >>>>> Also you may suspect that if a lot of data was deleted while the node >>>>> was down it may be having a lot of tombstones which is not needed to be >
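For reference, a hedged sketch of the RF change plus full-repair sequence being discussed; the keyspace name and DC names are placeholders and must match your own schema and snitch.

```
ALTER KEYSPACE my_ks WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'gce-us-central1': 2,
  'gce-us-east1': 2
};

-- then, on each node (full repair, since incremental is the default on 2.2+):
-- nodetool repair -full my_ks
```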
Re: Error while rebuilding a node: Stream failed
Hi George, are you using NetworkTopologyStrategy as the replication strategy for your keyspace? If yes, can you check the cassandra-rackdc.properties of this new node? https://issues.apache.org/jira/browse/CASSANDRA-8279 Regards, Mike Yeap On Wed, May 25, 2016 at 2:31 PM, George Sigletos wrote: > I am getting this error repeatedly while I am trying to add a new DC > consisting of one node in AWS to my existing cluster. I have tried 5 times > already. Running Cassandra 2.1.13 > > I have also set: > streaming_socket_timeout_in_ms: 360 > in all of my nodes > > Does anybody have any idea how this can be fixed? Thanks in advance > > Kind regards, > George > > P.S. > The complete stack trace: > -- StackTrace -- > java.lang.RuntimeException: Error while rebuilding node: Stream failed > at > org.apache.cassandra.service.StorageService.rebuild(StorageService.java:1076) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at sun.reflect.misc.Trampoline.invoke(Unknown Source) > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at sun.reflect.misc.MethodUtil.invoke(Unknown Source) > at > com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(Unknown Source) > at > com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(Unknown Source) > at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(Unknown > Source) > at com.sun.jmx.mbeanserver.PerInterface.invoke(Unknown Source) > at com.sun.jmx.mbeanserver.MBeanSupport.invoke(Unknown Source) > at > com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(Unknown Source) > at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(Unknown Source) > at > javax.management.remote.rmi.RMIConnectionImpl.doOperation(Unknown Source) > at > javax.management.remote.rmi.RMIConnectionImpl.access$300(Unknown Source) > at > javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(Unknown > Source) > at > javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(Unknown > Source) > at javax.management.remote.rmi.RMIConnectionImpl.invoke(Unknown > Source) > at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at sun.rmi.server.UnicastServerRef.dispatch(Unknown Source) > at sun.rmi.transport.Transport$2.run(Unknown Source) > at sun.rmi.transport.Transport$2.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at sun.rmi.transport.Transport.serviceCall(Unknown Source) > at sun.rmi.transport.tcp.TCPTransport.handleMessages(Unknown > Source) > at > sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source) > at > sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(Unknown > Source) > at > sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(Unknown Source) > at > sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at > sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source) > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at 
java.lang.Thread.run(Unknown Source) >
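For reference, the file Mike mentions is conf/cassandra-rackdc.properties on the new node; a hedged sketch follows, where the DC and rack names are placeholders and must match the DC names referenced by the keyspace's NetworkTopologyStrategy options.

```
# cassandra-rackdc.properties on the new AWS node
dc=aws-us-east
rack=rack1
```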
Re: Out of memory issues
Hi Paolo, a) was there any large insertion done? b) are the a lot of files in the saved_caches directory? c) would you consider to increase the HEAP_NEWSIZE to, say, 1200M? Regards, Mike Yeap On Fri, May 27, 2016 at 12:39 AM, Paolo Crosato < paolo.cros...@targaubiest.com> wrote: > Hi, > > we are running a cluster of 4 nodes, each one has the same sizing: 2 > cores, 16G ram and 1TB of disk space. > > On every node we are running cassandra 2.0.17, oracle java version > "1.7.0_45", centos 6 with this kernel version 2.6.32-431.17.1.el6.x86_64 > > Two nodes are running just fine, the other two have started to go OOM at > every start. > > This is the error we get: > > INFO [ScheduledTasks:1] 2016-05-26 18:15:58,460 StatusLogger.java (line > 70) ReadRepairStage 0 0116 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:15:58,462 StatusLogger.java (line > 70) MutationStage31 1369 20526 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:15:58,590 StatusLogger.java (line > 70) ReplicateOnWriteStage 0 0 0 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:15:58,591 StatusLogger.java (line > 70) GossipStage 0 0335 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:16:04,195 StatusLogger.java (line > 70) CacheCleanupExecutor 0 0 0 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:16:06,526 StatusLogger.java (line > 70) MigrationStage0 0 0 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:16:06,527 StatusLogger.java (line > 70) MemoryMeter 1 4 26 > 0 0 > INFO [ScheduledTasks:1] 2016-05-26 18:16:06,527 StatusLogger.java (line > 70) ValidationExecutor0 0 0 > 0 0 > DEBUG [MessagingService-Outgoing-/10.255.235.19] 2016-05-26 18:16:06,518 > OutboundTcpConnection.java (line 290) attempting to connect to / > 10.255.235.19 > INFO [GossipTasks:1] 2016-05-26 18:16:22,912 Gossiper.java (line 992) > InetAddress /10.255.235.28 is now DOWN > INFO [ScheduledTasks:1] 2016-05-26 18:16:22,952 StatusLogger.java (line > 70) FlushWriter 1 5 47 > 025 > INFO [ScheduledTasks:1] 2016-05-26 18:16:22,953 StatusLogger.java (line > 70) InternalResponseStage 0 0 0 > 0 0 > ERROR [ReadStage:27] 2016-05-26 18:16:29,250 CassandraDaemon.java (line > 258) Exception in thread Thread[ReadStage:27,5,main] > java.lang.OutOfMemoryError: Java heap space > at > org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:347) > at > org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:392) > at > org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:355) > at > org.apache.cassandra.db.ColumnSerializer.deserializeColumnBody(ColumnSerializer.java:124) > at > org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:85) > at org.apache.cassandra.db.Column$1.computeNext(Column.java:75) > at org.apache.cassandra.db.Column$1.computeNext(Column.java:64) > at > com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) > at > com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) > at > com.google.common.collect.AbstractIterator.next(AbstractIterator.java:153) > at > org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:434) > at > org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.fetchMoreData(IndexedSliceReader.java:387) > at > org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:145) > at > org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:45) > at > 
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) > at > com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) > at > org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:82) > at > org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:157) > at > org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:140) > at > org.apache.cassandra.utils.MergeIterator$Candidate.advance(Mer
Re: Node Stuck while restarting
Hi Bhuvan, how big are the current commit logs on the failed node, and what are the values of MAX_HEAP_SIZE and HEAP_NEWSIZE? Also, what are the values of the following properties in cassandra.yaml? memtable_allocation_type memtable_cleanup_threshold memtable_flush_writers memtable_heap_space_in_mb memtable_offheap_space_in_mb Regards, Mike Yeap On Sun, May 29, 2016 at 6:18 PM, Bhuvan Rawal wrote: > Hi, > > We are running a 6 node cluster in 2 DCs on DSC 3.0.3, with 3 nodes each. > One of the nodes was showing UNREACHABLE on the other nodes in nodetool > describecluster, and on that node all the others were showing UNREACHABLE, so > as a measure we restarted the node. > > But on doing that it is stuck, possibly at commit log replay, with these > messages in system.log: > > DEBUG [SlabPoolCleaner] 2016-05-29 14:07:28,156 ColumnFamilyStore.java:829 > - Enqueuing flush of batches: 226784704 (11%) on-heap, 0 (0%) off-heap > DEBUG [main] 2016-05-29 14:07:28,576 CommitLogReplayer.java:415 - > Replaying /commitlog/data/CommitLog-6-1464508993391.log (CL version 6, > messaging version 10, compression null) > DEBUG [main] 2016-05-29 14:07:28,781 ColumnFamilyStore.java:829 - > Enqueuing flush of batches: 207333510 (10%) on-heap, 0 (0%) off-heap > > It is stuck at the MemtablePostFlush / MemtableFlushWriter stages with > pending messages. This has been their status as per *nodetool tpstats* for a > long time: > MemtablePostFlush Active - 1 pending - 52 > completed - 16 > MemtableFlushWriter Active - 2 pending - 13 > completed - 15 > > > We restarted the node with the log level set to TRACE, but in vain. What > could be a possible contingency plan in such a scenario? > > Best Regards, > Bhuvan > >
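For context, the properties being asked about live in cassandra.yaml; a hedged sketch with illustrative values only (the heap/offheap space settings default to a quarter of the heap when left unset):

```
memtable_allocation_type: heap_buffers
memtable_flush_writers: 2
# memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1)
# memtable_cleanup_threshold: 0.33
# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048
```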
Re: [Marketing Mail] Cassandra 2.1: Snapshot data changing while transferring
Hi Paul, what is the value of the snapshot_before_compaction property in your cassandra.yaml? Say another snapshot is being taken (because compaction kicked in and the snapshot_before_compaction property is set to true) at the moment you're tarring the snapshot folders... Maybe take a look at the records in system.compaction_history: select * from system.compaction_history; Regards, Mike Yeap On Tue, May 31, 2016 at 5:21 PM, Paul Dunkler wrote: > And - as an addition: > > Shouldn't it be documented that even snapshot files can change? > > I guess this might come from the incremental repairs... > > The repair time is stored in the sstable (RepairedAt timestamp metadata). > > > OK, that sounds interesting. > Could that also happen to incremental backup files as well? I had another > case where incremental backup files were totally deleted automagically. > > And - what is the suggested way to solve that problem? Should I try > tar-ing the snapshot again until nothing changes in between? > Or is there a way to "pause" the incremental repairs? > > > Cheers, > Reynald > > On 31/05/2016 11:03, Paul Dunkler wrote: > > Hi there, > > I am sometimes running into very strange errors while backing up snapshots > from a Cassandra cluster. > > Cassandra version: > 2.1.11 > > What I basically do: > 1. nodetool snapshot > 2. tar all snapshot folders into one file > 3. transfer them to another server > > What happens is that tar sometimes gives the error message "file > changed as we read it" while it's adding a .db file from the folder of the > previously created snapshot. > If I understand everything correctly, this SHOULD never happen. Snapshots > should be totally immutable, right? > > Am I maybe hitting a bug, or is there some rare case with running repair > operations or whatnot which can change snapshotted data? > I already searched through the Cassandra JIRA but couldn't find a bug which > looks related to this behaviour. > > Would love to get some help on this. > > — > Paul Dunkler > > > > — > Paul Dunkler > > ** * * UPLEX - Nils Goroll Systemoptimierung > > Scheffelstraße 32 > 22301 Hamburg > > tel +49 40 288 057 31 > mob +49 151 252 228 42 > fax +49 40 429 497 53 > > xmpp://pauldunk...@jabber.ccc.de > > http://uplex.de/ > > > — > Paul Dunkler > > ** * * UPLEX - Nils Goroll Systemoptimierung > > Scheffelstraße 32 > 22301 Hamburg > > tel +49 40 288 057 31 > mob +49 151 252 228 42 > fax +49 40 429 497 53 > > xmpp://pauldunk...@jabber.ccc.de > > http://uplex.de/ > >
Ring connection timeouts with 2.2.6
Hi, We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is sitting at <25% CPU, doing mostly writes, and not showing any particularly long GC times/pauses. By all observed metrics the ring is healthy and performing well. However, we are noticing a pretty consistent number of connection timeouts coming from the messaging service between various pairs of nodes in the ring. The "Connection.TotalTimeouts" meter metric shows 100k's of timeouts per minute, usually between two pairs of nodes. It seems to occur for several hours at a time, then may stop or move to other pairs of nodes in the ring. The metric "Connection.SmallMessageDroppedTasks." will also grow for one pair of the nodes in the TotalTimeouts metric. Looking at the debug log typically shows a large number of messages like the following on one of the nodes: StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0) We have cross node timeouts enabled, but ntp is running on all nodes and no node appears to have time drift. The network appears to be fine between nodes, with iperf tests showing that we have a lot of headroom. Any thoughts on what to look for? Can we increase thread count/pool sizes for the messaging service? Thanks, Mike -- Mike Heffner Librato, Inc.
Re: Ring connection timeouts with 2.2.6
One thing to add, if we do a rolling restart of the ring the timeouts disappear entirely for several hours and performance returns to normal. It's as if something is leaking over time, but we haven't seen any noticeable change in heap. On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner wrote: > Hi, > > We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is > sitting at <25% CPU, doing mostly writes, and not showing any particular > long GC times/pauses. By all observed metrics the ring is healthy and > performing well. > > However, we are noticing a pretty consistent number of connection timeouts > coming from the messaging service between various pairs of nodes in the > ring. The "Connection.TotalTimeouts" meter metric show 100k's of timeouts > per minute, usually between two pairs of nodes for several hours at a time. > It seems to occur for several hours at a time, then may stop or move to > other pairs of nodes in the ring. The metric > "Connection.SmallMessageDroppedTasks." will also grow for one pair of > the nodes in the TotalTimeouts metric. > > Looking at the debug log typically shows a large number of messages like > the following on one of the nodes: > > StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0) > > We have cross node timeouts enabled, but ntp is running on all nodes and > no node appears to have time drift. > > The network appears to be fine between nodes, with iperf tests showing > that we have a lot of headroom. > > Any thoughts on what to look for? Can we increase thread count/pool sizes > for the messaging service? > > Thanks, > > Mike > > -- > > Mike Heffner > Librato, Inc. > > -- Mike Heffner Librato, Inc.
Re: Ring connection timeouts with 2.2.6
Jens, We haven't noticed any particular large GC operations or even persistently high GC times. Mike On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil wrote: > Hi, > > Could it be garbage collection occurring on nodes that are more heavily > loaded? > > Cheers, > Jens > > Den sön 26 juni 2016 05:22Mike Heffner skrev: > >> One thing to add, if we do a rolling restart of the ring the timeouts >> disappear entirely for several hours and performance returns to normal. >> It's as if something is leaking over time, but we haven't seen any >> noticeable change in heap. >> >> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner wrote: >> >>> Hi, >>> >>> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that >>> is sitting at <25% CPU, doing mostly writes, and not showing any particular >>> long GC times/pauses. By all observed metrics the ring is healthy and >>> performing well. >>> >>> However, we are noticing a pretty consistent number of connection >>> timeouts coming from the messaging service between various pairs of nodes >>> in the ring. The "Connection.TotalTimeouts" meter metric show 100k's of >>> timeouts per minute, usually between two pairs of nodes for several hours >>> at a time. It seems to occur for several hours at a time, then may stop or >>> move to other pairs of nodes in the ring. The metric >>> "Connection.SmallMessageDroppedTasks." will also grow for one pair of >>> the nodes in the TotalTimeouts metric. >>> >>> Looking at the debug log typically shows a large number of messages like >>> the following on one of the nodes: >>> >>> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0) >>> >>> We have cross node timeouts enabled, but ntp is running on all nodes and >>> no node appears to have time drift. >>> >>> The network appears to be fine between nodes, with iperf tests showing >>> that we have a lot of headroom. >>> >>> Any thoughts on what to look for? Can we increase thread count/pool >>> sizes for the messaging service? >>> >>> Thanks, >>> >>> Mike >>> >>> -- >>> >>> Mike Heffner >>> Librato, Inc. >>> >>> >> >> >> -- >> >> Mike Heffner >> Librato, Inc. >> >> -- > > Jens Rantil > Backend Developer @ Tink > > Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden > For urgent matters you can reach me at +46-708-84 18 32. > -- Mike Heffner Librato, Inc.
Re: Ring connection timeouts with 2.2.6
Jeff, Thanks, yeah we updated to the 2.16.4 driver version from source. I don't believe we've hit the bugs mentioned in earlier driver versions. Mike On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa wrote: > AWS ubuntu 14.04 AMI ships with buggy enhanced networking driver – > depending on your instance types / hypervisor choice, you may want to > ensure you’re not seeing that bug. > > > > *From: *Mike Heffner > *Reply-To: *"user@cassandra.apache.org" > *Date: *Friday, July 1, 2016 at 1:10 PM > *To: *"user@cassandra.apache.org" > *Cc: *Peter Norton > *Subject: *Re: Ring connection timeouts with 2.2.6 > > > > Jens, > > > > We haven't noticed any particular large GC operations or even persistently > high GC times. > > > > Mike > > > > On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil wrote: > > Hi, > > Could it be garbage collection occurring on nodes that are more heavily > loaded? > > Cheers, > Jens > > > > Den sön 26 juni 2016 05:22Mike Heffner skrev: > > One thing to add, if we do a rolling restart of the ring the timeouts > disappear entirely for several hours and performance returns to normal. > It's as if something is leaking over time, but we haven't seen any > noticeable change in heap. > > > > On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner wrote: > > Hi, > > > > We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is > sitting at <25% CPU, doing mostly writes, and not showing any particular > long GC times/pauses. By all observed metrics the ring is healthy and > performing well. > > > > However, we are noticing a pretty consistent number of connection timeouts > coming from the messaging service between various pairs of nodes in the > ring. The "Connection.TotalTimeouts" meter metric show 100k's of timeouts > per minute, usually between two pairs of nodes for several hours at a time. > It seems to occur for several hours at a time, then may stop or move to > other pairs of nodes in the ring. The metric > "Connection.SmallMessageDroppedTasks." will also grow for one pair of > the nodes in the TotalTimeouts metric. > > > > Looking at the debug log typically shows a large number of messages like > the following on one of the nodes: > > > > StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 > <https://urldefense.proofpoint.com/v2/url?u=http-3A__172.26.33.177&d=CwMFaQ&c=08AGY6txKsvMOP6lYkHQpPMRA1U6kqhAwGa8-0QCg3M&r=yfYEBHVkX6l0zImlOIBID0gmhluYPD5Jje-3CtaT3ow&m=KlMh_-rpcOH2Mdf3i2XGCQhtU4ZuD0Y37WpHKGlKtnQ&s=ihxNa3DwQPrfqEURi_UIncjESJC_XexR_AjY81coG8U&e=> > (ttl 0) > > We have cross node timeouts enabled, but ntp is running on all nodes and > no node appears to have time drift. > > > > The network appears to be fine between nodes, with iperf tests showing > that we have a lot of headroom. > > > > Any thoughts on what to look for? Can we increase thread count/pool sizes > for the messaging service? > > > > Thanks, > > > > Mike > > > > -- > > > Mike Heffner > > Librato, Inc. > > > > > > > > -- > > > Mike Heffner > > Librato, Inc. > > > > -- > > Jens Rantil > Backend Developer @ Tink > > Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden > For urgent matters you can reach me at +46-708-84 18 32. > > > > > > -- > > > Mike Heffner > > Librato, Inc. > > > -- Mike Heffner Librato, Inc.
Re: Ring connection timeouts with 2.2.6
Just to followup on this post with a couple of more data points: 1) We upgraded to 2.2.7 and did not see any change in behavior. 2) However, what *has* fixed this issue for us was disabling msg coalescing by setting: otc_coalescing_strategy: DISABLED We were using the default setting before (time horizon I believe). We see periodic timeouts on the ring (once every few hours), but they are brief and don't impact latency. With msg coalescing turned on we would see these timeouts persist consistently after an initial spike. My guess is that something in the coalescing logic is disturbed by the initial timeout spike which leads to dropping all / high-percentage of all subsequent traffic. We are planning to continue production use with msg coaleasing disabled for now and may run tests in our staging environments to identify where the coalescing is breaking this. Mike On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner wrote: > Jeff, > > Thanks, yeah we updated to the 2.16.4 driver version from source. I don't > believe we've hit the bugs mentioned in earlier driver versions. > > Mike > > On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa > wrote: > >> AWS ubuntu 14.04 AMI ships with buggy enhanced networking driver – >> depending on your instance types / hypervisor choice, you may want to >> ensure you’re not seeing that bug. >> >> >> >> *From: *Mike Heffner >> *Reply-To: *"user@cassandra.apache.org" >> *Date: *Friday, July 1, 2016 at 1:10 PM >> *To: *"user@cassandra.apache.org" >> *Cc: *Peter Norton >> *Subject: *Re: Ring connection timeouts with 2.2.6 >> >> >> >> Jens, >> >> >> >> We haven't noticed any particular large GC operations or even >> persistently high GC times. >> >> >> >> Mike >> >> >> >> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil wrote: >> >> Hi, >> >> Could it be garbage collection occurring on nodes that are more heavily >> loaded? >> >> Cheers, >> Jens >> >> >> >> Den sön 26 juni 2016 05:22Mike Heffner skrev: >> >> One thing to add, if we do a rolling restart of the ring the timeouts >> disappear entirely for several hours and performance returns to normal. >> It's as if something is leaking over time, but we haven't seen any >> noticeable change in heap. >> >> >> >> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner wrote: >> >> Hi, >> >> >> >> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is >> sitting at <25% CPU, doing mostly writes, and not showing any particular >> long GC times/pauses. By all observed metrics the ring is healthy and >> performing well. >> >> >> >> However, we are noticing a pretty consistent number of connection >> timeouts coming from the messaging service between various pairs of nodes >> in the ring. The "Connection.TotalTimeouts" meter metric show 100k's of >> timeouts per minute, usually between two pairs of nodes for several hours >> at a time. It seems to occur for several hours at a time, then may stop or >> move to other pairs of nodes in the ring. The metric >> "Connection.SmallMessageDroppedTasks." will also grow for one pair of >> the nodes in the TotalTimeouts metric. 
>> >> >> >> Looking at the debug log typically shows a large number of messages like >> the following on one of the nodes: >> >> >> >> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 >> <https://urldefense.proofpoint.com/v2/url?u=http-3A__172.26.33.177&d=CwMFaQ&c=08AGY6txKsvMOP6lYkHQpPMRA1U6kqhAwGa8-0QCg3M&r=yfYEBHVkX6l0zImlOIBID0gmhluYPD5Jje-3CtaT3ow&m=KlMh_-rpcOH2Mdf3i2XGCQhtU4ZuD0Y37WpHKGlKtnQ&s=ihxNa3DwQPrfqEURi_UIncjESJC_XexR_AjY81coG8U&e=> >> (ttl 0) >> >> We have cross node timeouts enabled, but ntp is running on all nodes and >> no node appears to have time drift. >> >> >> >> The network appears to be fine between nodes, with iperf tests showing >> that we have a lot of headroom. >> >> >> >> Any thoughts on what to look for? Can we increase thread count/pool sizes >> for the messaging service? >> >> >> >> Thanks, >> >> >> >> Mike >> >> >> >> -- >> >> >> Mike Heffner >> >> Librato, Inc. >> >> >> >> >> >> >> >> -- >> >> >> Mike Heffner >> >> Librato, Inc. >> >> >> >> -- >> >> Jens Rantil >> Backend Developer @ Tink >> >> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden >> For urgent matters you can reach me at +46-708-84 18 32. >> >> >> >> >> >> -- >> >> >> Mike Heffner >> >> Librato, Inc. >> >> >> > > > > -- > > Mike Heffner > Librato, Inc. > > -- Mike Heffner Librato, Inc.
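For reference, the change described above is a single key in cassandra.yaml; a minimal sketch, assuming an otherwise stock 2.2 configuration (other valid values for this key are FIXED, MOVINGAVERAGE and TIMEHORIZON):

    # cassandra.yaml -- disable outbound TCP message coalescing (requires a node restart)
    otc_coalescing_strategy: DISABLED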
Re: Ring connection timeouts with 2.2.6
Garo, No, we didn't notice any change in system load, just the expected spike in packet counts. Mike On Wed, Jul 20, 2016 at 3:49 PM, Juho Mäkinen wrote: > Just to pick this up: Did you see any system load spikes? I'm tracing a > problem on 2.2.7 where my cluster sees load spikes up to 20-30, when the > normal average load is around 3-4. So far I haven't found any good reason, > but I'm going to try otc_coalescing_strategy: disabled tomorrow. > > - Garo > > On Fri, Jul 15, 2016 at 6:16 PM, Mike Heffner wrote: > >> Just to followup on this post with a couple of more data points: >> >> 1) >> >> We upgraded to 2.2.7 and did not see any change in behavior. >> >> 2) >> >> However, what *has* fixed this issue for us was disabling msg coalescing >> by setting: >> >> otc_coalescing_strategy: DISABLED >> >> We were using the default setting before (time horizon I believe). >> >> We see periodic timeouts on the ring (once every few hours), but they are >> brief and don't impact latency. With msg coalescing turned on we would see >> these timeouts persist consistently after an initial spike. My guess is >> that something in the coalescing logic is disturbed by the initial timeout >> spike which leads to dropping all / high-percentage of all subsequent >> traffic. >> >> We are planning to continue production use with msg coaleasing disabled >> for now and may run tests in our staging environments to identify where the >> coalescing is breaking this. >> >> Mike >> >> On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner wrote: >> >>> Jeff, >>> >>> Thanks, yeah we updated to the 2.16.4 driver version from source. I >>> don't believe we've hit the bugs mentioned in earlier driver versions. >>> >>> Mike >>> >>> On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa >>> wrote: >>> >>>> AWS ubuntu 14.04 AMI ships with buggy enhanced networking driver – >>>> depending on your instance types / hypervisor choice, you may want to >>>> ensure you’re not seeing that bug. >>>> >>>> >>>> >>>> *From: *Mike Heffner >>>> *Reply-To: *"user@cassandra.apache.org" >>>> *Date: *Friday, July 1, 2016 at 1:10 PM >>>> *To: *"user@cassandra.apache.org" >>>> *Cc: *Peter Norton >>>> *Subject: *Re: Ring connection timeouts with 2.2.6 >>>> >>>> >>>> >>>> Jens, >>>> >>>> >>>> >>>> We haven't noticed any particular large GC operations or even >>>> persistently high GC times. >>>> >>>> >>>> >>>> Mike >>>> >>>> >>>> >>>> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil >>>> wrote: >>>> >>>> Hi, >>>> >>>> Could it be garbage collection occurring on nodes that are more heavily >>>> loaded? >>>> >>>> Cheers, >>>> Jens >>>> >>>> >>>> >>>> Den sön 26 juni 2016 05:22Mike Heffner skrev: >>>> >>>> One thing to add, if we do a rolling restart of the ring the timeouts >>>> disappear entirely for several hours and performance returns to normal. >>>> It's as if something is leaking over time, but we haven't seen any >>>> noticeable change in heap. >>>> >>>> >>>> >>>> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner >>>> wrote: >>>> >>>> Hi, >>>> >>>> >>>> >>>> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that >>>> is sitting at <25% CPU, doing mostly writes, and not showing any particular >>>> long GC times/pauses. By all observed metrics the ring is healthy and >>>> performing well. >>>> >>>> >>>> >>>> However, we are noticing a pretty consistent number of connection >>>> timeouts coming from the messaging service between various pairs of nodes >>>> in the ring. 
The "Connection.TotalTimeouts" meter metric show 100k's of >>>> timeouts per minute, usually between two pairs of nodes for several hours >>>> at a time. It seems to occur for several hours at a time, then may stop or >>>> move to other pairs of nodes in the ring. The metric >
failing bootstraps with OOM
Hi All - I am trying to bootstrap a replacement node in a cluster, but it consistently fails to bootstrap because of OOM exceptions. For almost a week I've been going through cycles of bootstrapping, finding errors, then restarting / resuming bootstrap, and I am struggling to move forward. Sometimes the bootstrapping node itself fails, which usually manifests first as very high GC times (sometimes 30s+!), then nodetool commands start to fail with timeouts, then the node will crash with an OOM exception. Other times, a node streaming data to this bootstrapping node will have a similar failure. In either case, when it happens I need to restart the crashed node, then resume the bootstrap.

On top of these issues, when I do need to restart a node it takes a long time (http://stackoverflow.com/questions/40141739/why-does-cassandra-sometimes-take-a-hours-to-start). This exacerbates the problem because it takes so long to find out if a change to the cluster helps or if it still fails. I am in the process of upgrading all nodes in the cluster from m4.xlarge to c4.4xlarge, and I am running Cassandra DDC 3.5 on all nodes. The cluster has 26 nodes spread across 4 regions in EC2. Here is some other relevant cluster info (also in the Stack Overflow post):

Cluster Info
* Cassandra DDC 3.5
* EC2MultiRegionSnitch
* m4.xlarge, moving to c4.4xlarge

Schema Info
* 3 CF's, all 'write once' (i.e. no updates), 1 week ttl, STCS (default)
* no secondary indexes

I am unsure what to try next. The node that is currently having this bootstrap problem is a pretty beefy box, with 16 cores, 30G of RAM, and a 3.2T EBS volume. The slow startup time might be because of the issues with a high number of SSTables that Jeff Jirsa mentioned in a comment on the SO post, but I am at a loss for the OOM issues. I've tried:
* Changing from CMS to G1 GC, which seemed to have helped a bit
* Upgrading from 3.5 to 3.9, which did not seem to help
* Upgrading instance types from m4.xlarge to c4.4xlarge, which seems to help, but I'm still having issues

I'd appreciate any suggestions on what else I can try to track down the cause of these OOM exceptions.

- Mike
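The restart/resume cycle described above boils down to a few nodetool commands; a minimal sketch, assuming a 2.2+/3.x node that is joining the ring:

    # resume a bootstrap that died part-way instead of re-streaming from scratch
    nodetool bootstrap resume
    # see which peers are still streaming to the joining node and how much is left
    nodetool netstats
    # check whether compaction on the joining node is falling behind
    nodetool compactionstats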
Re: failing bootstraps with OOM
Hi Alex - I do monitor sstable counts and pending compactions, but probably not closely enough. In 3/4 regions the cluster is running in, both counts are very high - ~30-40k sstables for one particular CF, and on many nodes >1k pending compactions. I had noticed this before, but I didn't have a good sense of what a "high" number for these values was. It makes sense to me why this would cause the issues I've seen. After increasing concurrent_compactors and compaction_throughput_mb_per_sec (to 8 and 64mb, respectively), I'm starting to see those counts go down steadily. Hopefully that will resolve the OOM issues, but it looks like it will take a while for compactions to catch up. Thanks for the suggestions, Alex From: Oleksandr Shulgin mailto:oleksandr.shul...@zalando.de>> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" mailto:user@cassandra.apache.org>> Date: Wednesday, November 2, 2016 at 1:07 PM To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" mailto:user@cassandra.apache.org>> Subject: Re: failing bootstraps with OOM On Wed, Nov 2, 2016 at 3:35 PM, Mike Torra mailto:mto...@demandware.com>> wrote: > > Hi All - > > I am trying to bootstrap a replacement node in a cluster, but it consistently > fails to bootstrap because of OOM exceptions. For almost a week I've been > going through cycles of bootstrapping, finding errors, then restarting / > resuming bootstrap, and I am struggling to move forward. Sometimes the > bootstrapping node itself fails, which usually manifests first as very high > GC times (sometimes 30s+!), then nodetool commands start to fail with > timeouts, then the node will crash with an OOM exception. Other times, a node > streaming data to this bootstrapping node will have a similar failure. In > either case, when it happens I need to restart the crashed node, then resume > the bootstrap. > > On top of these issues, when I do need to restart a node it takes a lng > time > (http://stackoverflow.com/questions/40141739/why-does-cassandra-sometimes-take-a-hours-to-start). > This exasperates the problem because it takes so long to find out if a > change to the cluster helps or if it still fails. I am in the process of > upgrading all nodes in the cluster from m4.xlarge to c4.4xlarge, and I am > running Cassandra DDC 3.5 on all nodes. The cluster has 26 nodes spread > across 4 regions in EC2. Here is some other relevant cluster info (also in > stack overflow post): > > Cluster Info > > Cassandra DDC 3.5 > EC2MultiRegionSnitch > m4.xlarge, moving to c4.4xlarge > > Schema Info > > 3 CF's, all 'write once' (ie no updates), 1 week ttl, STCS (default) > no secondary indexes > > I am unsure what to try next. The node that is currently having this > bootstrap problem is a pretty beefy box, with 16 cores, 30G of ram, and a > 3.2T EBS volume. The slow startup time might be because of the issues with a > high number of SSTables that Jeff Jirsa mentioned in a comment on the SO > post, but I am at a loss for the OOM issues. I've tried: > > Changing from CMS to G1 GC, which seemed to have helped a bit > Upgrading from 3.5 to 3.9, which did not seem to help > Upgrading instance types from m4.xlarge to c4.4xlarge, which seems to help, > but I'm still having issues > > I'd appreciate any suggestions on what else I can try to track down the cause > of these OOM exceptions. Hi, Do you monitor pending compactions and actual number of SSTable files? 
On startup Cassandra needs to touch most of the data files and also seems to keep some metadata about every relevant file in memory. We once got into a situation where we ended up with hundreds of thousands of files per node, which resulted in OOMs on every other node of the ring, and startup time was over half an hour (this was on version 2.1). If you have many more files than you expect, then you should check and adjust your concurrent_compactors and compaction_throughput_mb_per_sec settings. Increase concurrent_compactors if you're behind (the pending compactions metric is a hint) and consider un-throttling compaction until your situation is back to normal. Cheers, -- Alex
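A rough sketch of the checks and knobs described above; the yaml values shown are the ones Mike mentions in his reply (8 compactors, 64 MB/s), not general recommendations, and the keyspace name is a placeholder:

    # pending compactions and per-table SSTable counts
    nodetool compactionstats
    nodetool cfstats my_keyspace | grep -e "Table:" -e "SSTable count"
    # un-throttle compaction at runtime while catching up (0 = unlimited)
    nodetool setcompactionthroughput 0
    # cassandra.yaml (requires a restart), e.g.:
    concurrent_compactors: 8
    compaction_throughput_mb_per_sec: 64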
weird jvm metrics
Hi There - I recently upgraded from cassandra 3.5 to 3.9 (DDC), and I noticed that the "new" jvm metrics are reporting with an extra '.' character in them. Here is a snippet of what I see from one of my nodes:

ubuntu@ip-10-0-2-163:~$ sudo tcpdump -i eth0 -v dst port 2003 -A | grep 'jvm'
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
.Je..l>.pi.cassandra.us-east-1.cassy-node1.jvm.buffers..direct.capacity 762371494 1482960946
pi.cassandra.us-east-1.cassy-node1.jvm.buffers..direct.count 3054 1482960946
pi.cassandra.us-east-1.cassy-node1.jvm.buffers..direct.used 762371496 1482960946
pi.cassandra.us-east-1.cassy-node1.jvm.buffers..mapped.capacity 515226631134 1482960946
pi.cassandra.us-east-1.cassy-node1.jvm.buffers..mapped.count 45572 1482960946
pi.cassandra.us-east-1.cassy-node1.jvm.buffers..mapped.used 515319762610 1482960946
pi.cassandra.us-east-1.cassy-node1.jvm.fd.usage 0.00 1482960946

My metrics.yaml looks like this:

graphite:
  - period: 60
    timeunit: 'SECONDS'
    prefix: 'pi.cassandra.us-east-1.cassy-node1'
    hosts:
      - host: '#RELAY_HOST#'
        port: 2003
    predicate:
      color: "white"
      useQualifiedName: true
      patterns:
        - "^org.+"
        - "^jvm.+"
        - "^java.lang.+"

All the org.* metrics come through fine, and the jvm.fd.usage metric strangely comes through fine, too. The rest of the jvm.* metrics have this extra '.' character that causes them to not show up in graphite. Am I missing something silly here? Appreciate any help or suggestions.

- Mike
Re: weird jvm metrics
Just bumping - has anyone seen this before? http://stackoverflow.com/questions/41446352/cassandra-3-9-jvm-metrics-have-bad-name From: Mike Torra mailto:mto...@demandware.com>> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" mailto:user@cassandra.apache.org>> Date: Wednesday, December 28, 2016 at 4:49 PM To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" mailto:user@cassandra.apache.org>> Subject: weird jvm metrics Hi There - I recently upgraded from cassandra 3.5 to 3.9 (DDC), and I noticed that the "new" jvm metrics are reporting with an extra '.' character in them. Here is a snippet of what I see from one of my nodes: ubuntu@ip-10-0-2-163:~$ sudo tcpdump -i eth0 -v dst port 2003 -A | grep 'jvm' tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes .Je..l>.pi.cassandra.us-east-1.cassy-node1.jvm.buffers..direct.capacity 762371494 1482960946 pi.cassandra.us-east-1.cassy-node1.jvm.buffers..direct.count 3054 1482960946 pi.cassandra.us-east-1.cassy-node1.jvm.buffers..direct.used 762371496 1482960946 pi.cassandra.us-east-1.cassy-node1.jvm.buffers..mapped.capacity 515226631134 1482960946 pi.cassandra.us-east-1.cassy-node1.jvm.buffers..mapped.count 45572 1482960946 pi.cassandra.us-east-1.cassy-node1.jvm.buffers..mapped.used 515319762610 1482960946 pi.cassandra.us-east-1.cassy-node1.jvm.fd.usage 0.00 1482960946 My metrics.yaml looks like this: graphite: - period: 60 timeunit: 'SECONDS' prefix: 'pi.cassandra.us-east-1.cassy-node1' hosts: - host: '#RELAY_HOST#' port: 2003 predicate: color: "white" useQualifiedName: true patterns: - "^org.+" - "^jvm.+" - "^java.lang.+" All the org.* metrics come through fine, and the jvm.fd.usage metric strangely comes through fine, too. The rest of the jvm.* metrics have this extra '.' character that causes them to not show up in graphite. Am I missing something silly here? Appreciate any help or suggestions. - Mike
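The doubled '.' looks like an empty component in the generated metric name. Until that is fixed in Cassandra itself, one possible stopgap, assuming the metrics pass through a carbon-aggregator with rewrite rules enabled, is to collapse the doubled dot on the Graphite side; an untested sketch:

    # rewrite-rules.conf: collapse 'jvm.buffers..direct' style names before they are stored
    [pre]
    \.\. = .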
implementing a 'sorted set' on top of cassandra
We currently use redis to store sorted sets that we increment many, many times more than we read. For example, only about 5% of these sets are ever read. We are getting to the point where redis is becoming difficult to scale (currently at >20 nodes). We've started using cassandra for other things, and now we are experimenting to see if having a similar 'sorted set' data structure is feasible in cassandra. My approach so far is:

1. Use a counter CF to store the values I want to sort by
2. Periodically read in all key/values in the counter CF and sort in the client application (~every five minutes or so)
3. Write back to a different CF with the ordered keys I care about

Does this seem crazy? Is there a simpler way to do this in cassandra?
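A minimal CQL sketch of that approach, with made-up table and column names: counters accumulate the scores, and a second table holds the periodically recomputed ranking so reads become a single partition slice.

    -- step 1: accumulate the values to sort by
    CREATE TABLE scores (
        set_id  text,
        item_id text,
        score   counter,
        PRIMARY KEY (set_id, item_id)
    );
    UPDATE scores SET score = score + 1 WHERE set_id = 'set42' AND item_id = 'item_1';

    -- steps 2-3: the client reads the partition, sorts, and rewrites the ranked view
    CREATE TABLE ranked_sets (
        set_id  text,
        rank    int,
        item_id text,
        score   bigint,
        PRIMARY KEY (set_id, rank)
    );
    SELECT item_id, score FROM ranked_sets WHERE set_id = 'set42' LIMIT 100;

Rewriting the ranked view in place is also what generates the tombstones discussed in the replies that follow.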
Re: implementing a 'sorted set' on top of cassandra
Thanks for the feedback everyone! Redis `zincryby` and `zrangebyscore` is indeed what we use today. Caching the resulting 'sorted sets' in redis is exactly what I plan to do. There will be tens of thousands of these sorted sets, each generally with <10k items (with maybe a few exceptions going a bit over that). The reason to periodically calculate the set and store it in cassandra is to avoid having the client do that work, when the client only really cares about the top 100 or so items at any given time. Being truly "real time" is not critical for us, but it is a selling point to be as up to date as possible. I'd like to understand the performance issue of frequently updating these sets. I understand that every time I 'regenerate' the sorted set, any rows that change will create a tombstone - for example, if "item_1" is in first place and "item_2" is in second place, then they switch on the next update, that would be two tombstones. Do you think this will be a big enough problem that it is worth doing the sorting work client side, on demand, and just try to eat the performance hit there? My thought was to make a tradeoff by using more cassandra disk space (ie pre calculating all sets), in exchange for faster reads when requests actually come in that need this data. From: Benjamin Roth mailto:benjamin.r...@jaumo.com>> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" mailto:user@cassandra.apache.org>> Date: Saturday, January 14, 2017 at 1:25 PM To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" mailto:user@cassandra.apache.org>> Subject: Re: implementing a 'sorted set' on top of cassandra Mike mentioned "increment" in his initial post. That let me think of a case with increments and fetching a top list by a counter like https://redis.io/commands/zincrby https://redis.io/commands/zrangebyscore 1. Cassandra is absolutely not made to sort by a counter (or a non-counter numeric incrementing value) but it is made to store counters. In this case a partition could be seen as a set. 2. I thought of CS for persistence and - depending on the app requirements like real-time and set size - still use redis as a read cache 2017-01-14 18:45 GMT+01:00 Jonathan Haddad mailto:j...@jonhaddad.com>>: Sorted sets don't have a requirement of incrementing / decrementing. They're commonly used for thing like leaderboards where the values are arbitrary. In Redis they are implemented with 2 data structures for efficient lookups of either key or value. No getting around that as far as I know. In Cassandra they would require using the score as a clustering column in order to select top N scores (and paginate). That means a tombstone whenever the value for a key in the set changes. In sets with high rates of change that means a lot of tombstones and thus terrible performance. On Sat, Jan 14, 2017 at 9:40 AM DuyHai Doan mailto:doanduy...@gmail.com>> wrote: Sorting on an "incremented" numeric value has always been a nightmare to be done properly in C* Either use Counter type but then no sorting is possible since counter cannot be used as type for clustering column (which allows sort) Or use simple numeric type on clustering column but then to increment the value *concurrently* and *safely* it's prohibitive (SELECT to fetch current value + UPDATE ... IF value = ) + retry On Sat, Jan 14, 2017 at 8:54 AM, Benjamin Roth mailto:benjamin.r...@jaumo.com>> wrote: If your proposed solution is crazy depends on your needs :) It sounds like you can live with not-realtime data. 
So it is ok to cache it. Why preproduce the results if you only need 5% of them? Why not use redis as a cache with expiring sorted sets that are filled on demand from cassandra partitions with counters? So redis has much less to do and can scale much better. And you are not limited on keeping all data in ram as cache data is volatile and can be evicted on demand. If this is effective also depends on the size of your sets. CS wont be able to sort them by score for you, so you will have to load the complete set to redis for caching and / or do sorting in your app on demand. This certainly won't work out well with sets with millions of entries. 2017-01-13 23:14 GMT+01:00 Mike Torra mailto:mto...@demandware.com>>: We currently use redis to store sorted sets that we increment many, many times more than we read. For example, only about 5% of these sets are ever read. We are getting to the point where redis is becoming difficult to scale (currently at >20 nodes). We've started using cassandra for other things, and now we are experimenting to see if having a similar 'sorted set' data structure is feasible in cassandra. My approach so far is: 1. Use a counter CF to store the values I wan
lots of connection timeouts around same time every day
Hi there - Cluster info: C* 3.9, replicated across 4 EC2 regions (us-east-1, us-west-2, eu-west-1, ap-southeast-1), c4.4xlarge Around the same time every day (~7-8am EST), 2 DC's (eu-west-1 and ap-southeast-1) in our cluster start experiencing a high number of timeouts (Connection.TotalTimeouts metric). The issue seems to occur equally on all nodes in the impacted DC. I'm trying to track down exactly what is timing out, and what is causing it to happen. With debug logs, I can see many messages like this: DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 Gossiper.java:337 - Convicting /xx.xx.xx.xx with status NORMAL - alive false DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 Gossiper.java:337 - Convicting /xx.xx.xx.xx with status removed - alive false DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 Gossiper.java:337 - Convicting /xx.xx.xx.xx with status shutdown - alive false The 'status removed' node I `nodetool remove`'d from the cluster, so I'm not sure why that appears. The node mentioned in the 'status NORMAL' line has constant warnings like this: WARN [GossipTasks:1] 2017-02-16 15:40:02,845 Gossiper.java:771 - Gossip stage has 453589 pending tasks; skipping status check (no nodes will be marked down) These lines seem to go away after restarting that node, and on the original node, the 'Convicting' lines go away as well. However, the timeout counts do not seem to change. Why does restarting the node seem to fix gossip falling behind? There are also a lot of debug log messages like this: DEBUG [GossipStage:1] 2017-02-16 15:45:04,849 FailureDetector.java:456 - Ignoring interval time of 2355580769 for /xx.xx.xx.xx Could these be related to the high number of timeouts I see? I've also tried increasing the value of phi_convict_threshold to 12, as suggested here: https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archDataDistributeFailDetect.html. This does not seem to have changed anything on the nodes that I've changed it on. I appreciate any suggestions on what else to try in order to track down these timeouts. - Mike
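For reference, a sketch of where the knobs and checks mentioned above live; the threshold value shown is the one from the message, not a recommendation:

    # cassandra.yaml: raise the failure detector threshold (restart required)
    phi_convict_threshold: 12

    # on a suspect node: is the Gossip stage backing up, and what state does it see?
    nodetool tpstats | grep -i -e "Pool Name" -e Gossip
    nodetool gossipinfo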
Re: lots of connection timeouts around same time every day
I can't say that I have tried that while the issue is going on, but I have done such rolling restarts for sure, and the timeouts still occur every day. What would a rolling restart do to fix the issue? In fact, as I write this, I am restarting each node one by one in the eu-west-1 datacenter, and in us-east-1 I am seeing lots of timeouts - both the metrics 'Connection.TotalTimeouts.m1_rate' and 'ClientRequest.Latency.Read.p999' flatlining at ~6s. Why would restarting in one datacenter impact reads in another? Any suggestions on what to investigate next, or what changes to try in the cluster? Happy to provide any more info as well :) On Fri, Feb 17, 2017 at 6:05 AM, kurt greaves wrote: > have you tried a rolling restart of the entire DC? >
Significant drop in storage load after 2.1.6->2.1.8 upgrade
Hi all, I've been upgrading several of our rings from 2.1.6 to 2.1.8 and I've noticed that after the upgrade our storage load drops significantly (I've seen up to an 80% drop). I believe most of the data that is dropped is tombstoned (via TTL expiration) and I haven't detected any data loss yet. However, can someone point me to what changed between 2.1.6 and 2.1.8 that would lead to such a significant drop in tombstoned data? Looking at the changelog there's nothing that jumps out at me. This is a CF definition from one of the CFs that had a significant drop:

> describe measures_mid_1;

CREATE TABLE "Metrics".measures_mid_1 (
    key blob,
    c1 int,
    c2 blob,
    c3 blob,
    PRIMARY KEY (key, c1, c2)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (c1 ASC, c2 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 0
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

Thanks, Mike -- Mike Heffner Librato, Inc.
Re: Significant drop in storage load after 2.1.6->2.1.8 upgrade
Nate, Thanks. I dug through the changes a bit more and I believe my original observation may have been due to: https://github.com/krummas/cassandra/commit/fbc47e3b950949a8aa191bc7e91eb6cb396fe6a8 from: https://issues.apache.org/jira/browse/CASSANDRA-9572 I had originally passed over it because we are not using DTCS, but it matches since the upgrade appeared to only drop fully expired sstables. Mike On Sat, Jul 18, 2015 at 3:40 PM, Nate McCall wrote: > Perhaps https://issues.apache.org/jira/browse/CASSANDRA-9592 got > compactions moving forward for you? This would explain the drop. > > However, the discussion on > https://issues.apache.org/jira/browse/CASSANDRA-9683 seems to be similar > to what you saw and that is currently being investigated. > > On Fri, Jul 17, 2015 at 10:24 AM, Mike Heffner wrote: > >> Hi all, >> >> I've been upgrading several of our rings from 2.1.6 to 2.1.8 and I've >> noticed that after the upgrade our storage load drops significantly (I've >> seen up to an 80% drop). >> >> I believe most of the data that is dropped is tombstoned (via TTL >> expiration) and I haven't detected any data loss yet. However, can someone >> point me to what changed between 2.1.6 and 2.1.8 that would lead to such a >> significant drop in tombstoned data? Looking at the changelog there's >> nothing that jumps out at me. This is a CF definition from one of the CFs >> that had a significant drop: >> >> > describe measures_mid_1; >> >> CREATE TABLE "Metrics".measures_mid_1 ( >> key blob, >> c1 int, >> c2 blob, >> c3 blob, >> PRIMARY KEY (key, c1, c2) >> ) WITH COMPACT STORAGE >> AND CLUSTERING ORDER BY (c1 ASC, c2 ASC) >> AND bloom_filter_fp_chance = 0.01 >> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}' >> AND comment = '' >> AND compaction = {'class': >> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'} >> AND compression = {'sstable_compression': >> 'org.apache.cassandra.io.compress.LZ4Compressor'} >> AND dclocal_read_repair_chance = 0.1 >> AND default_time_to_live = 0 >> AND gc_grace_seconds = 0 >> AND max_index_interval = 2048 >> AND memtable_flush_period_in_ms = 0 >> AND min_index_interval = 128 >> AND read_repair_chance = 0.0 >> AND speculative_retry = '99.0PERCENTILE'; >> >> Thanks, >> >> Mike >> >> -- >> >> Mike Heffner >> Librato, Inc. >> >> > > > -- > - > Nate McCall > Austin, TX > @zznate > > Co-Founder & Sr. Technical Consultant > Apache Cassandra Consulting > http://www.thelastpickle.com > -- Mike Heffner Librato, Inc.
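For anyone wanting to confirm the same effect on their own ring, a hedged sketch using the sstablemetadata tool that ships with Cassandra (the data paths are illustrative):

    # rough estimate of how much of each sstable is already-expired, droppable data
    sstablemetadata /var/lib/cassandra/data/Metrics/measures_mid_1-*/*-Data.db | grep -i droppable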
Debugging write timeouts on Cassandra 2.2.5
Hi all, We've recently embarked on a project to update our Cassandra infrastructure running on EC2. We are long time users of 2.0.x and are testing out a move to version 2.2.5 running on VPC with EBS. Our test setup is a 3 node, RF=3 cluster supporting a small write load (mirror of our staging load). We are writing at QUORUM and while p95's look good compared to our staging 2.0.x cluster, we are seeing frequent write operations that time out at the max write_request_timeout_in_ms (10 seconds). CPU across the cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running with the Oracle JDK 8u60 and we're using G1GC and any GC pauses are less than 500ms. We run on c4.2xl instances with GP2 EBS attached storage for data and commitlog directories. The nodes are using EC2 enhanced networking and have the latest Intel network driver module. We are running on HVM instances using Ubuntu 14.04.2. Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar to the definition here: https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a This is our cassandra.yaml: https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml Like I mentioned we use 8u60 with G1GC and have used many of the GC settings in Al Tobey's tuning guide. This is our upstart config with JVM and other CPU settings: https://gist.github.com/mheffner/dc44613620b25c4fa46d We've used several of the sysctl settings from Al's guide as well: https://gist.github.com/mheffner/ea40d58f58a517028152 Our client application is able to write using either Thrift batches using Asytanax driver or CQL async INSERT's using the Datastax Java driver. For testing against Thrift (our legacy infra uses this) we write batches of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is around 45ms but our maximum (p100) sits less than 150ms except when it periodically spikes to the full 10seconds. Testing the same write path using CQL writes instead demonstrates similar behavior. Low p99s except for periodic full timeouts. We enabled tracing for several operations but were unable to get a trace that completed successfully -- Cassandra started logging many messages as: INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross node timeout And all the traces contained rows with a "null" source_elapsed row: https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out We've exhausted as many configuration option permutations that we can think of. This cluster does not appear to be under any significant load and latencies seem to largely fall in two bands: low normal or max timeout. This seems to imply that something is getting stuck and timing out at the max write timeout. Any suggestions on what to look for? We had debug enabled for awhile but we didn't see any msg that pointed to something obvious. Happy to provide any more information that may help. We are pretty much at the point of sprinkling debug around the code to track down what could be blocking. Thanks, Mike -- Mike Heffner Librato, Inc.
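A few low-impact, runtime-only checks that can help narrow down this kind of tail latency; a sketch, not a diagnosis:

    # trace a small sample of live traffic instead of tracing individual queries
    nodetool settraceprobability 0.001
    # dropped messages and backed-up stages around the time of a 10s write
    nodetool tpstats
    # coordinator-level latency percentiles as Cassandra itself measures them
    nodetool proxyhistograms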
Re: Debugging write timeouts on Cassandra 2.2.5
Paulo, Thanks for the suggestion, we ran some tests against CMS and saw the same timeouts. On that note though, we are going to try doubling the instance sizes and testing with double the heap (even though current usage is low). Mike On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta wrote: > Are you using the same GC settings as the staging 2.0 cluster? If not, > could you try using the default GC settings (CMS) and see if that changes > anything? This is just a wild guess, but there were reports before of > G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403 > for more context). Please ignore if you already tried reverting back to CMS. > > 2016-02-10 16:51 GMT-03:00 Mike Heffner : > >> Hi all, >> >> We've recently embarked on a project to update our Cassandra >> infrastructure running on EC2. We are long time users of 2.0.x and are >> testing out a move to version 2.2.5 running on VPC with EBS. Our test setup >> is a 3 node, RF=3 cluster supporting a small write load (mirror of our >> staging load). >> >> We are writing at QUORUM and while p95's look good compared to our >> staging 2.0.x cluster, we are seeing frequent write operations that time >> out at the max write_request_timeout_in_ms (10 seconds). CPU across the >> cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running >> with the Oracle JDK 8u60 and we're using G1GC and any GC pauses are less >> than 500ms. >> >> We run on c4.2xl instances with GP2 EBS attached storage for data and >> commitlog directories. The nodes are using EC2 enhanced networking and have >> the latest Intel network driver module. We are running on HVM instances >> using Ubuntu 14.04.2. >> >> Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar >> to the definition here: >> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a >> >> This is our cassandra.yaml: >> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml >> >> Like I mentioned we use 8u60 with G1GC and have used many of the GC >> settings in Al Tobey's tuning guide. This is our upstart config with JVM >> and other CPU settings: >> https://gist.github.com/mheffner/dc44613620b25c4fa46d >> >> We've used several of the sysctl settings from Al's guide as well: >> https://gist.github.com/mheffner/ea40d58f58a517028152 >> >> Our client application is able to write using either Thrift batches using >> Asytanax driver or CQL async INSERT's using the Datastax Java driver. >> >> For testing against Thrift (our legacy infra uses this) we write batches >> of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is >> around 45ms but our maximum (p100) sits less than 150ms except when it >> periodically spikes to the full 10seconds. >> >> Testing the same write path using CQL writes instead demonstrates similar >> behavior. Low p99s except for periodic full timeouts. We enabled tracing >> for several operations but were unable to get a trace that completed >> successfully -- Cassandra started logging many messages as: >> >> INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages >> were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross >> node timeout >> >> And all the traces contained rows with a "null" source_elapsed row: >> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out >> >> >> We've exhausted as many configuration option permutations that we can >> think of. 
This cluster does not appear to be under any significant load and >> latencies seem to largely fall in two bands: low normal or max timeout. >> This seems to imply that something is getting stuck and timing out at the >> max write timeout. >> >> Any suggestions on what to look for? We had debug enabled for awhile but >> we didn't see any msg that pointed to something obvious. Happy to provide >> any more information that may help. >> >> We are pretty much at the point of sprinkling debug around the code to >> track down what could be blocking. >> >> >> Thanks, >> >> Mike >> >> -- >> >> Mike Heffner >> Librato, Inc. >> >> > -- Mike Heffner Librato, Inc.
Re: Debugging write timeouts on Cassandra 2.2.5
Jeff, We have both commitlog and data on a 4TB EBS with 10k IOPS. Mike On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa wrote: > What disk size are you using? > > > > From: Mike Heffner > Reply-To: "user@cassandra.apache.org" > Date: Wednesday, February 10, 2016 at 2:24 PM > To: "user@cassandra.apache.org" > Cc: Peter Norton > Subject: Re: Debugging write timeouts on Cassandra 2.2.5 > > Paulo, > > Thanks for the suggestion, we ran some tests against CMS and saw the same > timeouts. On that note though, we are going to try doubling the instance > sizes and testing with double the heap (even though current usage is low). > > Mike > > On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta > wrote: > >> Are you using the same GC settings as the staging 2.0 cluster? If not, >> could you try using the default GC settings (CMS) and see if that changes >> anything? This is just a wild guess, but there were reports before of >> G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403 >> for more context). Please ignore if you already tried reverting back to CMS. >> >> 2016-02-10 16:51 GMT-03:00 Mike Heffner : >> >>> Hi all, >>> >>> We've recently embarked on a project to update our Cassandra >>> infrastructure running on EC2. We are long time users of 2.0.x and are >>> testing out a move to version 2.2.5 running on VPC with EBS. Our test setup >>> is a 3 node, RF=3 cluster supporting a small write load (mirror of our >>> staging load). >>> >>> We are writing at QUORUM and while p95's look good compared to our >>> staging 2.0.x cluster, we are seeing frequent write operations that time >>> out at the max write_request_timeout_in_ms (10 seconds). CPU across the >>> cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running >>> with the Oracle JDK 8u60 and we're using G1GC and any GC pauses are less >>> than 500ms. >>> >>> We run on c4.2xl instances with GP2 EBS attached storage for data and >>> commitlog directories. The nodes are using EC2 enhanced networking and have >>> the latest Intel network driver module. We are running on HVM instances >>> using Ubuntu 14.04.2. >>> >>> Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar >>> to the definition here: >>> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a >>> >>> This is our cassandra.yaml: >>> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml >>> >>> Like I mentioned we use 8u60 with G1GC and have used many of the GC >>> settings in Al Tobey's tuning guide. This is our upstart config with JVM >>> and other CPU settings: >>> https://gist.github.com/mheffner/dc44613620b25c4fa46d >>> >>> We've used several of the sysctl settings from Al's guide as well: >>> https://gist.github.com/mheffner/ea40d58f58a517028152 >>> >>> Our client application is able to write using either Thrift batches >>> using Asytanax driver or CQL async INSERT's using the Datastax Java driver. >>> >>> For testing against Thrift (our legacy infra uses this) we write batches >>> of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is >>> around 45ms but our maximum (p100) sits less than 150ms except when it >>> periodically spikes to the full 10seconds. >>> >>> Testing the same write path using CQL writes instead demonstrates >>> similar behavior. Low p99s except for periodic full timeouts. 
We enabled >>> tracing for several operations but were unable to get a trace that >>> completed successfully -- Cassandra started logging many messages as: >>> >>> INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages >>> were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross >>> node timeout >>> >>> And all the traces contained rows with a "null" source_elapsed row: >>> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out >>> >>> >>> We've exhausted as many configuration option permutations that we can >>> think of. This cluster does not appear to be under any significant load and >>> latencies seem to largely fall in two bands: low normal or max timeout. >>> This seems to imply that something is getting stuck and timing out at the >>> max write timeout. >>> >>> Any suggestions on what to look for? We had debug enabled for awhile but >>> we didn't see any msg that pointed to something obvious. Happy to provide >>> any more information that may help. >>> >>> We are pretty much at the point of sprinkling debug around the code to >>> track down what could be blocking. >>> >>> >>> Thanks, >>> >>> Mike >>> >>> -- >>> >>> Mike Heffner >>> Librato, Inc. >>> >>> >> > > > -- > > Mike Heffner > Librato, Inc. > > -- Mike Heffner Librato, Inc.
Re: Debugging write timeouts on Cassandra 2.2.5
Jaydeep, No, we don't use any light weight transactions. Mike On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia < chovatia.jayd...@gmail.com> wrote: > Are you guys using light weight transactions in your write path? > > On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat < > fabrice.faco...@gmail.com> wrote: > >> Are your commitlog and data on the same disk ? If yes, you should put >> commitlogs on a separate disk which don't have a lot of IO. >> >> Others IO may have great impact impact on your commitlog writing and >> it may even block. >> >> An example of impact IO may have, even for Async writes: >> >> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic >> >> 2016-02-11 0:31 GMT+01:00 Mike Heffner : >> > Jeff, >> > >> > We have both commitlog and data on a 4TB EBS with 10k IOPS. >> > >> > Mike >> > >> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa > > >> > wrote: >> >> >> >> What disk size are you using? >> >> >> >> >> >> >> >> From: Mike Heffner >> >> Reply-To: "user@cassandra.apache.org" >> >> Date: Wednesday, February 10, 2016 at 2:24 PM >> >> To: "user@cassandra.apache.org" >> >> Cc: Peter Norton >> >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5 >> >> >> >> Paulo, >> >> >> >> Thanks for the suggestion, we ran some tests against CMS and saw the >> same >> >> timeouts. On that note though, we are going to try doubling the >> instance >> >> sizes and testing with double the heap (even though current usage is >> low). >> >> >> >> Mike >> >> >> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta > > >> >> wrote: >> >>> >> >>> Are you using the same GC settings as the staging 2.0 cluster? If not, >> >>> could you try using the default GC settings (CMS) and see if that >> changes >> >>> anything? This is just a wild guess, but there were reports before of >> >>> G1-caused instabilities with small heap sizes (< 16GB - see >> CASSANDRA-10403 >> >>> for more context). Please ignore if you already tried reverting back >> to CMS. >> >>> >> >>> 2016-02-10 16:51 GMT-03:00 Mike Heffner : >> >>>> >> >>>> Hi all, >> >>>> >> >>>> We've recently embarked on a project to update our Cassandra >> >>>> infrastructure running on EC2. We are long time users of 2.0.x and >> are >> >>>> testing out a move to version 2.2.5 running on VPC with EBS. Our >> test setup >> >>>> is a 3 node, RF=3 cluster supporting a small write load (mirror of >> our >> >>>> staging load). >> >>>> >> >>>> We are writing at QUORUM and while p95's look good compared to our >> >>>> staging 2.0.x cluster, we are seeing frequent write operations that >> time out >> >>>> at the max write_request_timeout_in_ms (10 seconds). CPU across the >> cluster >> >>>> is < 10% and EBS write load is < 100 IOPS. Cassandra is running with >> the >> >>>> Oracle JDK 8u60 and we're using G1GC and any GC pauses are less than >> 500ms. >> >>>> >> >>>> We run on c4.2xl instances with GP2 EBS attached storage for data and >> >>>> commitlog directories. The nodes are using EC2 enhanced networking >> and have >> >>>> the latest Intel network driver module. We are running on HVM >> instances >> >>>> using Ubuntu 14.04.2. >> >>>> >> >>>> Our schema is 5 tables, all with COMPACT STORAGE. 
Each table is >> similar >> >>>> to the definition here: >> >>>> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a >> >>>> >> >>>> This is our cassandra.yaml: >> >>>> >> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml >> >>>> >> >>>> Like I mentioned we use 8u60 with G1GC and have used many of the GC >> >>>> settings in Al Tobey's tuning guide. This is our upstart config with >> JVM and >> >>>> other CPU settings: >> https://gist.github.com
Re: Debugging write timeouts on Cassandra 2.2.5
Following up from our earlier post... We have continued to do exhaustive testing and measuring of the numerous hardware and configuration variables here. What we have uncovered is that on identical hardware (including the configuration we run in production), something between versions 2.0.17 and 2.1.13 introduced this write timeout for our workload. We still aren't any closer to identifying the what or why, but it is easily reproduced using our workload when we bump to the 2.1.x release line. At the moment we are going to focus on hardening this new hardware configuration using the 2.0.17 release and roll it out internally to some of our production rings. We also want to bisect the 2.1.x release line to find if there was a particular point release that introduced the timeout. If anyone has suggestions for particular changes to look out for we'd be happy to focus a test on that earlier. Thanks, Mike On Wed, Feb 10, 2016 at 2:51 PM, Mike Heffner wrote: > Hi all, > > We've recently embarked on a project to update our Cassandra > infrastructure running on EC2. We are long time users of 2.0.x and are > testing out a move to version 2.2.5 running on VPC with EBS. Our test setup > is a 3 node, RF=3 cluster supporting a small write load (mirror of our > staging load). > > We are writing at QUORUM and while p95's look good compared to our staging > 2.0.x cluster, we are seeing frequent write operations that time out at the > max write_request_timeout_in_ms (10 seconds). CPU across the cluster is < > 10% and EBS write load is < 100 IOPS. Cassandra is running with the Oracle > JDK 8u60 and we're using G1GC and any GC pauses are less than 500ms. > > We run on c4.2xl instances with GP2 EBS attached storage for data and > commitlog directories. The nodes are using EC2 enhanced networking and have > the latest Intel network driver module. We are running on HVM instances > using Ubuntu 14.04.2. > > Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar to > the definition here: https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a > > This is our cassandra.yaml: > https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml > > Like I mentioned we use 8u60 with G1GC and have used many of the GC > settings in Al Tobey's tuning guide. This is our upstart config with JVM > and other CPU settings: > https://gist.github.com/mheffner/dc44613620b25c4fa46d > > We've used several of the sysctl settings from Al's guide as well: > https://gist.github.com/mheffner/ea40d58f58a517028152 > > Our client application is able to write using either Thrift batches using > Asytanax driver or CQL async INSERT's using the Datastax Java driver. > > For testing against Thrift (our legacy infra uses this) we write batches > of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is > around 45ms but our maximum (p100) sits less than 150ms except when it > periodically spikes to the full 10seconds. > > Testing the same write path using CQL writes instead demonstrates similar > behavior. Low p99s except for periodic full timeouts. 
We enabled tracing > for several operations but were unable to get a trace that completed > successfully -- Cassandra started logging many messages as: > > INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages > were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross > node timeout > > And all the traces contained rows with a "null" source_elapsed row: > https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out > > > We've exhausted as many configuration option permutations that we can > think of. This cluster does not appear to be under any significant load and > latencies seem to largely fall in two bands: low normal or max timeout. > This seems to imply that something is getting stuck and timing out at the > max write timeout. > > Any suggestions on what to look for? We had debug enabled for awhile but > we didn't see any msg that pointed to something obvious. Happy to provide > any more information that may help. > > We are pretty much at the point of sprinkling debug around the code to > track down what could be blocking. > > > Thanks, > > Mike > > -- > > Mike Heffner > Librato, Inc. > > -- Mike Heffner Librato, Inc.
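A sketch of how point releases can be pinned on the test nodes for that bisection, assuming they install from a Debian repository that still carries the older 2.1.x builds (the version string is only an example):

    # list the point releases the configured repo offers
    apt-cache madison cassandra
    # install a specific candidate on the test ring, then replay the write workload
    sudo apt-get install cassandra=2.1.7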
Re: Debugging write timeouts on Cassandra 2.2.5
Alain, Thanks for the suggestions. Sure, tpstats are here: https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the metrics across the ring, there were no blocked tasks nor dropped messages. Iowait metrics look fine, so it doesn't appear to be blocking on disk. Similarly, there are no long GC pauses. We haven't noticed latency on any particular table higher than others or correlated around the occurrence of a timeout. We have noticed with further testing that running cassandra-stress against the ring, while our workload is writing to the same ring, will incur similar 10 second timeouts. If our workload is not writing to the ring, cassandra stress will run without hitting timeouts. This seems to imply that our workload pattern is causing something to block cluster-wide, since the stress tool writes to a different keyspace then our workload. I mentioned in another reply that we've tracked it to something between 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was introduced in. Cheers, Mike On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ wrote: > Hi Mike, > > What about the output of tpstats ? I imagine you have dropped messages > there. Any blocked threads ? Could you paste this output here ? > > May this be due to some network hiccup to access the disks as they are EBS > ? Can you think of anyway of checking this ? Do you have a lot of GC logs, > how long are the pauses (use something like: grep -i 'GCInspector' > /var/log/cassandra/system.log) ? > > Something else you could check are local_writes stats to see if only one > table if affected or this is keyspace / cluster wide. You can use metrics > exposed by cassandra or if you have no dashboards I believe a: 'nodetool > cfstats | grep -e 'Table:' -e 'Local'' should give you a rough idea > of local latencies. > > Those are just things I would check, I have not a clue on what is > happening here, hope this will help. > > C*heers, > - > Alain Rodriguez > France > > The Last Pickle > http://www.thelastpickle.com > > 2016-02-18 5:13 GMT+01:00 Mike Heffner : > >> Jaydeep, >> >> No, we don't use any light weight transactions. >> >> Mike >> >> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia < >> chovatia.jayd...@gmail.com> wrote: >> >>> Are you guys using light weight transactions in your write path? >>> >>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat < >>> fabrice.faco...@gmail.com> wrote: >>> >>>> Are your commitlog and data on the same disk ? If yes, you should put >>>> commitlogs on a separate disk which don't have a lot of IO. >>>> >>>> Others IO may have great impact impact on your commitlog writing and >>>> it may even block. >>>> >>>> An example of impact IO may have, even for Async writes: >>>> >>>> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic >>>> >>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner : >>>> > Jeff, >>>> > >>>> > We have both commitlog and data on a 4TB EBS with 10k IOPS. >>>> > >>>> > Mike >>>> > >>>> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa < >>>> jeff.ji...@crowdstrike.com> >>>> > wrote: >>>> >> >>>> >> What disk size are you using? 
>>>> >> >>>> >> >>>> >> >>>> >> From: Mike Heffner >>>> >> Reply-To: "user@cassandra.apache.org" >>>> >> Date: Wednesday, February 10, 2016 at 2:24 PM >>>> >> To: "user@cassandra.apache.org" >>>> >> Cc: Peter Norton >>>> >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5 >>>> >> >>>> >> Paulo, >>>> >> >>>> >> Thanks for the suggestion, we ran some tests against CMS and saw the >>>> same >>>> >> timeouts. On that note though, we are going to try doubling the >>>> instance >>>> >> sizes and testing with double the heap (even though current usage is >>>> low). >>>> >> >>>> >> Mike >>>> >> >>>> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta < >>>> pauloricard...@gmail.com> >>>> >> wrote: >>>> >>> >>>> >>> Are you using the same GC settings a
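For anyone trying to reproduce the cross-keyspace interaction, an illustrative cassandra-stress invocation; it writes to the tool's own default keyspace1.standard1 table, and the host and thread count are placeholders:

    cassandra-stress write n=1000000 cl=quorum -rate threads=50 -node 10.0.0.10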
Re: Debugging write timeouts on Cassandra 2.2.5
Anuj, So we originally started testing with Java8 + G1, however we were able to reproduce the same results with the default CMS settings that ship in the cassandra-env.sh from the Deb pkg. We didn't detect any large GC pauses during the runs. Query pattern during our testing was 100% writes, batching (via Thrift mostly) to 5 tables, between 6-1500 rows per batch. Mike On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra wrote: > Whats the GC overhead? Can you your share your GC collector and settings ? > > > Whats your query pattern? Do you use secondary indexes, batches, in clause > etc? > > > Anuj > > > Sent from Yahoo Mail on Android > <https://overview.mail.yahoo.com/mobile/?.src=Android> > > On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner > wrote: > Alain, > > Thanks for the suggestions. > > Sure, tpstats are here: > https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the > metrics across the ring, there were no blocked tasks nor dropped messages. > > Iowait metrics look fine, so it doesn't appear to be blocking on disk. > Similarly, there are no long GC pauses. > > We haven't noticed latency on any particular table higher than others or > correlated around the occurrence of a timeout. We have noticed with further > testing that running cassandra-stress against the ring, while our workload > is writing to the same ring, will incur similar 10 second timeouts. If our > workload is not writing to the ring, cassandra stress will run without > hitting timeouts. This seems to imply that our workload pattern is causing > something to block cluster-wide, since the stress tool writes to a > different keyspace then our workload. > > I mentioned in another reply that we've tracked it to something between > 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was > introduced in. > > Cheers, > > Mike > > On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ > wrote: > >> Hi Mike, >> >> What about the output of tpstats ? I imagine you have dropped messages >> there. Any blocked threads ? Could you paste this output here ? >> >> May this be due to some network hiccup to access the disks as they are >> EBS ? Can you think of anyway of checking this ? Do you have a lot of GC >> logs, how long are the pauses (use something like: grep -i 'GCInspector' >> /var/log/cassandra/system.log) ? >> >> Something else you could check are local_writes stats to see if only one >> table if affected or this is keyspace / cluster wide. You can use metrics >> exposed by cassandra or if you have no dashboards I believe a: 'nodetool >> cfstats | grep -e 'Table:' -e 'Local'' should give you a rough idea >> of local latencies. >> >> Those are just things I would check, I have not a clue on what is >> happening here, hope this will help. >> >> C*heers, >> - >> Alain Rodriguez >> France >> >> The Last Pickle >> http://www.thelastpickle.com >> >> 2016-02-18 5:13 GMT+01:00 Mike Heffner : >> >>> Jaydeep, >>> >>> No, we don't use any light weight transactions. >>> >>> Mike >>> >>> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia < >>> chovatia.jayd...@gmail.com> wrote: >>> >>>> Are you guys using light weight transactions in your write path? >>>> >>>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat < >>>> fabrice.faco...@gmail.com> wrote: >>>> >>>>> Are your commitlog and data on the same disk ? If yes, you should put >>>>> commitlogs on a separate disk which don't have a lot of IO. >>>>> >>>>> Others IO may have great impact impact on your commitlog writing and >>>>> it may even block. 
>>>>> >>>>> An example of impact IO may have, even for Async writes: >>>>> >>>>> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic >>>>> >>>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner : >>>>> > Jeff, >>>>> > >>>>> > We have both commitlog and data on a 4TB EBS with 10k IOPS. >>>>> > >>>>> > Mike >>>>> > >>>>> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa < >>>>> jeff.ji...@crowdstrike.com> >>>>> > wrote: >>>>> >> >>>>> >> What disk size are you using?
Re: Debugging write timeouts on Cassandra 2.2.5
Nate, So we have run several install tests, bisecting the 2.1.x release line, and we believe that the regression was introduced in version 2.1.5. This is the first release that clearly hits the timeout for us. It looks like quite a large release, so our next step will likely be bisecting the major commits to see if we can narrow it down: https://github.com/apache/cassandra/blob/3c0a337ebc90b0d99349d0aa152c92b5b3494d8c/CHANGES.txt. Obviously, any suggestions on potential suspects appreciated. These are the memtable settings we've configured diff from the defaults during our testing: memtable_allocation_type: offheap_objects memtable_flush_writers: 8 Cheers, Mike On Fri, Feb 19, 2016 at 1:46 PM, Nate McCall wrote: > The biggest change which *might* explain your behavior has to do with the > changes in memtable flushing between 2.0 and 2.1: > https://issues.apache.org/jira/browse/CASSANDRA-5549 > > However, the tpstats you posted shows no dropped mutations which would > make me more certain of this as the cause. > > What values do you have right now for each of these (my recommendations > for each on a c4.2xl with stock cassandra-env.sh are in parenthesis): > > - memtable_flush_writers (2) > - memtable_heap_space_in_mb (2048) > - memtable_offheap_space_in_mb (2048) > - memtable_cleanup_threshold (0.11) > - memtable_allocation_type (offheap_objects) > > The biggest win IMO will be moving to offheap_objects. By default, > everything is on heap. Regardless, spending some time tuning these for your > workload will pay off. > > You may also want to be explicit about > > - native_transport_max_concurrent_connections > - native_transport_max_concurrent_connections_per_ip > > Depending on the driver, these may now be allowing 32k streams per > connection(!) as detailed in v3 of the native protocol: > > https://github.com/apache/cassandra/blob/cassandra-2.1/doc/native_protocol_v3.spec#L130-L152 > > > > On Fri, Feb 19, 2016 at 8:48 AM, Mike Heffner wrote: > >> Anuj, >> >> So we originally started testing with Java8 + G1, however we were able to >> reproduce the same results with the default CMS settings that ship in the >> cassandra-env.sh from the Deb pkg. We didn't detect any large GC pauses >> during the runs. >> >> Query pattern during our testing was 100% writes, batching (via Thrift >> mostly) to 5 tables, between 6-1500 rows per batch. >> >> Mike >> >> On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra >> wrote: >> >>> Whats the GC overhead? Can you your share your GC collector and settings >>> ? >>> >>> >>> Whats your query pattern? Do you use secondary indexes, batches, in >>> clause etc? >>> >>> >>> Anuj >>> >>> >>> Sent from Yahoo Mail on Android >>> <https://overview.mail.yahoo.com/mobile/?.src=Android> >>> >>> On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner >>> wrote: >>> Alain, >>> >>> Thanks for the suggestions. >>> >>> Sure, tpstats are here: >>> https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the >>> metrics across the ring, there were no blocked tasks nor dropped messages. >>> >>> Iowait metrics look fine, so it doesn't appear to be blocking on disk. >>> Similarly, there are no long GC pauses. >>> >>> We haven't noticed latency on any particular table higher than others or >>> correlated around the occurrence of a timeout. We have noticed with further >>> testing that running cassandra-stress against the ring, while our workload >>> is writing to the same ring, will incur similar 10 second timeouts. 
If our >>> workload is not writing to the ring, cassandra stress will run without >>> hitting timeouts. This seems to imply that our workload pattern is causing >>> something to block cluster-wide, since the stress tool writes to a >>> different keyspace then our workload. >>> >>> I mentioned in another reply that we've tracked it to something between >>> 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was >>> introduced in. >>> >>> Cheers, >>> >>> Mike >>> >>> On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ >>> wrote: >>> >>>> Hi Mike, >>>> >>>> What about the output of tpstats ? I imagine you have dropped messages >>>> there. Any blocked threads ? Could you paste this output here ? >>>> >>>> May this be due to some network hiccup to access the dis
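For anyone following this thread, Nate's recommendations above map onto cassandra.yaml roughly as below. These are his suggested starting points for a c4.2xl, not values validated on our cluster, so treat them as a tuning baseline rather than a prescription:

    memtable_flush_writers: 2
    memtable_heap_space_in_mb: 2048
    memtable_offheap_space_in_mb: 2048
    memtable_cleanup_threshold: 0.11
    memtable_allocation_type: offheap_objects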
Re: Consistent read timeouts for bursts of reads
Emils, I realize this may be a big downgrade, but are you timeouts reproducible under Cassandra 2.1.4? Mike On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis wrote: > Having had a read through the archives, I missed this at first, but this > seems to be *exactly* like what we're experiencing. > > http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html > > Only difference is we're getting this for reads and using CQL, but the > behaviour is identical. > > On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis > wrote: > >> Hello, >> >> We're having a problem with concurrent requests. It seems that whenever >> we try resolving more >> than ~ 15 queries at the same time, one or two get a read timeout and >> then succeed on a retry. >> >> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on >> AWS. >> >> What we've found while investigating: >> >> * this is not db-wide. Trying the same pattern against another table >> everything works fine. >> * it fails 1 or 2 requests regardless of how many are executed in >> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent >> requests and doesn't seem to scale up. >> * the problem is consistently reproducible. It happens both under >> heavier load and when just firing off a single batch of requests for >> testing. >> * tracing the faulty requests says everything is great. An example >> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a >> * the only peculiar thing in the logs is there's no acknowledgement of >> the request being accepted by the server, as seen in >> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a >> * there's nothing funny in the timed out Cassandra node's logs around >> that time as far as I can tell, not even in the debug logs. >> >> Any ideas about what might be causing this, pointers to server config >> options, or how else we might debug this would be much appreciated. >> >> Kind regards, >> Emils >> >> -- Mike Heffner Librato, Inc.
Re: Consistent read timeouts for bursts of reads
Emils, We believe we've tracked it down to the following issue: https://issues.apache.org/jira/browse/CASSANDRA-11302, introduced in 2.1.5. We are running a build of 2.2.5 with that patch and so far have not seen any more timeouts. Mike On Fri, Mar 4, 2016 at 3:14 AM, Emīls Šolmanis wrote: > Mike, > > Is that where you've bisected it to having been introduced? > > I'll see what I can do, but doubt it, since we've long since upgraded prod > to 2.2.4 (and stage before that) and the tests I'm running were for a new > feature. > > > On Fri, 4 Mar 2016 03:54 Mike Heffner, wrote: > >> Emils, >> >> I realize this may be a big downgrade, but are you timeouts reproducible >> under Cassandra 2.1.4? >> >> Mike >> >> On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis < >> emils.solma...@gmail.com> wrote: >> >>> Having had a read through the archives, I missed this at first, but this >>> seems to be *exactly* like what we're experiencing. >>> >>> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html >>> >>> Only difference is we're getting this for reads and using CQL, but the >>> behaviour is identical. >>> >>> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis >>> wrote: >>> >>>> Hello, >>>> >>>> We're having a problem with concurrent requests. It seems that whenever >>>> we try resolving more >>>> than ~ 15 queries at the same time, one or two get a read timeout and >>>> then succeed on a retry. >>>> >>>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on >>>> AWS. >>>> >>>> What we've found while investigating: >>>> >>>> * this is not db-wide. Trying the same pattern against another table >>>> everything works fine. >>>> * it fails 1 or 2 requests regardless of how many are executed in >>>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent >>>> requests and doesn't seem to scale up. >>>> * the problem is consistently reproducible. It happens both under >>>> heavier load and when just firing off a single batch of requests for >>>> testing. >>>> * tracing the faulty requests says everything is great. An example >>>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a >>>> * the only peculiar thing in the logs is there's no acknowledgement of >>>> the request being accepted by the server, as seen in >>>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a >>>> * there's nothing funny in the timed out Cassandra node's logs around >>>> that time as far as I can tell, not even in the debug logs. >>>> >>>> Any ideas about what might be causing this, pointers to server config >>>> options, or how else we might debug this would be much appreciated. >>>> >>>> Kind regards, >>>> Emils >>>> >>>> >> >> >> -- >> >> Mike Heffner >> Librato, Inc. >> >> -- Mike Heffner Librato, Inc.
Migrating data from a 0.8.8 -> 1.1.2 ring
Hi, We are migrating from a 0.8.8 ring to a 1.1.2 ring and we are noticing missing data post-migration. We use pre-built/configured AMIs so our preferred route is to leave our existing production 0.8.8 untouched and bring up a parallel 1.1.2 ring and migrate data into it. Data is written to the rings via batch processes so we can easily assure that both the existing and new rings will have the same data post migration. The ring we are migrating from is: * 12 nodes * single data-center, 3 AZs * 0.8.8 The ring we are migrating to is the same except 1.1.2. The steps we are taking are: 1. Bring up a 1.1.2 ring in the same AZ/data center configuration with tokens matching the corresponding nodes in the 0.8.8 ring. 2. Create the same keyspace on 1.1.2. 3. Create each CF in the keyspace on 1.1.2. 4. Flush each node of the 0.8.8 ring. 5. Rsync each non-compacted sstable from 0.8.8 to the corresponding node in 1.1.2. 6. Move each 0.8.8 sstable into the 1.1.2 directory structure by renaming the file to the /cassandra/data///-... format. For example, for the keyspace "Metrics" and CF "epochs_60" we get: "cassandra/data/Metrics/epochs_60/Metrics-epochs_60-g-941-Data.db". 7. On each 1.1.2 node run `nodetool -h localhost refresh Metrics ` for each CF in the keyspace. We notice that storage load jumps accordingly. 8. On each 1.1.2 node run `nodetool -h localhost upgradesstables`. This takes awhile but appears to correctly rewrite each sstable in the new 1.1.x format. Storage load drops as sstables are compressed. After these steps we run a script that validates data on the new ring. What we've noticed is that large portions of the data that was on the 0.8.8 is not available on the 1.1.2 ring. We've tried reading at both quorum and ONE, but the resulting data appears missing in both cases. We have fewer than 143 million row keys in the CFs we're testing and none of the *-Filter.db files are > 10MB, so I don't believe this is our problem: https://issues.apache.org/jira/browse/CASSANDRA-3820 Anything else to test verify? Are the steps above correct for this type of upgrade? Is this type of upgrade/migration supported? We have also tried running a repair across the cluster after step #8. While it took a few retries due to https://issues.apache.org/jira/browse/CASSANDRA-4456, we still had missing data afterwards. Any assistance would be appreciated. Thanks! Mike -- Mike Heffner Librato, Inc.
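To make steps 4-8 concrete, this is roughly what we run per node. The paths, file names and generation numbers below are examples only, and every sstable component (-Data.db, -Index.db, -Filter.db, ...) needs the same copy/rename treatment:

    # on the 0.8.8 node, after nodetool flush (steps 4-5)
    rsync -av /var/lib/cassandra/data/Metrics/epochs_60-g-941-* \
        new-node:/raid0/cassandra/data/Metrics/epochs_60/

    # on the matching 1.1.2 node, rename into the 1.1 naming scheme (step 6)
    cd /raid0/cassandra/data/Metrics/epochs_60
    for f in epochs_60-g-*; do mv "$f" "Metrics-$f"; done

    # load the files and rewrite them in the new format (steps 7-8)
    nodetool -h localhost refresh Metrics epochs_60
    nodetool -h localhost upgradesstables Metrics epochs_60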
Re: Migrating data from a 0.8.8 -> 1.1.2 ring
On Mon, Jul 23, 2012 at 1:25 PM, Mike Heffner wrote: > Hi, > > We are migrating from a 0.8.8 ring to a 1.1.2 ring and we are noticing > missing data post-migration. We use pre-built/configured AMIs so our > preferred route is to leave our existing production 0.8.8 untouched and > bring up a parallel 1.1.2 ring and migrate data into it. Data is written to > the rings via batch processes so we can easily assure that both the > existing and new rings will have the same data post migration. > > > The steps we are taking are: > > 1. Bring up a 1.1.2 ring in the same AZ/data center configuration with > tokens matching the corresponding nodes in the 0.8.8 ring. > 2. Create the same keyspace on 1.1.2. > 3. Create each CF in the keyspace on 1.1.2. > 4. Flush each node of the 0.8.8 ring. > 5. Rsync each non-compacted sstable from 0.8.8 to the corresponding node > in 1.1.2. > 6. Move each 0.8.8 sstable into the 1.1.2 directory structure by renaming > the file to the /cassandra/data///-... format. > For example, for the keyspace "Metrics" and CF "epochs_60" we get: > "cassandra/data/Metrics/epochs_60/Metrics-epochs_60-g-941-Data.db". > 7. On each 1.1.2 node run `nodetool -h localhost refresh Metrics ` for > each CF in the keyspace. We notice that storage load jumps accordingly. > 8. On each 1.1.2 node run `nodetool -h localhost upgradesstables`. This > takes awhile but appears to correctly rewrite each sstable in the new 1.1.x > format. Storage load drops as sstables are compressed. > > So, after some further testing we've observed that the `upgradesstables` command is removing data from the sstables, leading to our missing data. We've repeated the steps above with several variations: WORKS refresh -> scrub WORKS refresh -> scrub -> major compaction FAILS refresh -> upgradesstables FAILS refresh -> scrub -> upgradesstables FAILS refresh -> scrub -> major compaction -> upgradesstables So, we are able to migrate our test CFs from a 0.8.8 ring to a 1.1.2 ring when we use scrub. However, whenever we run an upgradesstables command the sstables are shrunk significantly and our tests show missing data: INFO [CompactionExecutor:4] 2012-07-24 04:27:36,837 CompactionTask.java (line 109) Compacting [SSTableReader(path='/raid0/cassandra/data/Metrics/metrics_900/Metrics-metrics_900-hd-51-Data.db')] INFO [CompactionExecutor:4] 2012-07-24 04:27:51,090 CompactionTask.java (line 221) Compacted to [/raid0/cassandra/data/Metrics/metrics_900/Metrics-metrics_900-hd-58-Data.db,]. 60,449,155 to 2,578,102 (~4% of original) bytes for 4,002 keys at 0.172562MB/s. Time: 14,248ms. Is there a scenario where upgradesstables would remove data that a scrub command wouldn't? According the documentation, it would appear that the scrub command is actually more destructive than upgradesstables in terms of removing data. On 1.1.x, upgradesstables is the documented upgrade command over a scrub. 
The keyspace is defined as: Keyspace: Metrics: Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy Durable Writes: true Options: [us-east:3] And the column family above defined as: ColumnFamily: metrics_900 Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type Default column value validator: org.apache.cassandra.db.marshal.BytesType Columns sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type) GC grace seconds: 0 Compaction min/max thresholds: 4/32 Read repair chance: 0.1 DC Local Read repair chance: 0.0 Replicate on write: true Caching: KEYS_ONLY Bloom Filter FP chance: default Built indexes: [] Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy Compression Options: sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor All rows have a TTL of 30 days, so it's possible that, along with the gc_grace=0, a small number would be removed during a compaction/scrub/upgradesstables step. However, the majority should still be kept as their TTL has not expired yet. We are still experimenting to see under what conditions this happens, but I thought I'd send out some more info in case there is something clearly wrong we're doing here. Thanks, Mike -- Mike Heffner Librato, Inc.
Composite Column Slice query, wildcard first component?
Hello, Given a row like this "key1" => (A:A:C), (A:A:B), (B:A:C), (B:C:D) Is there a way to create a slice query that returns all columns where the _second_ component is A? That is, I would like to get back the following columns by asking for columns where component[0] = * and component[1] = A (A:A:C), (A:A:B), (B:A:C) I could do some iteration and figure this out in more of a brute force manner, I'm just curious if there's anything built in that might be more efficient Thanks! Mike
Re: Hinted Handoff runs every ten minutes
Is there a ticket open for this for 1.1.6? We also noticed this after upgrading from 1.1.3 to 1.1.6. Every node runs a 0 row hinted handoff every 10 minutes. N-1 nodes hint to the same node, while that node hints to another node. On Tue, Oct 30, 2012 at 1:35 PM, Vegard Berget wrote: > Hi, > > I have the exact same problem with 1.1.6. HintsColumnFamily consists of > one row (Rowkey 00, nothing more). The "problem" started after upgrading > from 1.1.4 to 1.1.6. Every ten minutes HintedHandoffManager starts and > finishes after sending "0 rows". > > .vegard, > > > > - Original Message - > From: > user@cassandra.apache.org > > To: > > Cc: > > Sent: > Mon, 29 Oct 2012 23:45:30 +0100 > > Subject: > Re: Hinted Handoff runs every ten minutes > > > On 29.10.2012 23:24, Stephen Pierce wrote: > > I'm running 1.1.5; the bug says it's fixed in 1.0.9/1.1.0. > > > > How can I check to see why it keeps running HintedHandoff? > you have a tombstone in system.HintsColumnFamily; use the list command in > cassandra-cli to check > > -- Mike Heffner Librato, Inc.
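For anyone who wants to run the check suggested at the end of that thread, the cassandra-cli session looks roughly like this (assuming the default Thrift port 9160):

    cassandra-cli -h localhost -p 9160
    [default@unknown] use system;
    [default@system] list HintsColumnFamily;

A lone row with key 00 and nothing else in it matches the symptom Vegard describes.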
Re: Upgrade 1.1.2 -> 1.1.6
Alain, We performed a 1.1.3 -> 1.1.6 upgrade and found that all the logs replayed regardless of the drain. After noticing this on the first node, we did the following: * nodetool flush * nodetool drain * service cassandra stop * mv /path/to/logs/*.log /backup/ * apt-get install cassandra I also agree that starting C* after an upgrade/install seems quite broken if it was already stopped before the install. However annoying, I have found this to be the default for most Ubuntu daemon packages. Mike On Thu, Nov 15, 2012 at 9:21 AM, Alain RODRIGUEZ wrote: > We had an issue with counters over-counting even using the nodetool drain > command before upgrading... > > Here is my bash history > >69 cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak >70 cp /etc/cassandra/cassandra-env.sh > /etc/cassandra/cassandra-env.sh.bak >71 sudo apt-get install cassandra >72 nodetool disablethrift >73 nodetool drain >74 service cassandra stop >75 cat /etc/cassandra/cassandra-env.sh > /etc/cassandra/cassandra-env.sh.bak >76 vim /etc/cassandra/cassandra-env.sh >77 cat /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak >78 vim /etc/cassandra/cassandra.yaml >79 service cassandra start > > So I think I followed these steps > http://www.datastax.com/docs/1.1/install/upgrading#upgrade-steps > > I merged my conf files with an external tool so consider I merged my conf > files on steps 76 and 78. > > I saw that the "sudo apt-get install cassandra" stop the server and > restart it automatically. So it updated without draining and restart before > I had the time to reconfigure the conf files. Is this "normal" ? Is there a > way to avoid it ? > > So for the second node I decided to try to stop C*before the upgrade. > > 125 cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak > 126 cp /etc/cassandra/cassandra-env.sh > /etc/cassandra/cassandra-env.sh.bak > 127 nodetool disablegossip > 128 nodetool disablethrift > 129 nodetool drain > 130 service cassandra stop > 131 sudo apt-get install cassandra > > //131 : This restarted cassandra > > 132 nodetool disablethrift > 133 nodetool disablegossip > 134 nodetool drain > 135 service cassandra stop > 136 cat /etc/cassandra/cassandra-env.sh > /etc/cassandra/cassandra-env.sh.bak > 137 cim /etc/cassandra/cassandra-env.sh > 138 vim /etc/cassandra/cassandra-env.sh > 139 cat /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak > 140 vim /etc/cassandra/cassandra.yaml > 141 service cassandra start > > After both of these updates I saw my current counters increase without any > reason. > > Did I do anything wrong ? > > Alain > > -- Mike Heffner Librato, Inc.
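If you want to keep the package from auto-starting the daemon during the install, one workaround on Debian/Ubuntu is a temporary policy-rc.d. This is a general dpkg/invoke-rc.d mechanism rather than anything the Cassandra package itself documents, so use it at your own discretion:

    # tell invoke-rc.d to deny service starts while we upgrade
    printf '#!/bin/sh\nexit 101\n' | sudo tee /usr/sbin/policy-rc.d
    sudo chmod +x /usr/sbin/policy-rc.d
    sudo apt-get install cassandra
    # re-enable normal service handling afterwards
    sudo rm /usr/sbin/policy-rc.d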
Re: Upgrade 1.1.2 -> 1.1.6
Alain, My understanding is that drain ensures that all memtables are flushed, so that there is no data in the commitlog that is isn't in an sstable. A marker is saved that indicates the commit logs should not be replayed. Commitlogs are only removed from disk periodically (after commitlog_total_space_in_mb is exceeded?). With 1.1.5/6, all nanotime commitlogs are replayed on startup regardless of whether they've been flushed. So in our case manually removing all the commitlogs after a drain was the only way to prevent their replay. Mike On Tue, Nov 20, 2012 at 5:19 AM, Alain RODRIGUEZ wrote: > @Mike > > I am glad to see I am not the only one with this issue (even if I am sorry > it happened to you of course.). > > Isn't drain supposed to clear the commit logs ? Did removing them worked > properly ? > > I his warning to C* users, Jonathan Ellis told that a drain would avoid > this issue, It seems like it doesn't. > > @Rob > > You understood precisely the 2 issues I met during the upgrade. I am sad > to see none of them is yet resolved and probably wont. > > > 2012/11/20 Mike Heffner > >> Alain, >> >> We performed a 1.1.3 -> 1.1.6 upgrade and found that all the logs >> replayed regardless of the drain. After noticing this on the first node, we >> did the following: >> >> * nodetool flush >> * nodetool drain >> * service cassandra stop >> * mv /path/to/logs/*.log /backup/ >> * apt-get install cassandra >> >> >> I also agree that starting C* after an upgrade/install seems quite broken >> if it was already stopped before the install. However annoying, I have >> found this to be the default for most Ubuntu daemon packages. >> >> Mike >> >> >> On Thu, Nov 15, 2012 at 9:21 AM, Alain RODRIGUEZ wrote: >> >>> We had an issue with counters over-counting even using the nodetool >>> drain command before upgrading... >>> >>> Here is my bash history >>> >>>69 cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak >>>70 cp /etc/cassandra/cassandra-env.sh >>> /etc/cassandra/cassandra-env.sh.bak >>>71 sudo apt-get install cassandra >>>72 nodetool disablethrift >>>73 nodetool drain >>>74 service cassandra stop >>>75 cat /etc/cassandra/cassandra-env.sh >>> /etc/cassandra/cassandra-env.sh.bak >>>76 vim /etc/cassandra/cassandra-env.sh >>>77 cat /etc/cassandra/cassandra.yaml >>> /etc/cassandra/cassandra.yaml.bak >>>78 vim /etc/cassandra/cassandra.yaml >>>79 service cassandra start >>> >>> So I think I followed these steps >>> http://www.datastax.com/docs/1.1/install/upgrading#upgrade-steps >>> >>> I merged my conf files with an external tool so consider I merged my >>> conf files on steps 76 and 78. >>> >>> I saw that the "sudo apt-get install cassandra" stop the server and >>> restart it automatically. So it updated without draining and restart before >>> I had the time to reconfigure the conf files. Is this "normal" ? Is there a >>> way to avoid it ? >>> >>> So for the second node I decided to try to stop C*before the upgrade. 
>>> >>> 125 cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak >>> 126 cp /etc/cassandra/cassandra-env.sh >>> /etc/cassandra/cassandra-env.sh.bak >>> 127 nodetool disablegossip >>> 128 nodetool disablethrift >>> 129 nodetool drain >>> 130 service cassandra stop >>> 131 sudo apt-get install cassandra >>> >>> //131 : This restarted cassandra >>> >>> 132 nodetool disablethrift >>> 133 nodetool disablegossip >>> 134 nodetool drain >>> 135 service cassandra stop >>> 136 cat /etc/cassandra/cassandra-env.sh >>> /etc/cassandra/cassandra-env.sh.bak >>> 137 cim /etc/cassandra/cassandra-env.sh >>> 138 vim /etc/cassandra/cassandra-env.sh >>> 139 cat /etc/cassandra/cassandra.yaml >>> /etc/cassandra/cassandra.yaml.bak >>> 140 vim /etc/cassandra/cassandra.yaml >>> 141 service cassandra start >>> >>> After both of these updates I saw my current counters increase without >>> any reason. >>> >>> Did I do anything wrong ? >>> >>> Alain >>> >>> >> >> >> -- >> >> Mike Heffner >> Librato, Inc. >> >> >> > -- Mike Heffner Librato, Inc.
Re: Upgrade 1.1.2 -> 1.1.6
On Tue, Nov 20, 2012 at 2:49 PM, Rob Coli wrote: > On Mon, Nov 19, 2012 at 7:18 PM, Mike Heffner wrote: > > We performed a 1.1.3 -> 1.1.6 upgrade and found that all the logs > replayed > > regardless of the drain. > > Your experience and desire for different (expected) behavior is welcomed > on : > > https://issues.apache.org/jira/browse/CASSANDRA-4446 > > "nodetool drain sometimes doesn't mark commitlog fully flushed" > > If every production operator who experiences this issue shares their > experience on this bug, perhaps the project will acknowledge and > address it. > > Well in this case I think our issue was that upgrading from nanotime->epoch seconds, by definition, replays all commit logs. That's not due to any specific problem with nodetool drain not marking commitlog's flushed, but a safety to ensure data is not lost due to buggy nanotime implementations. For us, it was that the upgrade instructions pre-1.1.5->1.1.6 didn't mention that CL's should be removed if successfully drained. On the other hand, we do not use counters so replaying them was merely a much longer MTT-Return after restarting with 1.1.6. Mike -- Mike Heffner Librato, Inc.
Does a scrub remove deleted/expired columns?
I'm using 1.0.12 and I find that large sstables tend to get compacted infrequently. I've got data that gets deleted or expired frequently. Is it possible to use scrub to accelerate the clean up of expired/deleted data? -- Mike Smith Director Development, MailChannels
Re: Does a scrub remove deleted/expired columns?
Thanks for the great explanation. I'd just like some clarification on the last point. Is it the case that if I constantly add new columns to a row, while periodically trimming the row by by deleting the oldest columns, the deleted columns won't get cleaned up until all fragments of the row exist in a single sstable and that sstable undergoes a compaction? If my understanding is correct, do you know if 1.2 will enable cleanup of columns in rows that have scattered fragments? Or, should I take a different approach? On Thu, Dec 13, 2012 at 5:52 PM, aaron morton wrote: > Is it possible to use scrub to accelerate the clean up of expired/deleted > data? > > No. > Scrub, and upgradesstables, are used to re-write each file on disk. Scrub > may remove some rows from a file because of corruption, however > upgradesstables will not. > > If you have long lived rows and a mixed work load of writes and deletes > there are a couple of options. > > You can try levelled compaction > http://www.datastax.com/dev/blog/when-to-use-leveled-compaction > > You can tune the default sized tiered compaction by increasing the > min_compaction_threshold. This will increase the number of files that must > exist in each size tier before it will be compacted. As a result the speed > at which rows move into the higher tiers will slow down. > > Note that having lots of files may have a negative impact on read > performance. You can measure this my looking at the SSTables per read > metric in the cfhistograms. > > Lastly you can run a user defined or major compaction. User defined > compaction is available via JMX and allows you to compact any file you > want. Manual / major compaction is available via node tool. We usually > discourage it's use as it will create one big file that will not get > compacted for a while. > > > For background the tombstones / expired columns for a row are only purged > from the database when all fragments of the row are in the files been > compacted. So if you have an old row that is spread out over many files it > may not get purged. > > Hope that helps. > > > >- > Aaron Morton > Freelance Cassandra Developer > New Zealand > > @aaronmorton > http://www.thelastpickle.com > > On 14/12/2012, at 3:01 AM, Mike Smith wrote: > > I'm using 1.0.12 and I find that large sstables tend to get compacted > infrequently. I've got data that gets deleted or expired frequently. Is it > possible to use scrub to accelerate the clean up of expired/deleted data? > > -- > Mike Smith > Director Development, MailChannels > > > -- Mike Smith Director Development, MailChannels
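To make Aaron's suggestions concrete, this is roughly how I'd check the SSTables-per-read metric and raise the size-tiered threshold on 1.0.x. The keyspace/column family names are placeholders, and I'm going from memory on the cli syntax, so double-check before running it:

    nodetool -h localhost cfhistograms MyKeyspace MyCF
    # the "SSTables" column shows how many sstables each read touched

    # then, in cassandra-cli:
    use MyKeyspace;
    update column family MyCF with min_compaction_threshold = 8;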
CQL3 Blob Value Literal?
Does CQL3 support blob/BytesType literals for INSERT, UPDATE etc commands? I looked at the CQL3 syntax (http://cassandra.apache.org/doc/cql3/CQL.html) and at the DataStax 1.2 docs. As for why I'd want such a thing, I just wanted to initialize some test values for a blob column with cqlsh. Thanks!
Re: Node selection when both partition key and secondary index field constrained?
Thanks Aaron. So basically it's merging the results 2 separate queries: Indexed scan (token-range) intersect foo.flag_index=true where the latter query hits the entire cluster as per the secondary index FAQ entry. Thus the overall query would fail if LOCAL_QUORUM was requested, RF=3 and 2 nodes in a given replication group were down. Darn. Is there any way of efficiently getting around this (ie scope the query to just the nodes in the token range)? On Mon, Jan 28, 2013 at 11:44 AM, aaron morton wrote: > It uses the index... > > cqlsh:dev> tracing on; > Now tracing requests. > cqlsh:dev> > cqlsh:dev> > cqlsh:dev> SELECT id, flag from foo WHERE TOKEN(id) > '-9939393' AND > TOKEN(id) <= '0' AND flag=true; > > Tracing session: 128cab90-6982-11e2-8cd1-51eaa232562e > > activity | timestamp| > source| source_elapsed > > +--+---+ > execute_cql3_query | 08:36:55,244 | > 127.0.0.1 | 0 > Parsing statement | 08:36:55,244 | > 127.0.0.1 |600 > Peparing statement | 08:36:55,245 | > 127.0.0.1 | 1408 > Determining replicas to query | 08:36:55,246 | > 127.0.0.1 | 1924 > Executing indexed scan for (max(-9939393), max(0)] | 08:36:55,247 | > 127.0.0.1 | 2956 > Executing single-partition query on foo.flag_index | 08:36:55,247 | > 127.0.0.1 | 3192 >Acquiring sstable references | 08:36:55,247 | > 127.0.0.1 | 3220 > Merging memtable contents | 08:36:55,247 | > 127.0.0.1 | 3265 >Scanned 0 rows and matched 0 | 08:36:55,247 | > 127.0.0.1 | 3396 >Request complete | 08:36:55,247 | > 127.0.0.1 | 3644 > > > It reads from the secondary index and discards keys that are outside of > the token range. > > Cheers > > > - > Aaron Morton > Freelance Cassandra Developer > New Zealand > > @aaronmorton > http://www.thelastpickle.com > > On 28/01/2013, at 4:24 PM, Mike Sample wrote: > > > Does the following FAQ entry hold even when the partion key is also > constrained in the query (by token())? > > > > http://wiki.apache.org/cassandra/SecondaryIndexes: > > == > >Q: How does choice of Consistency Level affect cluster availability > when using secondary indexes? > > > >A: Because secondary indexes are distributed, you must have CL nodes > available for all token ranges in the cluster in order to complete a query. > For example, with RF = 3, when two out of three consecutive nodes in the > ring are unavailable, all secondary index queries at CL = QUORUM will fail, > however secondary index queries at CL = ONE will succeed. This is true > regardless of cluster size." > > == > > > > For example: > > > > CREATE TABLE foo ( > > id uuid, > > seq_num bigint, > > flag boolean, > > some_other_data blob, > > PRIMARY KEY (id,seq_num) > > ); > > > > CREATE INDEX flag_index ON foo (flag); > > > > SELECT id, flag from foo WHERE TOKEN(id) > '-9939393' AND TOKEN(id) <= > '0' AND flag=true; > > > > Would the above query with LOCAL_QUORUM succeed given the following? IE > is the token range used first trim node selection? > > > > * the cluster has 18 nodes > > * foo is in a keyspace with a replication factor of 3 for that data > center > > * 2 nodes in one of the replication groups are down > > * the token range in the query is not in the range of the down nodes > > > > > > Thanks in advance! > >
CQL3 PreparedStatement - parameterized timestamp
Is there a way to re-use a prepared statement with different "using timestamp" values? BEGIN BATCH USING INSERT INTO Foo (a,b,c) values (?,?,?) ... APPLY BATCH; Once bound or while binding the prepared statement to specific values, I'd like to set the timestamp value. Putting a question mark in for timestamp failed as expected and I don't see a method on the DataStax java driver BoundStatement for setting it. Thanks in advance. /Mike Sample
Re: CQL3 PreparedStatement - parameterized timestamp
Thanks Sylvain. I should have scanned Jira first. Glad to see it's on the todo list. On Wed, Feb 6, 2013 at 12:24 AM, Sylvain Lebresne wrote: > Not yet: https://issues.apache.org/jira/browse/CASSANDRA-4450 > > -- > Sylvain > > > On Wed, Feb 6, 2013 at 9:06 AM, Mike Sample wrote: > >> Is there a way to re-use a prepared statement with different "using >> timestamp" values? >> >> BEGIN BATCH USING >> INSERT INTO Foo (a,b,c) values (?,?,?) >> ... >> APPLY BATCH; >> >> Once bound or while binding the prepared statement to specific values, >> I'd like to set the timestamp value. >> >> Putting a question mark in for timestamp failed as expected and I don't >> see a method on the DataStax java driver BoundStatement for setting it. >> >> Thanks in advance. >> >> /Mike Sample >> > >
backing up and restoring from only 1 replica?
It has been suggested to me that we could save a fair amount of time and money by taking a snapshot of only 1 replica (so every third node for most column families). Assuming that we are okay with not having the absolute latest data, does this have any possibility of working? I feel like it shouldn't but don't really know the argument for why it wouldn't.
Re: backing up and restoring from only 1 replica?
Thanks for the response. Could you elaborate more on the bad things that happen during a restart or message drops that would cause a 1 replica restore to fail? I'm completely on board with not using a restore process that nobody else uses, but I need to convince somebody else who thinks that it will work that it is not a good idea. On 3/4/2013 7:54 AM, aaron morton wrote: That would be OK only if you never had node go down (e.g. a restart) or drop messages. It's not something I would consider trying. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 28/02/2013, at 3:21 PM, Mike Koh wrote: It has been suggested to me that we could save a fair amount of time and money by taking a snapshot of only 1 replica (so every third node for most column families). Assuming that we are okay with not having the absolute latest data, does this have any possibility of working? I feel like it shouldn't but don't really know the argument for why it wouldn't.
changing compaction strategy
I'm trying to change compaction strategy one node at a time. I'm using jmxterm like this: `echo 'set -b org.apache.cassandra.db:type=ColumnFamilies,keyspace=my_ks,columnfamily=my_cf CompactionParametersJson \{"class":"TimeWindowCompactionStrategy","compaction_window_unit":"HOURS","compaction_window_size":"6"\}' | java -jar jmxterm-1.0-alpha-4-uber.jar --url localhost:7199` and I see this in the cassandra logs: INFO [RMI TCP Connection(37)-127.0.0.1] 2017-03-13 20:29:08,251 CompactionStrategyManager.java:841 - Switching local compaction strategy from CompactionParams{class=org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy, options={max_threshold=32, min_threshold=4}} to CompactionParams{class=org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy, options={min_threshold=4, max_threshold=32, compaction_window_unit=HOURS, compaction_window_size=6}}} After doing this, `nodetool compactionstats` shows 1 pending compaction, but none running. Also, cqlsh describe shows the old compaction strategy still. Am I missing a step?
Re: changing compaction strategy
Some more info: - running C* 3.9 - I tried `nodetool flush` on the column family this change applies to, and while it does seem to trigger compactions, there is still one pending that won't seem to run - I tried `nodetool compact` on the column family as well, with a similar affect Is there a way to tell when/if the local node has successfully updated the compaction strategy? Looking at the sstable files, it seems like they are still based on STCS but I don't know how to be sure. Appreciate any tips or suggestions! On Mon, Mar 13, 2017 at 5:30 PM, Mike Torra wrote: > I'm trying to change compaction strategy one node at a time. I'm using > jmxterm like this: > > `echo 'set -b > org.apache.cassandra.db:type=ColumnFamilies,keyspace=my_ks,columnfamily=my_cf > CompactionParametersJson \{"class":"TimeWindowCompactionStrategy", > "compaction_window_unit":"HOURS","compaction_window_size":"6"\}' | java > -jar jmxterm-1.0-alpha-4-uber.jar --url localhost:7199` > > and I see this in the cassandra logs: > > INFO [RMI TCP Connection(37)-127.0.0.1] 2017-03-13 20:29:08,251 > CompactionStrategyManager.java:841 - Switching local compaction strategy > from > CompactionParams{class=org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy, > options={max_threshold=32, min_threshold=4}} to CompactionParams{class=org. > apache.cassandra.db.compaction.TimeWindowCompactionStrategy, > options={min_threshold=4, max_threshold=32, compaction_window_unit=HOURS, > compaction_window_size=6}}} > > After doing this, `nodetool compactionstats` shows 1 pending compaction, > but none running. Also, cqlsh describe shows the old compaction strategy > still. Am I missing a step? >
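In case it helps, one way to see what the node itself thinks its local strategy is: read the same attribute back through jmxterm (same jar and MBean as the set command above):

    echo 'get -b org.apache.cassandra.db:type=ColumnFamilies,keyspace=my_ks,columnfamily=my_cf CompactionParametersJson' | \
        java -jar jmxterm-1.0-alpha-4-uber.jar --url localhost:7199

As far as I understand it, the JMX override is local-only and does not touch the schema, so cqlsh DESCRIBE will keep showing the old strategy either way.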
sstableloader limitations in multi-dc cluster
I'm trying to use sstableloader to bulk load some data to my 4 DC cluster, and I can't quite get it to work. Here is how I'm trying to run it: sstableloader -d 127.0.0.1 -i {csv list of private ips of nodes in cluster} myks/mttest At first this seems to work, with a steady stream of logging like this (eventually getting to 100%): progress: [/10.0.1.225]0:13/13 100% [/10.0.0.134]0:13/13 100% [/10.0.0.119]0:13/13 100% [/10.0.1.26]0:13/13 100% [/10.0.3.188]0:13/13 100% [/10.0.3.189]0:13/13 100% [/10.0.2.95]0:13/13 100% total: 100% 0.000KiB/s (avg: 13.857MiB/s) There will be some errors sprinkled in like this: ERROR 15:35:43 [Stream #707f0920-5760-11e7-8ede-37de75ac1efa] Streaming error occurred on session with peer 10.0.2.9 java.net.NoRouteToHostException: No route to host Then, at the end, there will be one last warning about the failed streams: WARN 15:38:03 [Stream #707f0920-5760-11e7-8ede-37de75ac1efa] Stream failed Streaming to the following hosts failed: [/127.0.0.1, {list of same private ips as above}] I am perplexed about the failures because I am trying to explicitly ignore the nodes in remote DC's via the -i option to sstableloader. Why doesn't this work? I've tried using the public IP's instead just for kicks, but that doesn't change anything. I don't see anything helpful in the cassandra logs (including debug logs). Also, why is localhost in the list of failures? I can query the data locally after the sstableloader command completes. I've also noticed that sstableloader fails completely (even locally) while I am decomissioning or bootstrapping a node in a remote DC. Is this a limitation of sstableloader? I haven't been able to find documentation about this.
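Given the NoRouteToHostException, it may be worth ruling out plain connectivity from the loader host to the failing peers on the storage port (7000 by default, assuming storage_port hasn't been changed), since as I understand it sstableloader streams directly to the owning nodes:

    nc -zv 10.0.2.9 7000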
WG: How to sort result-set? / How to proper model a table?
Hey everyone, I'm new to Cassandra, taking my first steps, and I have a problem/question regarding sorting results and proper data modelling. First of all, I read the article "We Shall Have Order!" by Aaron Ploetz (1) to get a first view of how Cassandra works. I reproduced the example in the article with my own table.

DROP TABLE sensors;
CREATE TABLE sensors (
  timestamp BIGINT,
  name VARCHAR,
  value VARCHAR,
  unit VARCHAR,
  PRIMARY KEY (name, timestamp)
) WITH gc_grace_seconds = 0
  AND CLUSTERING ORDER BY (timestamp DESC);

I'm actually running Cassandra on a single node ([cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]).

Now some background information about my project: I want to store all kinds of measuring data from all kinds of sensors. It doesn't matter whether the sensor is measuring a temperature, water flow, or whatever. Sensors always give a single value. Interpretation has to be done afterwards by the user. So in my example, I'm measuring temperatures in my house, which leads me to the following data:

timestamp            name         value  unit
2017-07-24 14-11-00  entrance-a   20     Celsius
2017-07-24 14-11-04  living-room  24     Celsius
2017-07-24 14-11-07  bath-room    22     Celsius
2017-07-24 14-11-15  bed-room     23     Celsius
2017-07-24 14-11-22  entrance-b   20     Celsius

I'm measuring time-triggered every 15 minutes. In order to have some kind of start and end for each process, I decided to measure the entrance twice with differently named sensors (entrance a and b). So the above is one set of measuring data, created by a single process. I'd say this is just another perfect example of what Aaron Ploetz describes in his article. When I query Cassandra, the result set is not sorted by timestamp as long as I don't use the primary key in my WHERE clause.

When I ask myself "What will I query Cassandra for?" I always come up with the same typical thoughts:

* LIST all measurements in a specific timespan ORDERED BY timestamp ASC/DESC
  - Requires ALLOW FILTERING
  - Won't be sorted
* LIST all measurements for a specific sensor ORDERED BY timestamp ASC/DESC
  - Sorted result. OK.
* And stuff the future will bring which I simply don't know now.

So for querying Cassandra for measurements in a specific timespan I can't find a solid solution. My first idea was:

* Add a column "sequence" which can be used to bundle a set of measurements

DROP TABLE sensors;
CREATE TABLE sensors (
  timestamp BIGINT,
  name VARCHAR,
  value VARCHAR,
  unit VARCHAR,
  sequence INT,
  PRIMARY KEY (sequence, timestamp)
) WITH gc_grace_seconds = 0
  AND CLUSTERING ORDER BY (timestamp DESC);

  - I won't need to measure the entrance twice
  - I can query for a timespan as long as the timespan is within a sequence.
  - But when I query a timespan containing more than a single sequence, the result set is not correctly sorted again:

sequence  timestamp            name         value  unit
123       2017-07-24 14-11-22  entrance-b   20     Celsius
123       2017-07-24 14-11-15  bed-room     23     Celsius
123       2017-07-24 14-11-07  bath-room    22     Celsius
123       2017-07-24 14-11-04  living-room  24     Celsius
123       2017-07-24 14-11-00  entrance-a   20     Celsius
124       2017-07-24 15-11-22  entrance-b   22     Celsius
124       2017-07-24 15-11-15  bed-room     25     Celsius
124       2017-07-24 15-11-07  bath-room    24     Celsius
124       2017-07-24 15-11-04  living-room  26     Celsius
124       2017-07-24 15-11-00  entrance-a   22     Celsius

  - Besides: it's not recommended to use a "dummy" column, especially not as a primary or clustering key.

How to solve this problem? I believe I can't be the only one who has this requirement. Imho "sort it on the client side" can't be the solution.
As soon as the data gets bigger we simply can't "just" sort on the client side. So my next idea was to use the table as the overall data store, create another table, and periodically transfer data from the main table to the child table. But I believe I'll get the same problem, because Cassandra simply doesn't sort the way an RDBMS does. So there must be an idea behind Cassandra's philosophy here. Can anyone help me out? Best regards Mike Wenzel (1) https://www.datastax.com/dev/blog/we-shall-have-order
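For what it's worth, with the second table above the kind of query that does come back sorted is one that pins the partition (sequence) and ranges over the clustering column. The values below are placeholders, since the real timestamps are bigints:

    SELECT * FROM sensors
    WHERE sequence = 123
      AND timestamp >= 1500905460
      AND timestamp <= 1500905482;

    -- within a single partition the order can also be flipped explicitly
    SELECT * FROM sensors
    WHERE sequence = 123
    ORDER BY timestamp ASC;

As soon as more than one sequence is involved, rows only come back sorted per partition, which is exactly the problem described above.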
node restart causes application latency
Hi - I am running a 29 node cluster spread over 4 DC's in EC2, using C* 3.11.1 on Ubuntu. Occasionally I have the need to restart nodes in the cluster, but every time I do, I see errors and application (nodejs) timeouts. I restart a node like this: nodetool disablethrift && nodetool disablegossip && nodetool drain sudo service cassandra restart When I do that, I very often get timeouts and errors like this in my nodejs app: Error: Cannot achieve consistency level LOCAL_ONE My queries are all pretty much the same, things like: "select * from history where ts > {current_time}" The errors and timeouts seem to go away on their own after a while, but it is frustrating because I can't track down what I am doing wrong! I've tried waiting between steps of shutting down cassandra, and I've tried stopping, waiting, then starting the node. One thing I've noticed is that even after `nodetool drain`ing the node, there are open connections to other nodes in the cluster (ie looking at the output of netstat) until I stop cassandra. I don't see any errors or warnings in the logs. What can I do to prevent this? Is there something else I should be doing to gracefully restart the cluster? It could be something to do with the nodejs driver, but I can't find anything there to try. I appreciate any suggestions or advice. - Mike
Re: node restart causes application latency
Thanks for the feedback guys. That example data model was indeed abbreviated - the real queries have the partition key in them. I am using RF 3 on the keyspace, so I don't think a node being down would mean the key I'm looking for would be unavailable. The load balancing policy of the driver seems correct ( https://docs.datastax.com/en/developer/nodejs-driver/3.4/features/tuning-policies/#load-balancing-policy, and I am using the default `TokenAware` policy with `DCAwareRoundRobinPolicy` as a child), but I will look more closely at the implementation. It was an oversight of mine to not include `nodetool disablebinary`, but I still experience the same issue with that. One other thing I've noticed is that after restarting a node and seeing application latency, I also see that the node I just restarted sees many other nodes in the same DC as being down (ie status 'DN'). However, checking `nodetool status` on those other nodes shows all nodes as up/normal. To me this could kind of explain the problem - node comes back online, thinks it is healthy but many others are not, so it gets traffic from the client application. But then it gets requests for ranges that belong to a node it thinks is down, so it responds with an error. The latency issue seems to start roughly when the node goes down, but persists long (ie 15-20 mins) after it is back online and accepting connections. It seems to go away once the bounced node shows the other nodes in the same DC as up again. As for speculative retry, my CF is using the default of '99th percentile'. I could try something different there, but nodes being seen as down seems like an issue. On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa wrote: > Unless you abbreviated, your data model is questionable (SELECT without > any equality in the WHERE clause on the partition key will always cause a > range scan, which is super inefficient). Since you're doing LOCAL_ONE and a > range scan, timeouts sorta make sense - the owner of at least one range > would be down for a bit. > > If you actually have a partition key in your where clause, then the next > most likely guess is your clients aren't smart enough to route around the > node as it restarts, or your key cache is getting cold during the bounce. > Double check your driver's load balancing policy. > > It's also likely the case that speculative retry may help other nodes > route around the bouncing instance better - if you're not using it, you > probably should be (though with CL: LOCAL_ONE, it seems like it'd be less > of an issue). > > We need to make bouncing nodes easier (or rather, we need to make drain do > the right thing), but in this case, your data model looks like the biggest > culprit (unless it's an incomplete recreation). > > - Jeff > > > On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra wrote: > >> Hi - >> >> I am running a 29 node cluster spread over 4 DC's in EC2, using C* 3.11.1 >> on Ubuntu. Occasionally I have the need to restart nodes in the cluster, >> but every time I do, I see errors and application (nodejs) timeouts. 
>> >> I restart a node like this: >> >> nodetool disablethrift && nodetool disablegossip && nodetool drain >> sudo service cassandra restart >> >> When I do that, I very often get timeouts and errors like this in my >> nodejs app: >> >> Error: Cannot achieve consistency level LOCAL_ONE >> >> My queries are all pretty much the same, things like: "select * from >> history where ts > {current_time}" >> >> The errors and timeouts seem to go away on their own after a while, but >> it is frustrating because I can't track down what I am doing wrong! >> >> I've tried waiting between steps of shutting down cassandra, and I've >> tried stopping, waiting, then starting the node. One thing I've noticed is >> that even after `nodetool drain`ing the node, there are open connections to >> other nodes in the cluster (ie looking at the output of netstat) until I >> stop cassandra. I don't see any errors or warnings in the logs. >> >> What can I do to prevent this? Is there something else I should be doing >> to gracefully restart the cluster? It could be something to do with the >> nodejs driver, but I can't find anything there to try. >> >> I appreciate any suggestions or advice. >> >> - Mike >> > >
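Regarding speculative retry, if you do want to experiment with it, it is just a table property. The keyspace name and values below are examples, not recommendations:

    ALTER TABLE my_ks.history WITH speculative_retry = 'ALWAYS';
    -- or a fixed latency threshold instead of a percentile:
    ALTER TABLE my_ks.history WITH speculative_retry = '50ms';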
Re: node restart causes application latency
No, I am not On Wed, Feb 7, 2018 at 11:35 AM, Jeff Jirsa wrote: > Are you using internode ssl? > > > -- > Jeff Jirsa > > > On Feb 7, 2018, at 8:24 AM, Mike Torra wrote: > > Thanks for the feedback guys. That example data model was indeed > abbreviated - the real queries have the partition key in them. I am using > RF 3 on the keyspace, so I don't think a node being down would mean the key > I'm looking for would be unavailable. The load balancing policy of the > driver seems correct (https://docs.datastax.com/en/ > developer/nodejs-driver/3.4/features/tuning-policies/# > load-balancing-policy, and I am using the default `TokenAware` policy > with `DCAwareRoundRobinPolicy` as a child), but I will look more closely at > the implementation. > > It was an oversight of mine to not include `nodetool disablebinary`, but I > still experience the same issue with that. > > One other thing I've noticed is that after restarting a node and seeing > application latency, I also see that the node I just restarted sees many > other nodes in the same DC as being down (ie status 'DN'). However, > checking `nodetool status` on those other nodes shows all nodes as > up/normal. To me this could kind of explain the problem - node comes back > online, thinks it is healthy but many others are not, so it gets traffic > from the client application. But then it gets requests for ranges that > belong to a node it thinks is down, so it responds with an error. The > latency issue seems to start roughly when the node goes down, but persists > long (ie 15-20 mins) after it is back online and accepting connections. It > seems to go away once the bounced node shows the other nodes in the same DC > as up again. > > As for speculative retry, my CF is using the default of '99th percentile'. > I could try something different there, but nodes being seen as down seems > like an issue. > > On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa wrote: > >> Unless you abbreviated, your data model is questionable (SELECT without >> any equality in the WHERE clause on the partition key will always cause a >> range scan, which is super inefficient). Since you're doing LOCAL_ONE and a >> range scan, timeouts sorta make sense - the owner of at least one range >> would be down for a bit. >> >> If you actually have a partition key in your where clause, then the next >> most likely guess is your clients aren't smart enough to route around the >> node as it restarts, or your key cache is getting cold during the bounce. >> Double check your driver's load balancing policy. >> >> It's also likely the case that speculative retry may help other nodes >> route around the bouncing instance better - if you're not using it, you >> probably should be (though with CL: LOCAL_ONE, it seems like it'd be less >> of an issue). >> >> We need to make bouncing nodes easier (or rather, we need to make drain >> do the right thing), but in this case, your data model looks like the >> biggest culprit (unless it's an incomplete recreation). >> >> - Jeff >> >> >> On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra >> wrote: >> >>> Hi - >>> >>> I am running a 29 node cluster spread over 4 DC's in EC2, using C* >>> 3.11.1 on Ubuntu. Occasionally I have the need to restart nodes in the >>> cluster, but every time I do, I see errors and application (nodejs) >>> timeouts. 
>>> >>> I restart a node like this: >>> >>> nodetool disablethrift && nodetool disablegossip && nodetool drain >>> sudo service cassandra restart >>> >>> When I do that, I very often get timeouts and errors like this in my >>> nodejs app: >>> >>> Error: Cannot achieve consistency level LOCAL_ONE >>> >>> My queries are all pretty much the same, things like: "select * from >>> history where ts > {current_time}" >>> >>> The errors and timeouts seem to go away on their own after a while, but >>> it is frustrating because I can't track down what I am doing wrong! >>> >>> I've tried waiting between steps of shutting down cassandra, and I've >>> tried stopping, waiting, then starting the node. One thing I've noticed is >>> that even after `nodetool drain`ing the node, there are open connections to >>> other nodes in the cluster (ie looking at the output of netstat) until I >>> stop cassandra. I don't see any errors or warnings in the logs. >>> >>> What can I do to prevent this? Is there something else I should be doing >>> to gracefully restart the cluster? It could be something to do with the >>> nodejs driver, but I can't find anything there to try. >>> >>> I appreciate any suggestions or advice. >>> >>> - Mike >>> >> >> >
Re: node restart causes application latency
Any other ideas? If I simply stop the node, there is no latency problem, but once I start the node the problem appears. This happens consistently for all nodes in the cluster On Wed, Feb 7, 2018 at 11:36 AM, Mike Torra wrote: > No, I am not > > On Wed, Feb 7, 2018 at 11:35 AM, Jeff Jirsa wrote: > >> Are you using internode ssl? >> >> >> -- >> Jeff Jirsa >> >> >> On Feb 7, 2018, at 8:24 AM, Mike Torra wrote: >> >> Thanks for the feedback guys. That example data model was indeed >> abbreviated - the real queries have the partition key in them. I am using >> RF 3 on the keyspace, so I don't think a node being down would mean the key >> I'm looking for would be unavailable. The load balancing policy of the >> driver seems correct (https://docs.datastax.com/en/ >> developer/nodejs-driver/3.4/features/tuning-policies/#load- >> balancing-policy, and I am using the default `TokenAware` policy with >> `DCAwareRoundRobinPolicy` as a child), but I will look more closely at the >> implementation. >> >> It was an oversight of mine to not include `nodetool disablebinary`, but >> I still experience the same issue with that. >> >> One other thing I've noticed is that after restarting a node and seeing >> application latency, I also see that the node I just restarted sees many >> other nodes in the same DC as being down (ie status 'DN'). However, >> checking `nodetool status` on those other nodes shows all nodes as >> up/normal. To me this could kind of explain the problem - node comes back >> online, thinks it is healthy but many others are not, so it gets traffic >> from the client application. But then it gets requests for ranges that >> belong to a node it thinks is down, so it responds with an error. The >> latency issue seems to start roughly when the node goes down, but persists >> long (ie 15-20 mins) after it is back online and accepting connections. It >> seems to go away once the bounced node shows the other nodes in the same DC >> as up again. >> >> As for speculative retry, my CF is using the default of '99th >> percentile'. I could try something different there, but nodes being seen as >> down seems like an issue. >> >> On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa wrote: >> >>> Unless you abbreviated, your data model is questionable (SELECT without >>> any equality in the WHERE clause on the partition key will always cause a >>> range scan, which is super inefficient). Since you're doing LOCAL_ONE and a >>> range scan, timeouts sorta make sense - the owner of at least one range >>> would be down for a bit. >>> >>> If you actually have a partition key in your where clause, then the next >>> most likely guess is your clients aren't smart enough to route around the >>> node as it restarts, or your key cache is getting cold during the bounce. >>> Double check your driver's load balancing policy. >>> >>> It's also likely the case that speculative retry may help other nodes >>> route around the bouncing instance better - if you're not using it, you >>> probably should be (though with CL: LOCAL_ONE, it seems like it'd be less >>> of an issue). >>> >>> We need to make bouncing nodes easier (or rather, we need to make drain >>> do the right thing), but in this case, your data model looks like the >>> biggest culprit (unless it's an incomplete recreation). >>> >>> - Jeff >>> >>> >>> On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra >>> wrote: >>> >>>> Hi - >>>> >>>> I am running a 29 node cluster spread over 4 DC's in EC2, using C* >>>> 3.11.1 on Ubuntu. 
Occasionally I have the need to restart nodes in the >>>> cluster, but every time I do, I see errors and application (nodejs) >>>> timeouts. >>>> >>>> I restart a node like this: >>>> >>>> nodetool disablethrift && nodetool disablegossip && nodetool drain >>>> sudo service cassandra restart >>>> >>>> When I do that, I very often get timeouts and errors like this in my >>>> nodejs app: >>>> >>>> Error: Cannot achieve consistency level LOCAL_ONE >>>> >>>> My queries are all pretty much the same, things like: "select * from >>>> history where ts > {current_time}" >>>> >>>> The errors and timeouts seem to go away on their own after a while, but >>>> it is frustrating because I can't track down what I am doing wrong! >>>> >>>> I've tried waiting between steps of shutting down cassandra, and I've >>>> tried stopping, waiting, then starting the node. One thing I've noticed is >>>> that even after `nodetool drain`ing the node, there are open connections to >>>> other nodes in the cluster (ie looking at the output of netstat) until I >>>> stop cassandra. I don't see any errors or warnings in the logs. >>>> >>>> What can I do to prevent this? Is there something else I should be >>>> doing to gracefully restart the cluster? It could be something to do with >>>> the nodejs driver, but I can't find anything there to try. >>>> >>>> I appreciate any suggestions or advice. >>>> >>>> - Mike >>>> >>> >>> >> >
Re: node restart causes application latency
Interestingly, it seems that changing the order of steps I take during the node restart resolves the problem. Instead of: `nodetool disablebinary && nodetool disablethrift && *nodetool disablegossip* && nodetool drain && sudo service cassandra restart`, if I do: `nodetool disablebinary && nodetool disablethrift && nodetool drain && *nodetool disablegossip* && sudo service cassandra restart`, I see no application errors, no latency, and no nodes marked as Down/Normal on the restarted node. Note the only thing I changed is that I moved `nodetool disablegossip` to after `nodetool drain`. This is pretty anecdotal, but is there any explanation for why this might happen? I'll be monitoring my cluster closely to see if this change does indeed fix the problem. On Mon, Feb 12, 2018 at 9:33 AM, Mike Torra wrote: > Any other ideas? If I simply stop the node, there is no latency problem, > but once I start the node the problem appears. This happens consistently > for all nodes in the cluster > > On Wed, Feb 7, 2018 at 11:36 AM, Mike Torra wrote: > >> No, I am not >> >> On Wed, Feb 7, 2018 at 11:35 AM, Jeff Jirsa wrote: >> >>> Are you using internode ssl? >>> >>> >>> -- >>> Jeff Jirsa >>> >>> >>> On Feb 7, 2018, at 8:24 AM, Mike Torra wrote: >>> >>> Thanks for the feedback guys. That example data model was indeed >>> abbreviated - the real queries have the partition key in them. I am using >>> RF 3 on the keyspace, so I don't think a node being down would mean the key >>> I'm looking for would be unavailable. The load balancing policy of the >>> driver seems correct (https://docs.datastax.com/en/ >>> developer/nodejs-driver/3.4/features/tuning-policies/#load-b >>> alancing-policy, and I am using the default `TokenAware` policy with >>> `DCAwareRoundRobinPolicy` as a child), but I will look more closely at the >>> implementation. >>> >>> It was an oversight of mine to not include `nodetool disablebinary`, but >>> I still experience the same issue with that. >>> >>> One other thing I've noticed is that after restarting a node and seeing >>> application latency, I also see that the node I just restarted sees many >>> other nodes in the same DC as being down (ie status 'DN'). However, >>> checking `nodetool status` on those other nodes shows all nodes as >>> up/normal. To me this could kind of explain the problem - node comes back >>> online, thinks it is healthy but many others are not, so it gets traffic >>> from the client application. But then it gets requests for ranges that >>> belong to a node it thinks is down, so it responds with an error. The >>> latency issue seems to start roughly when the node goes down, but persists >>> long (ie 15-20 mins) after it is back online and accepting connections. It >>> seems to go away once the bounced node shows the other nodes in the same DC >>> as up again. >>> >>> As for speculative retry, my CF is using the default of '99th >>> percentile'. I could try something different there, but nodes being seen as >>> down seems like an issue. >>> >>> On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa wrote: >>> >>>> Unless you abbreviated, your data model is questionable (SELECT without >>>> any equality in the WHERE clause on the partition key will always cause a >>>> range scan, which is super inefficient). Since you're doing LOCAL_ONE and a >>>> range scan, timeouts sorta make sense - the owner of at least one range >>>> would be down for a bit. 
>>>> >>>> If you actually have a partition key in your where clause, then the >>>> next most likely guess is your clients aren't smart enough to route around >>>> the node as it restarts, or your key cache is getting cold during the >>>> bounce. Double check your driver's load balancing policy. >>>> >>>> It's also likely the case that speculative retry may help other nodes >>>> route around the bouncing instance better - if you're not using it, you >>>> probably should be (though with CL: LOCAL_ONE, it seems like it'd be less >>>> of an issue). >>>> >>>> We need to make bouncing nodes easier (or rather, we need to make drain >>>> do the right thing), but in this case, your data model looks like the >>>> biggest culprit (unless it's an incomplete recreation).
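For anyone hitting the same symptom, a minimal sketch of how you might check what the bounced node thinks of its peers and experiment with a more aggressive speculative retry. The keyspace and table names (mykeyspace.history) are placeholders, and '50ms' is only an example value, not a recommendation.

# Run on the restarted node: does it still see peers in the local DC as DN?
nodetool status
nodetool gossipinfo
# Optionally try a more aggressive speculative retry on the affected table
cqlsh -e "ALTER TABLE mykeyspace.history WITH speculative_retry = '50ms';"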
Re: node restart causes application latency
Then could it be that calling `nodetool drain` after calling `nodetool disablegossip` is what causes the problem? On Mon, Feb 12, 2018 at 6:12 PM, kurt greaves wrote: > > Actually, it's not really clear to me why disablebinary and thrift are > necessary prior to drain, because they happen in the same order during > drain anyway. It also really doesn't make sense that disabling gossip after > drain would make a difference here, because it should be already stopped. > This is all assuming drain isn't erroring out. >
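For reference, a minimal sketch of the reordered sequence being discussed; the service name and the pause length are assumptions about the environment, and per Kurt's note the disablegossip step may be redundant once drain has completed.

nodetool disablebinary
nodetool disablethrift
nodetool drain
sleep 10                 # let in-flight requests settle and peers notice
nodetool disablegossip   # possibly redundant after drain, kept for parity with the working sequence
sudo service cassandra restart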
Re: Slow bulk loading
It sounds as though you could be having trouble with garbage collection. Check your Cassandra system logs and search for "GC". If you see frequent garbage collections taking more than a second or two to complete, you're going to need to do some configuration tweaking. On 05/07/2015 04:44 AM, Pierre Devops wrote: Hi, I'm streaming a big sstable using the sstableloader bulk loader, but it's very slow (3 MB/sec): Summary statistics: Connections per host: 1 Total files transferred: 1 Total bytes transferred: 10357947484 Total duration (ms): 3280229 Average transfer rate (MB/s): 3 Peak transfer rate (MB/s): 3 I'm on a single-node configuration, with an empty keyspace and table, and good hardware (8x2.8GHz, 32GB RAM) dedicated to Cassandra, so there are plenty of resources for the process. I'm uploading from another server. The sstable is 9GB in size and has 4 partitions, but a lot of rows per partition (around 100 million); the clustering key is an INT and there are 4 other regular columns, so approximately 500 million cells per ColumnFamily. When I upload, I notice one core of the Cassandra node is at full CPU (all other cores are idling), so I assume I'm CPU bound on the node side. But why? What is the node doing? Why does it take so long? -- Mike Neir Liquid Web, Inc. Infrastructure Administrator
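A quick way to follow up on the GC suggestion above, plus a look at what the receiving node is busy with while the stream runs; the log path is an assumption based on the default package layout.

grep GCInspector /var/log/cassandra/system.log | tail -n 20   # long or frequent pauses?
nodetool netstats          # progress of the incoming stream
nodetool compactionstats   # is the node also busy compacting?
nodetool tpstats           # any saturated thread pools?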
Counters 2.1 Accuracy
Hi All, I'm fairly new to Cassandra and am planning on using it as a datastore for an Apache Spark cluster. The use case is fairly simple: read the raw data, perform aggregates, and push the rolled-up data back to Cassandra. The data models will use counters pretty heavily, so I'd like to understand what kind of accuracy I should expect from Cassandra 2.1 when incrementing counters. - http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters The blog post above states that the new counter implementation is "safer", although I'm not sure what that means in practice. Will the counters be 99.99% accurate? How often will they be over- or under-counted? Thanks, Mike.
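As a small illustration of the pattern in question (the rollups keyspace and page_hits table are made-up names): as I understand it, the 2.1 rewrite is mainly about removing internal sources of overcount in the old shard design, but increments are still not idempotent, so a client that blindly retries an increment after a timeout can still over- or under-count.

cqlsh -e "
CREATE KEYSPACE IF NOT EXISTS rollups WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE IF NOT EXISTS rollups.page_hits (page text PRIMARY KEY, hits counter);
UPDATE rollups.page_hits SET hits = hits + 1 WHERE page = 'home';
SELECT page, hits FROM rollups.page_hits WHERE page = 'home';"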
making sense of output from Eclipse Memory Analyzer tool taken from .hprof file
I am investigating Java out-of-memory heap errors. I created an .hprof file and loaded it into the Eclipse Memory Analyzer Tool, which gave some "Problem Suspects". The first one looks like: One instance of "org.apache.cassandra.db.ColumnFamilyStore" loaded by "sun.misc.Launcher$AppClassLoader @ 0x613e1bdc8" occupies 984,094,664 (11.64%) bytes. The memory is accumulated in one instance of "org.apache.cassandra.db.DataTracker$View" loaded by "sun.misc.Launcher$AppClassLoader @ 0x613e1bdc8". If I click around into the verbiage, I believe I can pick out the name of a column family, but that is about it. Can someone explain what the above means in more detail and whether it is indicative of a problem? The next one looks like: java.lang.Thread @ 0x73e1f74c8 CompactionExecutor:158 - 839,225,000 (9.92%) bytes; java.lang.Thread @ 0x717f08178 MutationStage:31 - 809,909,192 (9.58%) bytes; java.lang.Thread @ 0x717f082c8 MutationStage:5 - 649,667,472 (7.68%) bytes; java.lang.Thread @ 0x717f083a8 MutationStage:21 - 498,081,544 (5.89%) bytes; java.lang.Thread @ 0x71b357e70 MutationStage:11 - 444,931,288 (5.26%) bytes. If I click into the verbiage, the above CompactionExecutor and MutationStage threads all seem to be referencing the same column family. Are they related? Is there a way to tell more specifically what is being compacted and/or mutated, beyond which column family it belongs to?
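On the "what exactly is being compacted or mutated" question, the heap dump mostly tells you which ColumnFamilyStore is holding memory after the fact; nodetool can usually answer the runtime half directly. A small sketch, assuming the usual tools are on the path and the process can be found with pgrep:

nodetool compactionstats   # which keyspace/column family each running compaction belongs to, and progress
nodetool tpstats           # pending/active counts for MutationStage, CompactionExecutor, etc.
nodetool cfstats           # per-column-family memtable and sstable sizes, to correlate with the MAT suspects
jmap -dump:format=b,file=/tmp/cassandra-heap.hprof "$(pgrep -f CassandraDaemon)"   # fresh heap dump if needed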
Re: Is there any open source software for automatized deploy C* in PRD?
Hi Boole, Have you tried Chef? There is this cookbook for deploying Cassandra: http://community.opscode.com/cookbooks/cassandra MikeA On 21 November 2013 01:33, Boole.Z.Guo (mis.cnsh04.Newegg) 41442 <boole.z@newegg.com> wrote: > Hi all, > > Is there any open source software for automatized deploy C* in PRD? > > > > Best Regards, > > Boole Guo > > Software Engineer, NESC-SH.MIS > > +86-021-51530666*41442 > > Floor 19, KaiKai Plaza, 888, Wanhangdu Rd, Shanghai (200042) > > ONCE YOU KNOW, YOU NEWEGG. > > CONFIDENTIALITY NOTICE: This email and any files transmitted with it may > contain privileged or otherwise confidential information. It is intended > only for the person or persons to whom it is addressed. If you received > this message in error, you are not authorized to read, print, retain, copy, > disclose, disseminate, distribute, or use this message or any part thereof or > any information contained therein. Please notify the sender immediately and > delete all copies of this message. Thank you in advance for your > cooperation.
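A minimal sketch of pulling that cookbook onto a Chef 11-era workstation and assigning it to a node; the node name (cass-node-1) is a placeholder, and Berkshelf would be the other common way to manage the dependency.

knife cookbook site install cassandra          # fetch the community cookbook into your chef-repo
knife cookbook upload cassandra                # push it to the Chef server
knife node run_list add cass-node-1 'recipe[cassandra]'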
How to restart bootstrap after a failed streaming due to Broken Pipe (1.2.16)
Hi, During an attempt to bootstrap a new node into a 1.2.16 ring the new node saw one of the streaming nodes periodically disappear: INFO [GossipTasks:1] 2014-06-10 00:28:52,572 Gossiper.java (line 823) InetAddress /10.156.1.2 is now DOWN ERROR [GossipTasks:1] 2014-06-10 00:28:52,574 AbstractStreamSession.java (line 108) Stream failed because /10.156.1.2 died or was restarted/removed (streams may still be active in background, but further streams won't be started) WARN [GossipTasks:1] 2014-06-10 00:28:52,574 RangeStreamer.java (line 246) Streaming from /10.156.1.2 failed INFO [HANDSHAKE-/10.156.1.2] 2014-06-10 00:28:57,922 OutboundTcpConnection.java (line 418) Handshaking version with /10.156.1.2 INFO [GossipStage:1] 2014-06-10 00:28:57,943 Gossiper.java (line 809) InetAddress /10.156.1.2 is now UP This brief interruption was enough to kill the streaming from node 10.156.1.2. Node 10.156.1.2 saw a similar "broken pipe" exception from the bootstrapping node: ERROR [Streaming to /10.156.193.1.3] 2014-06-10 01:22:02,345 CassandraDaemon.java (line 191) Exception in thread Thread[Streaming to / 10.156.1.3:1,5,main] java.lang.RuntimeException: java.io.IOException: Broken pipe at com.google.common.base.Throwables.propagate(Throwables.java:160) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Caused by: java.io.IOException: Broken pipe at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:420) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:552) at org.apache.cassandra.streaming.compress.CompressedFileStreamTask.stream(CompressedFileStreamTask.java:93) at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) During bootstrapping we notice a significant spike in CPU and latency across the board on the ring (CPU 50->85% and write latencies 60ms -> 250ms). It seems likely that this persistent high load led to the hiccup that caused the gossiper to see the streaming node as briefly down. What is the proper way to recover from this? The original estimate was almost 24 hours to stream all the data required to bootstrap this single node (streaming set to unlimited) and this occurred 6 hours into the bootstrap. With such high load from streaming it seems that simply restarting will inevitably hit this problem again. Cheers, Mike -- Mike Heffner Librato, Inc.
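A hedged sketch of the usual recovery path for a bootstrap that failed partway on 1.2.x: stop the joining node, clear its partial state, throttle streaming on the established nodes so they aren't pushed into long pauses, and consider raising phi_convict_threshold so a brief hiccup isn't treated as node death. The data directory paths assume the default package layout, and the throttle value is only an example.

# On the failed bootstrapping node: stop it and wipe the partial state before retrying
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
# On the established nodes: cap outbound streaming instead of leaving it unlimited (value in megabits/sec);
# if your nodetool lacks this command, set stream_throughput_outbound_megabits_per_sec in cassandra.yaml instead
nodetool setstreamthroughput 200
# Optionally raise phi_convict_threshold (cassandra.yaml, default 8) on all nodes, e.g. to 12,
# then start the new node again to re-bootstrap
sudo service cassandra start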