Falling far behind on compaction is a hard situation to recover from: it means you're writing data faster than your cluster can absorb it. The right path forward depends on a lot of factors, but in general you either need more or bigger servers, or else you need to write less data.
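To get a feel for how far behind you actually are, it's worth checking each node first; a minimal check looks something like this (plain nodetool, nothing exotic):

    # Outstanding compactions on this node. The thread below reports
    # "pending tasks: 3094", far beyond the healthy range of 0-5.
    nodetool compactionstats

    # The CompactionExecutor line in tpstats gives a second view of the same backlog.
    nodetool tpstats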
Safely adding servers is actually hard in this situation. With this much aggressive compaction going on, bootstrapping new nodes (growing your cluster) causes a lot of over-streaming: data that is in the middle of being compacted may be streamed multiple times - once from the old SSTable, again from the new post-compaction SSTable, and possibly again from yet another post-compaction SSTable. For a healthy cluster that's a trivial amount of over-streaming. For an unhealthy cluster like this, the joining node may never actually complete streaming and successfully join before its disks fill up.

If you can afford the space and don't already have things set up this way, disable compression and switch to size-tiered compaction (to be safe with size-tiered you'll need to keep at least 50% of your disk space free). Also, nodetool setcompactionthroughput will let you open the floodgates to try to catch up on compaction quickly, at the cost of read and write performance in the cluster.

If you still can't catch up on compaction, you have a very serious problem. You need to either reduce your write volume, or grow your cluster unsafely (disable bootstrapping on the new nodes) to reduce write pressure on your existing nodes. Either way, you should get caught up on compaction before it is safe to add new nodes again. If you grow unsafely, you are effectively electing to discard data. Some of it may be recoverable with a nodetool repair after you're caught up on compaction, but you will almost certainly lose some records.
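To make that concrete, here is a rough sketch of those knobs. The table name is borrowed from the schema quoted further down, the CQL uses the 2.1-era syntax for disabling compression, and you should double-check both against your own cluster before running anything:

    # Remove the compaction throughput cap (0 = unthrottled); expect
    # read and write latency to suffer while you catch up.
    nodetool setcompactionthroughput 0

    # Switch the example table to size-tiered compaction and turn off
    # compression (pre-3.0 syntax); you can also paste the ALTER into cqlsh.
    cqlsh -e "ALTER TABLE drev_maelstrom.customer_events
              WITH compaction = {'class': 'SizeTieredCompactionStrategy'}
              AND compression = {'sstable_compression': ''};"

    # "Growing unsafely" means starting the new node with auto_bootstrap: false
    # in cassandra.yaml, then repairing once compaction has caught up:
    nodetool repair drev_maelstrom

Repair can only copy back what the other replicas still hold, which is why some records will almost certainly be lost if you go that route.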
On Tue, Jan 13, 2015 at 2:22 AM, Ja Sam <ptrstp...@gmail.com> wrote:

> Ad 4) For sure I have a big problem, because pending tasks: 3094.
>
> The question is what should I change/monitor? I can present my whole
> solution design, if it helps.
>
> On Mon, Jan 12, 2015 at 8:32 PM, Ja Sam <ptrstp...@gmail.com> wrote:
>
>> To be precise about your remarks:
>>
>> 1) About the 30-second GC: I know that my cluster has had this problem
>> for some time. We added the "magic" flag, but the result will only be
>> visible in ~2 weeks (as I presented in the screenshot on StackOverflow).
>> If you have any idea how to fix/diagnose this problem, I will be very
>> grateful.
>>
>> 2) It is probably true, but I don't think that I can change it. Our data
>> centers are in different places and the network between them is not
>> perfect. But as far as we have observed, network partitions happen
>> rarely - at most once a week, for about an hour.
>>
>> 3) We are trying to do regular repairs (incremental), but usually they
>> do not finish. Even local repairs have problems finishing.
>>
>> 4) I will check it as soon as possible and post it here. If you have any
>> suggestion what else I should check, you are welcome :)
>>
>> On Mon, Jan 12, 2015 at 7:28 PM, Eric Stevens <migh...@gmail.com> wrote:
>>
>>> If you're getting 30-second GCs, this all by itself could, and probably
>>> does, explain the problem.
>>>
>>> If you're writing exclusively to A, and there are frequent partitions
>>> between A and B, then A is potentially working a lot harder than B,
>>> because it needs to keep track of hinted handoffs to replay to B
>>> whenever connectivity is restored. It's also acting as coordinator for
>>> writes which need to end up in B eventually. This in turn may be a
>>> significant contributing factor to your GC pressure in A.
>>>
>>> I'd also grow suspicious of the integrity of B as a reliable backup of
>>> A unless you're running repair on a regular basis.
>>>
>>> Also, if you have thousands of SSTables, then you're probably falling
>>> behind on compaction; check nodetool compactionstats - you should
>>> typically have < 5 outstanding tasks (preferably 0-1). If you're not
>>> behind on compaction, your sstable_size_in_mb might be a bad value for
>>> your use case.
>>>
>>> On Mon, Jan 12, 2015 at 7:35 AM, Ja Sam <ptrstp...@gmail.com> wrote:
>>>
>>>> *Environment*
>>>>
>>>> - Cassandra 2.1.0
>>>> - 5 nodes in one DC (DC_A), 4 nodes in a second DC (DC_B)
>>>> - 2,500 writes per second; I write only to DC_A with local_quorum
>>>> - minimal reads (usually none, sometimes a few)
>>>>
>>>> *Problem*
>>>>
>>>> After a few weeks of running I cannot read any data from my cluster,
>>>> because I get ReadTimeoutExceptions like the following:
>>>>
>>>> ERROR [Thrift:15] 2015-01-07 14:16:21,124 CustomTThreadPoolServer.java:219
>>>> - Error occurred during processing of message.
>>>> com.google.common.util.concurrent.UncheckedExecutionException:
>>>> java.lang.RuntimeException:
>>>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out
>>>> - received only 2 responses.
>>>>
>>>> To be precise, this is not the only problem in my cluster. The second
>>>> one was described here: Cassandra GC takes 30 seconds and hangs node
>>>> <http://stackoverflow.com/questions/27843538/cassandra-gc-takes-30-seconds-and-hangs-node>
>>>> and I will try to use the fix from CASSANDRA-6541
>>>> <http://issues.apache.org/jira/browse/CASSANDRA-6541> as leshkin
>>>> suggested.
>>>>
>>>> *Diagnosis*
>>>>
>>>> I tried to use some of the tools presented at
>>>> http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/
>>>> by Jon Haddad and got some strange results.
>>>>
>>>> I ran the same query in DC_A and DC_B with tracing enabled. The query
>>>> is simple:
>>>>
>>>> SELECT * FROM X.customer_events WHERE customer='1234567' AND
>>>> utc_day=16447 AND bucket IN (1,2,3,4,5,6,7,8,9,10);
>>>>
>>>> The table is defined as follows:
>>>>
>>>> CREATE TABLE drev_maelstrom.customer_events (customer text, utc_day
>>>> int, bucket int, event_time bigint, event_id blob, event_type int,
>>>> event blob,
>>>> PRIMARY KEY ((customer, utc_day, bucket), event_time, event_id,
>>>> event_type)[...]
>>>>
>>>> Results of the query:
>>>>
>>>> 1) In DC_B the query finished in less than 0.22 seconds. In DC_A it
>>>> took more than 2.5 seconds (~10 times longer). -> the problem is that
>>>> bucket can be in the range from -128 to 256.
>>>>
>>>> 2) In DC_B it checked ~1000 SSTables, with lines like:
>>>>
>>>> Bloom filter allows skipping sstable 50372 [SharedPool-Worker-7] |
>>>> 2015-01-12 13:51:49.467001 | 192.168.71.198 | 4782
>>>>
>>>> whereas in DC_A it is:
>>>>
>>>> Bloom filter allows skipping sstable 118886 [SharedPool-Worker-5] |
>>>> 2015-01-12 14:01:39.520001 | 192.168.61.199 | 25527
>>>>
>>>> 3) The total number of records in both DCs was the same.
>>>>
>>>> *Question*
>>>>
>>>> The question is quite simple: how can I speed up DC_A? It is my
>>>> primary DC, DC_B is mostly for backup, and there are a lot of network
>>>> partitions between A and B.
>>>>
>>>> Maybe I should check something more, but I just don't have an idea
>>>> what it should be.
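A couple of follow-up checks that would help quantify the SSTable counts seen in the traces above; the keyspace and table names are taken from the quoted schema, so adjust them for your cluster:

    # Per-table SSTable count and disk use on this node.
    nodetool cfstats drev_maelstrom.customer_events

    # Distribution of SSTables touched per read; large values here line up
    # with the slow, high-SSTable-count traces in DC_A.
    nodetool cfhistograms drev_maelstrom customer_events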