Ad 4) I definitely have a big problem: nodetool compactionstats shows pending tasks: 3094. The question is what should I change/monitor? I can present my whole solution design if it helps.
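For reference, this is roughly how I am checking it on a node in DC_A (a minimal sketch of plain nodetool calls; the throughput value at the end is only an illustration, not something I have actually applied):

# outstanding compactions on this node (source of "pending tasks: 3094")
nodetool compactionstats

# per-table SSTable counts, to see which tables are furthest behind
nodetool cfstats drev_maelstrom.customer_events

# thread pool backlog, e.g. pending/blocked counts for CompactionExecutor
nodetool tpstats

# current compaction throughput cap (MB/s); could be raised if the disks allow it
nodetool getcompactionthroughput
nodetool setcompactionthroughput 64

If the pending count keeps growing while writes continue at ~2,500/s, I assume compaction simply cannot keep up in DC_A.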
On Mon, Jan 12, 2015 at 8:32 PM, Ja Sam <ptrstp...@gmail.com> wrote:
> To be precise about your remarks:
>
> 1) About the 30-second GC: I know my cluster developed this problem over
> time. We added the "magic" flag, but the result will only be visible in
> ~2 weeks (as I showed in the screenshot on StackOverflow). If you have
> any idea how I can fix/diagnose this problem, I will be very grateful.
>
> 2) It is probably true, but I don't think I can change it. Our data
> centers are in different places and the network between them is not
> perfect. But as far as we have observed, network partitions happen
> rarely, at most once a week for about an hour.
>
> 3) We are trying to do regular (incremental) repairs, but usually they
> do not finish. Even local repairs have problems finishing.
>
> 4) I will check it as soon as possible and post the result here. If you
> have any suggestion about what else I should check, you are welcome :)
>
> On Mon, Jan 12, 2015 at 7:28 PM, Eric Stevens <migh...@gmail.com> wrote:
>
>> If you're getting 30-second GCs, this all by itself could, and probably
>> does, explain the problem.
>>
>> If you're writing exclusively to A, and there are frequent partitions
>> between A and B, then A is potentially working a lot harder than B,
>> because it needs to keep track of hinted handoffs to replay to B
>> whenever connectivity is restored. It's also acting as coordinator for
>> writes which need to end up in B eventually. This in turn may be a
>> significant contributing factor to your GC pressure in A.
>>
>> I'd also grow suspicious of the integrity of B as a reliable backup of
>> A unless you're running repair on a regular basis.
>>
>> Also, if you have thousands of SSTables, then you're probably falling
>> behind on compaction. Check nodetool compactionstats - you should
>> typically have < 5 outstanding tasks (preferably 0-1). If you're not
>> behind on compaction, your sstable_size_in_mb might be a bad value for
>> your use case.
>>
>> On Mon, Jan 12, 2015 at 7:35 AM, Ja Sam <ptrstp...@gmail.com> wrote:
>>
>>> *Environment*
>>>
>>> - Cassandra 2.1.0
>>> - 5 nodes in one DC (DC_A), 4 nodes in a second DC (DC_B)
>>> - 2,500 writes per second; I write only to DC_A with local_quorum
>>> - minimal reads (usually none, sometimes a few)
>>>
>>> *Problem*
>>>
>>> After a few weeks of running I cannot read any data from my cluster,
>>> because I get ReadTimeoutExceptions like the following:
>>>
>>> ERROR [Thrift:15] 2015-01-07 14:16:21,124 CustomTThreadPoolServer.java:219
>>> - Error occurred during processing of message.
>>> com.google.common.util.concurrent.UncheckedExecutionException:
>>> java.lang.RuntimeException:
>>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
>>> out - received only 2 responses.
>>>
>>> To be precise, this is not the only problem in my cluster. The second
>>> one was described here: Cassandra GC takes 30 seconds and hangs node
>>> <http://stackoverflow.com/questions/27843538/cassandra-gc-takes-30-seconds-and-hangs-node>
>>> and I will try to use the fix from CASSANDRA-6541
>>> <http://issues.apache.org/jira/browse/CASSANDRA-6541> as leshkin
>>> suggested.
>>>
>>> *Diagnosis*
>>>
>>> I tried to use some of the tools presented at
>>> http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/
>>> by Jon Haddad and got some strange results.
>>>
>>> I ran the same query in DC_A and DC_B with tracing enabled.
>>> The query is simple:
>>>
>>> SELECT * FROM X.customer_events WHERE customer='1234567' AND
>>> utc_day=16447 AND bucket IN (1,2,3,4,5,6,7,8,9,10);
>>>
>>> The table is defined as follows:
>>>
>>> CREATE TABLE drev_maelstrom.customer_events (customer text, utc_day
>>> int, bucket int, event_time bigint, event_id blob, event_type int,
>>> event blob,
>>> PRIMARY KEY ((customer, utc_day, bucket), event_time, event_id,
>>> event_type)[...]
>>>
>>> Results of the query:
>>>
>>> 1) In DC_B the query finished in less than 0.22 seconds. In DC_A it
>>> took more than 2.5 seconds (~10 times longer). Note that bucket can be
>>> in the range from -128 to 256.
>>>
>>> 2) In DC_B it checked ~1000 SSTables, with lines like:
>>>
>>> Bloom filter allows skipping sstable 50372 [SharedPool-Worker-7] |
>>> 2015-01-12 13:51:49.467001 | 192.168.71.198 | 4782
>>>
>>> whereas in DC_A it is:
>>>
>>> Bloom filter allows skipping sstable 118886 [SharedPool-Worker-5] |
>>> 2015-01-12 14:01:39.520001 | 192.168.61.199 | 25527
>>>
>>> 3) The total number of records in both DCs was the same.
>>>
>>> *Question*
>>>
>>> The question is quite simple: how can I speed up DC_A? It is my primary
>>> DC; DC_B is mostly for backup, and there are a lot of network
>>> partitions between A and B.
>>>
>>> Maybe I should check something more, but I just don't have an idea
>>> what it should be.
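For completeness, the tracing comparison quoted above was done with a session roughly like this minimal cqlsh sketch (the consistency level is illustrative; the node address is simply one that appears in the DC_A trace lines above):

cqlsh 192.168.61.199
cqlsh> CONSISTENCY LOCAL_QUORUM;
cqlsh> TRACING ON;
cqlsh> SELECT * FROM X.customer_events WHERE customer='1234567' AND
       utc_day=16447 AND bucket IN (1,2,3,4,5,6,7,8,9,10);

Running the same session against a node in DC_B (e.g. 192.168.71.198 from the trace above) lets me compare the number of "Bloom filter allows skipping sstable" lines and the per-step latencies between the two DCs.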