Patrick,

The desc table output is below (only col names changed):
CREATE TABLE db.tbl (
    id1 text,
    id2 text,
    id3 text,
    id4 text,
    f1 text,
    f2 map<text, text>,
    f3 map<text, text>,
    created timestamp,
    updated timestamp,
    PRIMARY KEY (id1, id2, id3, id4)
) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'sstable_size_in_mb': '50', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.1
    AND speculative_retry = '99.0PERCENTILE';

and the query is:

select * from tbl where id1=? and id2=? and id3=? and id4=?

The timeouts happen within ~2s to ~5s, while the successful calls have an avg of 8ms and a p99 of 15ms. These times are as seen from the app side; the actual query times would be slightly lower.

Is there a way to capture traces only when queries take longer than a specified duration? We can't enable tracing in production given the volume of traffic. We also see that the same query which timed out works fine later, so I'm not sure the trace of a successful run would help.
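What I had in mind on the app side is something like the Java driver's QueryLogger with a constant slow-query threshold, roughly along these lines (just a sketch; the contact point and the 1000 ms threshold are placeholders, and I haven't verified the exact builder signature for our driver version):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.QueryLogger;

    public class SlowQueryLogging {
        public static void main(String[] args) {
            // Placeholder contact point for one of our local-DC nodes
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")
                    .build();

            // Log only statements that take longer than 1000 ms.
            // Driver 2.1.x takes the Cluster in builder(); 3.x uses QueryLogger.builder().
            QueryLogger slowQueryLogger = QueryLogger.builder(cluster)
                    .withConstantThreshold(1000)
                    .build();
            cluster.register(slowQueryLogger);

            // Slow statements are logged at DEBUG level to the
            // com.datastax.driver.core.QueryLogger.SLOW logger, so that
            // category needs DEBUG enabled in the logging config.
        }
    }

That would only tell us which statements were slow on the client side, not give a server-side trace, but it might at least narrow down the time windows.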
Thanks,
Joseph

On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:

> If you are getting a timeout on one table, then a mismatch of RF and node
> count doesn't seem as likely.
>
> Time to look at your query. You said it was a 'select * from table where
> key=?' type query. I would next use the trace facility in cqlsh to
> investigate further. That's a good way to find hard-to-find issues. You
> should be looking for a clear ledge where you go from single-digit ms to
> 4- or 5-digit ms times.
>
> The other place to look is your data model for that table, if you want to
> post the output from a desc table.
>
> Patrick
>
> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <jaalex.t...@gmail.com> wrote:
>
>> On further analysis, this issue happens only on 1 table in the KS, which
>> has the max reads.
>>
>> @Atul, I will look at system health, but didn't see anything standing out
>> from the GC logs (using JDK 1.8_92 with G1GC).
>>
>> @Patrick, could you please elaborate on the "mismatch on node count + RF"
>> part?
>>
>> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <atul.sar...@snapdeal.com> wrote:
>>
>>> There could be many reasons for this if it is intermittent: CPU usage,
>>> I/O wait status. As reads are I/O intensive, your IOPS requirement should
>>> be met under that load. It could be a heap issue if the CPU is busy only
>>> with GC. Network health could also be the reason. So it's better to look
>>> at system health during the time when it happens.
>>>
>>> ---------------------------------------------------------------------
>>> Atul Saroha
>>> Lead Software Engineer
>>> M: +91 8447784271  T: +91 124-415-6069  EXT: 12369
>>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>>> Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>>>
>>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <jaalex.t...@gmail.com> wrote:
>>>
>>>> Hi Patrick,
>>>>
>>>> The nodetool status shows all nodes up and normal now. From the OpsCenter
>>>> "Event Log", there are some nodes reported as being down/up etc. during
>>>> the timeframe of the timeout, but these are Search workload nodes from
>>>> the remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>>>>
>>>> Thanks,
>>>> Joseph
>>>>
>>>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>
>>>>> You aren't achieving quorum on your reads, as the error explains.
>>>>> That means you either have some nodes down or your topology is not
>>>>> matching up. The fact that you are using LOCAL_QUORUM might point to a
>>>>> datacenter mismatch on node count + RF.
>>>>>
>>>>> What does your nodetool status look like?
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <jaalex.t...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We recently started getting intermittent timeouts on primary key
>>>>>> queries (select * from table where key=<key>).
>>>>>>
>>>>>> The error is: com.datastax.driver.core.exceptions.ReadTimeoutException:
>>>>>> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
>>>>>> responses were required but only 1 replica responded)
>>>>>>
>>>>>> The same query would work fine when tried directly from cqlsh. There
>>>>>> are no indications in system.log for the table in question, though there
>>>>>> were compactions in progress for tables in another keyspace which is more
>>>>>> frequently accessed.
>>>>>>
>>>>>> My understanding is that the chances of primary key queries timing
>>>>>> out are very minimal. Please share the possible reasons / ways to debug
>>>>>> this issue.
>>>>>>
>>>>>> We are using Cassandra 2.1 (DSE 4.8.7).
>>>>>>
>>>>>> Thanks,
>>>>>> Joseph
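P.S. One thing I did come across is nodetool's trace sampling, which (if I understand it correctly) traces only a configurable fraction of the requests coordinated by the node it's run on, e.g.:

    nodetool settraceprobability 0.001    # trace roughly 0.1% of requests coordinated by this node
    nodetool settraceprobability 0        # turn sampling off again

Would a small sample like that be reasonable at our traffic volume, or would you still avoid it in production? I also vaguely recall DSE's Performance Service having a CQL slow log (cql_slow_log_options in dse.yaml), but I haven't checked whether that applies to 4.8.7.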