Re: Read timeouts on primary key queries

Ryan Svihla Thu, 01 Sep 2016 06:10:30 -0700

Have you looked at cfhistograms/tablehistograms? Your data may just be skewed 
(the most likely explanation is probably the correct one here).

Regards,
Ryan Svihla

                _____________________________
From: Joseph Tech <[email protected]>
Sent: Wednesday, August 31, 2016 11:16 PM
Subject: Re: Read timeouts on primary key queries
To:  <[email protected]>

Patrick,
The desc table is below (only col names changed):

CREATE TABLE db.tbl (
    id1 text,
    id2 text,
    id3 text,
    id4 text,
    f1 text,
    f2 map<text, text>,
    f3 map<text, text>,
    created timestamp,
    updated timestamp,
    PRIMARY KEY (id1, id2, id3, id4)
) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'sstable_size_in_mb': '50', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.1
    AND speculative_retry = '99.0PERCENTILE';

and the query is: select * from tbl where id1=? and id2=? and id3=? and id4=?
The timeouts happen within ~2s to ~5s, while the successful calls have avg of 
8ms and p99 of 15s. These times are seen from app side, the actual query times 
would be slightly lower. 
Is there a way to capture traces only when queries take longer than a specified 
duration? We can't enable tracing in production given the volume of traffic. 
We see that the same query which timed out works fine later, so we are not sure 
whether the trace of a successful run would help.
Thanks,
Joseph
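
The question above asks for capturing detail only when a query exceeds a duration. As a minimal client-side sketch of that idea (not a Cassandra feature, and not a substitute for a server-side trace), a wrapper can time each call and keep a record only when it is slow or fails. The class name `SlowCallRecorder`, its `threshold_s` parameter, and the `label` argument are all hypothetical. The recorded keys could then be replayed in cqlsh with tracing on, or tracing could be sampled at a low rate with `nodetool settraceprobability`.

```python
import time
from typing import Any, Callable, List, Tuple


class SlowCallRecorder:
    """Record only the calls that fail or exceed a duration threshold.

    Hypothetical helper for illustration; timings are measured on the
    client, so they include network and driver overhead.
    """

    def __init__(self, threshold_s: float,
                 clock: Callable[[], float] = time.perf_counter) -> None:
        self.threshold_s = threshold_s
        self.clock = clock  # injectable so the behaviour can be tested
        self.slow_calls: List[Tuple[str, float, bool]] = []

    def run(self, label: str, fn: Callable[..., Any],
            *args: Any, **kwargs: Any) -> Any:
        """Call fn, and keep (label, elapsed_s, failed) if slow or failed."""
        start = self.clock()
        failed = False
        try:
            return fn(*args, **kwargs)
        except Exception:
            failed = True
            raise
        finally:
            elapsed = self.clock() - start
            if failed or elapsed >= self.threshold_s:
                self.slow_calls.append((label, elapsed, failed))
```

In an application, `fn` would be whatever function executes the read, and `label` would carry the key values (id1 to id4) so that the slow requests can be identified afterwards.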

On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <[email protected]> wrote:
If you are getting a timeout on one table, then a mismatch of RF and node count 
doesn't seem as likely. 
Time to look at your query. You said it was a 'select * from table where key=?' 
type query. I would next use the trace facility in cqlsh to investigate 
further. That's a good way to track down hard-to-find issues. You should be 
looking for a clear ledge where you go from single-digit ms to 4- or 5-digit ms 
times. 
The other place to look is your data model for that table, if you want to post 
the output from a desc table.
Patrick
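
The "ledge" described above is the largest jump in elapsed time between two consecutive trace events. A rough sketch of finding it, assuming the trace rows have already been copied out as (activity, elapsed microseconds) pairs from a single source node in order; the function name `largest_gap` and this input shape are assumptions, not part of cqlsh:

```python
from typing import List, Tuple


def largest_gap(events: List[Tuple[str, int]]) -> Tuple[int, str, str]:
    """Return (gap_us, activity_before, activity_after) for the biggest jump.

    events: (activity, elapsed_us) pairs in trace order, all taken from
    the same source node so that the elapsed values are comparable.
    """
    if len(events) < 2:
        raise ValueError("need at least two trace events")
    best = None
    for (prev_activity, prev_us), (activity, us) in zip(events, events[1:]):
        gap = us - prev_us
        if best is None or gap > best[0]:
            best = (gap, prev_activity, activity)
    return best
```

The step on the "after" side of the largest gap is the one to inspect first.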

On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <[email protected]> wrote:
On further analysis, this issue happens only on one table in the KS, the one 
with the most reads. 
@Atul, I will look at system health, but didn't see anything standing out in 
the GC logs (using JDK 1.8_92 with G1GC). 
@Patrick, could you please elaborate on the "mismatch on node count + RF" part?
On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <[email protected]> wrote:
There could be many reasons for this if it is intermittent. Check CPU usage and 
I/O wait. Reads are I/O intensive, so your IOPS capacity must cover the load at 
that time. It could be a heap issue if the CPU is busy only with GC. Network 
health could also be the cause. So it is better to look at system health during 
the period when the timeouts occur.

---------------------------------------------------------------------------------------------------------------------
Atul Saroha
Lead Software Engineer
M: +91 8447784271 T: +91 124-415-6069 EXT: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <[email protected]> wrote:
Hi Patrick,
The nodetool status shows all nodes up and normal now. In the OpsCenter "Event 
Log", some nodes are reported as going down/up during the timeframe of the 
timeouts, but these are Search workload nodes from the remote (non-local) DC. 
The RF is 3 and there are 9 nodes per DC.
Thanks,
Joseph
On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <[email protected]> wrote:
You aren't achieving quorum on your reads, as the error explains. That means 
you either have some nodes down or your topology is not matching up. The fact 
that you are using LOCAL_QUORUM might point to a datacenter mismatch on node 
count + RF. 
What does your nodetool status look like?
Patrick
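
For reference, the "2 responses were required" in the error follows from the quorum formula, floor(RF / 2) + 1, applied to the replication factor of the local datacenter. A small sketch of that arithmetic (the function names are illustrative only):

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for QUORUM / LOCAL_QUORUM: floor(RF/2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerable_slow_replicas(replication_factor: int) -> int:
    """Replicas that may be down or too slow before a quorum read fails."""
    return replication_factor - quorum(replication_factor)
```

With the RF of 3 reported elsewhere in this thread, `quorum(3)` is 2 and `tolerable_slow_replicas(3)` is 1, so a read times out as soon as two of the three local replicas fail to answer in time.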
On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <[email protected]> wrote:
Hi,
We recently started getting intermittent timeouts on primary key queries 
(select * from table where key=<key>)
The error is: com.datastax.driver.core.exceptions.ReadTimeoutException: 
Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses 
were required but only 1 replica responded)
The same query would work fine when tried directly from cqlsh. There are no 
indications in system.log for the table in question, though there were 
compactions in progress for tables in another keyspace which is more frequently 
accessed. 
My understanding is that the chance of a primary key query timing out is 
minimal. Please share the possible reasons for this and ways to debug it. 

We are using Cassandra 2.1 (DSE 4.8.7).
Thanks,
Joseph
