If you are getting a timeout on one table, then a mismatch of RF and node count doesn't seem as likely.
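
For reference, here is the arithmetic behind that remark as a small sketch. A quorum is floor(RF / 2) + 1 replicas, and LOCAL_QUORUM applies that to the replication factor of the local DC. With the RF of 3 reported in this thread, that gives the "2 responses were required" in the error. The function names are mine, purely for illustration, and are not part of any driver API.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for QUORUM / LOCAL_QUORUM: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def replicas_that_can_be_down(replication_factor: int) -> int:
    """How many replicas of a partition can be unavailable while quorum still succeeds."""
    return replication_factor - quorum(replication_factor)


# RF = 3 is the value reported in this thread; the other values are only for comparison.
for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}, can lose {replicas_that_can_be_down(rf)}")
```

So at RF 3 a single slow or unreachable replica is tolerated, and the timeout means two of the three replicas for that key failed to answer in time.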
Time to look at your query. You said it was a 'select * from table where key=?' type query. Next, I would use the trace facility in cqlsh to investigate further. That's a good way to track down hard-to-find issues. You should be looking for a clear ledge where you go from single-digit ms to 4- or 5-digit ms times.

The other place to look is your data model for that table, if you want to post the output from a desc table.

Patrick

On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <[email protected]> wrote:

> On further analysis, this issue happens only on 1 table in the KS which
> has the max reads.
>
> @Atul, I will look at system health, but didn't see anything standing out
> from GC logs (using JDK 1.8_92 with G1GC).
>
> @Patrick, could you please elaborate the "mismatch on node count + RF"
> part.
>
> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <[email protected]>
> wrote:
>
>> There could be many reasons for this if it is intermittent: CPU usage +
>> I/O wait status. As reads are I/O intensive, your IOPS requirement should
>> be met at that time's load. It could be a heap issue if the CPU is busy
>> only with GC. Network health could also be the reason. So it is better to
>> look at system health during the time when the timeout occurs.
>>
>> Atul Saroha
>> Lead Software Engineer
>>
>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <[email protected]>
>> wrote:
>>
>>> Hi Patrick,
>>>
>>> The nodetool status shows all nodes up and normal now. From OpsCenter
>>> "Event Log", there are some nodes reported as being down/up etc. during
>>> the timeframe of timeout, but these are Search workload nodes from the
>>> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <[email protected]>
>>> wrote:
>>>
>>>> You aren't achieving quorum on your reads, as the error explains.
>>>> That means you either have some nodes down or your topology is not
>>>> matching up. The fact you are using LOCAL_QUORUM might point to a
>>>> datacenter mismatch on node count + RF.
>>>>
>>>> What does your nodetool status look like?
>>>>
>>>> Patrick
>>>>
>>>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We recently started getting intermittent timeouts on primary key
>>>>> queries (select * from table where key=<key>).
>>>>>
>>>>> The error is: com.datastax.driver.core.exceptions.ReadTimeoutException:
>>>>> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
>>>>> responses were required but only 1 replica responded)
>>>>>
>>>>> The same query would work fine when tried directly from cqlsh. There
>>>>> are no indications in system.log for the table in question, though there
>>>>> were compactions in progress for tables in another keyspace which is more
>>>>> frequently accessed.
>>>>>
>>>>> My understanding is that the chances of primary key queries timing out
>>>>> are minimal. Please share the possible reasons / ways to debug this
>>>>> issue.
>>>>>
>>>>> We are using Cassandra 2.1 (DSE 4.8.7).
>>>>>
>>>>> Thanks,
>>>>> Joseph
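
A rough sketch of the "ledge" hunt described in the reply at the top of this message. In cqlsh, TRACING ON makes each subsequent query print its trace events with an elapsed time. The helper below assumes you have copied those events from one node into a list of (activity, elapsed microseconds) pairs. The event names and timings here are made up for illustration and are not real trace output, and `largest_gap` is a hypothetical helper, not part of cqlsh or the driver.

```python
from typing import List, Optional, Tuple

# (activity description, elapsed microseconds since the request started)
TraceEvent = Tuple[str, int]


def largest_gap(events: List[TraceEvent]) -> Optional[Tuple[str, str, int]]:
    """Return (activity_before, activity_after, gap_us) for the biggest jump
    between consecutive events, or None if there are fewer than two events."""
    if len(events) < 2:
        return None
    ordered = sorted(events, key=lambda event: event[1])
    best = None
    for (prev_activity, prev_us), (activity, us) in zip(ordered, ordered[1:]):
        gap = us - prev_us
        if best is None or gap > best[2]:
            best = (prev_activity, activity, gap)
    return best


# Hypothetical events, invented to show the shape of the problem.
sample_events = [
    ("parse statement", 120),
    ("prepare statement", 310),
    ("send read to replica", 900),
    ("merge memtables and sstables", 2_400),
    ("read complete on replica", 4_812_000),
    ("respond to client", 4_813_500),
]

before, after, gap_us = largest_gap(sample_events)
print(f"largest jump: {gap_us / 1000:.1f} ms between '{before}' and '{after}'")
```

Whatever sits on either side of the largest jump is where to dig next, for example the sstable read path or the data model for that table.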
