Hi Ryan,

Attached are the cfhistograms runs, taken within a few minutes of each other. On the surface, I don't see anything that indicates too much skewing (assuming skewing == keys spread across many SSTables). Please confirm. Related to this, what does the "cell count" metric indicate? I didn't find a clear explanation in the documentation.
Thanks,
Joseph

On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:

> Have you looked at cfhistograms/tablehistograms? Your data may just be
> skewed (the most likely explanation is probably the correct one here).
>
> Regards,
>
> Ryan Svihla
>
> _____________________________
> From: Joseph Tech <jaalex.t...@gmail.com>
> Sent: Wednesday, August 31, 2016 11:16 PM
> Subject: Re: Read timeouts on primary key queries
> To: <user@cassandra.apache.org>
>
>
> Patrick,
>
> The desc table is below (only col names changed):
>
> CREATE TABLE db.tbl (
>     id1 text,
>     id2 text,
>     id3 text,
>     id4 text,
>     f1 text,
>     f2 map<text, text>,
>     f3 map<text, text>,
>     created timestamp,
>     updated timestamp,
>     PRIMARY KEY (id1, id2, id3, id4)
> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'sstable_size_in_mb': '50',
>         'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.0
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.1
>     AND speculative_retry = '99.0PERCENTILE';
>
> and the query is: select * from tbl where id1=? and id2=? and id3=? and
> id4=?
>
> The timeouts happen within ~2s to ~5s, while the successful calls have an
> avg of 8ms and a p99 of 15s. These times are seen from the app side; the
> actual query times would be slightly lower.
>
> Is there a way to capture traces only when queries take longer than a
> specified duration? We can't enable tracing in production given the volume
> of traffic. We see that the same query which timed out works fine later,
> so I'm not sure if the trace of a successful run would help.
>
> Thanks,
> Joseph
>
>
> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfa...@gmail.com>
> wrote:
>
>> If you are getting a timeout on one table, then a mismatch of RF and node
>> count doesn't seem as likely.
>>
>> Time to look at your query. You said it was a 'select * from table where
>> key=?' type query. I would next use the trace facility in cqlsh to
>> investigate further. That's a good way to find hard-to-find issues. You
>> should be looking for a clear ledge where you go from single-digit ms to
>> 4- or 5-digit ms times.
>>
>> The other place to look is your data model for that table, if you want to
>> post the output from a desc table.
>>
>> Patrick
>>
>>
>>
>> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <jaalex.t...@gmail.com>
>> wrote:
>>
>>> On further analysis, this issue happens only on 1 table in the KS, which
>>> has the max reads.
>>>
>>> @Atul, I will look at system health, but didn't see anything standing
>>> out from the GC logs (using JDK 1.8_92 with G1GC).
>>>
>>> @Patrick, could you please elaborate on the "mismatch on node count + RF"
>>> part.
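On the question quoted above about capturing traces only when a query runs longer than a specified duration: one option is to sample tracing on the client and only log the trace when the measured latency crosses a threshold. A rough sketch, assuming the DataStax Java driver 2.x suggested by the com.datastax.driver.core exception earlier in the thread; the class name, threshold and sample rate are illustrative, not anything we run today:

import java.util.concurrent.ThreadLocalRandom;

import com.datastax.driver.core.ExecutionInfo;
import com.datastax.driver.core.QueryTrace;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

// Times each query on the client, logs anything slower than a threshold, and
// attaches a server-side trace to a small random sample of requests so that
// slow queries can be inspected without tracing all production traffic.
public class SlowQueryProbe {

    private static final long SLOW_THRESHOLD_MS = 500;     // illustrative threshold
    private static final double TRACE_SAMPLE_RATE = 0.001; // trace ~0.1% of requests

    private final Session session;

    public SlowQueryProbe(Session session) {
        this.session = session;
    }

    public ResultSet execute(Statement statement) {
        // Tracing has to be enabled before execution, so sample up front.
        boolean traced = ThreadLocalRandom.current().nextDouble() < TRACE_SAMPLE_RATE;
        if (traced) {
            statement.enableTracing();
        }

        long startNanos = System.nanoTime();
        ResultSet rs = session.execute(statement);
        long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;

        if (elapsedMs >= SLOW_THRESHOLD_MS) {
            System.err.printf("Slow query (%d ms): %s%n", elapsedMs, statement);
            if (traced) {
                // Fetching the trace issues extra reads against system_traces,
                // so only do it for the slow, sampled requests.
                ExecutionInfo info = rs.getExecutionInfo();
                QueryTrace trace = info.getQueryTrace();
                if (trace != null) {
                    for (QueryTrace.Event event : trace.getEvents()) {
                        System.err.printf("  %s [%d us] on %s%n",
                                event.getDescription(),
                                event.getSourceElapsedMicros(),
                                event.getSource());
                    }
                }
            }
        }
        return rs;
    }
}

The driver's QueryLogger (registered on the Cluster, with a configurable slow-query threshold) and server-side sampling via "nodetool settraceprobability" may also be worth a look as lighter-weight alternatives to wrapping execute().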
>>> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <atul.sar...@snapdeal.com>
>>> wrote:
>>>
>>>> There could be many reasons for this if it is intermittent: CPU usage
>>>> and I/O wait status -- as reads are I/O intensive, your IOPS requirement
>>>> should be met under that load; a heap issue, if the CPU is busy with GC
>>>> only; or network health. So it is better to look at system health during
>>>> the time when it happens.
>>>>
>>>> ---------------------------------------------------------------------
>>>> Atul Saroha
>>>> *Lead Software Engineer*
>>>> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
>>>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>>>> Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>>>>
>>>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <jaalex.t...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Patrick,
>>>>>
>>>>> The nodetool status shows all nodes up and normal now. From the
>>>>> OpsCenter "Event Log", there are some nodes reported as being down/up
>>>>> etc. during the timeframe of the timeouts, but these are Search workload
>>>>> nodes from the remote (non-local) DC. The RF is 3 and there are 9 nodes
>>>>> per DC.
>>>>>
>>>>> Thanks,
>>>>> Joseph
>>>>>
>>>>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pmcfa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> You aren't achieving quorum on your reads, as the error explains.
>>>>>> That means you either have some nodes down or your topology is not
>>>>>> matching up. The fact that you are using LOCAL_QUORUM might point to a
>>>>>> datacenter mismatch on node count + RF.
>>>>>>
>>>>>> What does your nodetool status look like?
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <jaalex.t...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We recently started getting intermittent timeouts on primary key
>>>>>>> queries (select * from table where key=<key>).
>>>>>>>
>>>>>>> The error is: com.datastax.driver.core.exceptions.ReadTimeoutException:
>>>>>>> Cassandra timeout during read query at consistency LOCAL_QUORUM
>>>>>>> (2 responses were required but only 1 replica responded)
>>>>>>>
>>>>>>> The same query would work fine when tried directly from cqlsh. There
>>>>>>> are no indications in system.log for the table in question, though
>>>>>>> there were compactions in progress for tables in another keyspace
>>>>>>> which is more frequently accessed.
>>>>>>>
>>>>>>> My understanding is that the chances of primary key queries timing
>>>>>>> out are very minimal. Please share possible reasons / ways to debug
>>>>>>> this issue.
>>>>>>>
>>>>>>> We are using Cassandra 2.1 (DSE 4.8.7).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Joseph
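Since the error is a LOCAL_QUORUM read timeout and the nodes flapping in OpsCenter were all in the remote DC, it may also be worth double-checking the driver-side cluster configuration: LOCAL_QUORUM is evaluated in the coordinator's DC, so keeping coordinators pinned to the local DC keeps the remote Search nodes out of the picture. A minimal sketch, assuming the DataStax Java driver 2.1; the contact point, DC name, keyspace and timeout values are placeholders, not our actual settings:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class ClusterSetup {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")  // placeholder contact point
                // Route requests to coordinators in the local DC only, and make
                // them token-aware so primary-key reads land on a replica.
                // "DC1" is a placeholder local DC name (driver 2.1 constructor).
                .withLoadBalancingPolicy(
                        new TokenAwarePolicy(new DCAwareRoundRobinPolicy("DC1")))
                // Default all statements to LOCAL_QUORUM so only local replicas
                // are counted toward the quorum.
                .withQueryOptions(
                        new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
                // Client-side read timeout slightly above the server's
                // read_request_timeout_in_ms, so server-side timeouts surface first.
                .withSocketOptions(
                        new SocketOptions().setReadTimeoutMillis(6000))
                .build();

        Session session = cluster.connect("db");
        // ... prepared statements and application logic ...
        cluster.close();
    }
}

This is only a configuration sanity check, not a fix for the underlying intermittent timeouts; if the load balancing policy already looks like the above, the histograms and traces remain the next place to dig.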