I added the error logs and see that the timeouts are in the range of 2 to 7 s. Samples below:

Query error after 5354 ms: [4 bound values] <query>
Query error after 6658 ms: [4 bound values] <query>
Query error after 4596 ms: [4 bound values] <query>
Query error after 2068 ms: [4 bound values] <query>
Query error after 2904 ms: [4 bound values] <query>

There is no specific socket timeout set on the client side, so it would take the default of 12 s. The read_request_timeout_in_ms is set to 5 s. In this case, how do the errors happen in <5 s? Is there any other factor that would cause a fail-fast scenario during the read?
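For reference, a minimal sketch of where each timeout is configured, assuming the DataStax Java driver; the contact point and values below are only illustrative, not our production settings:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

public class ClientTimeouts {
    public static void main(String[] args) {
        // The driver's per-request read timeout defaults to 12000 ms and lives in
        // SocketOptions. It should stay above the server-side read_request_timeout_in_ms
        // (5000 ms here), otherwise the client gives up before the coordinator answers.
        SocketOptions socketOptions = new SocketOptions()
                .setReadTimeoutMillis(12000)    // the driver default, shown explicitly
                .setConnectTimeoutMillis(5000);

        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // placeholder contact point
                .withSocketOptions(socketOptions)
                .build();
        try {
            // ... connect a Session and run queries ...
        } finally {
            cluster.close();
        }
    }
}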
Thanks,
Joseph

On Wed, Sep 7, 2016 at 5:26 PM, Joseph Tech <jaalex.t...@gmail.com> wrote:

> Thanks, Romain, for the detailed explanation. We use log4j 2 and I have added the driver logging for slow/error queries; will see if it helps to provide any pattern once in Prod.

> I tried getendpoints and getsstables for some of the timed-out keys and most of them listed only 1 SSTable. There were a few which showed 2 SSTables. There is no specific trend on the keys, it's completely based on the user access, and the same keys return results instantly from cqlsh.

> On Tue, Sep 6, 2016 at 1:57 PM, Romain Hardouin <romainh...@yahoo.fr> wrote:

>> There is nothing special in the two sstablemetadata outputs, but if the timeouts are due to a network split or an overwhelmed node or something like that, you won't see anything here. That said, if you have the keys which produced the timeouts then, yes, you can look for a regular pattern (i.e. always the same keys?).

>> You can find sstables for a given key with nodetool:
>> nodetool getendpoints <keyspace> <cf> <key>
>> Then you can run the following command on one/each node of the endpoints:
>> nodetool getsstables <keyspace> <cf> <key>

>> If many sstables are shown in the previous command it means that your data is fragmented, but thanks to LCS this number should be low.

>> I think the most useful actions now would be:

>> *1)* Enable DEBUG for o.a.c.db.ConsistencyLevel. It won't spam your log file; you will see the following when errors occur:
>> - Local replicas [<endpoint1>, ...] are insufficient to satisfy LOCAL_QUORUM requirement of X live nodes in '<dc>'

>> You are using C* 2.1 but you can have a look at the C* 2.2 logback.xml: https://github.com/apache/cassandra/blob/cassandra-2.2/conf/logback.xml
>> I'm using it in production; it's better because it creates a separate debug.log file with an asynchronous appender.

>> Watch out when enabling:

>> <appender-ref ref="ASYNCDEBUGLOG" />

>> because the default logback configuration sets all of o.a.c to DEBUG:

>> <logger name="org.apache.cassandra" level="DEBUG"/>

>> Instead you can set:

>> <logger name="org.apache.cassandra" level="INFO"/>
>> <logger name="org.apache.cassandra.db.ConsistencyLevel" level="DEBUG"/>

>> Also, if you want to restrict debug.log to DEBUG level only (instead of DEBUG+INFO+...) you can add a LevelFilter to ASYNCDEBUGLOG in logback.xml:

>> <filter class="ch.qos.logback.classic.filter.LevelFilter">
>>   <level>DEBUG</level>
>>   <onMatch>ACCEPT</onMatch>
>>   <onMismatch>DENY</onMismatch>
>> </filter>

>> Thus, the debug.log file will be empty unless some consistency issues happen.

>> *2)* Enable slow query logging at the driver level with a QueryLogger:

>> Cluster cluster = ...
>> // log queries longer than 1 second, see also withDynamicThreshold
>> QueryLogger queryLogger = QueryLogger.builder(cluster).withConstantThreshold(1000).build();
>> cluster.register(queryLogger);

>> Then in your driver logback file:

>> <logger name="com.datastax.driver.core.QueryLogger.SLOW" level="DEBUG" />
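>> If you also want to see which coordinator served a slow request (to spot a per-node pattern rather than a per-key one), a rough sketch of doing it by hand around session.execute() could look like the following; the class name and the 1 s threshold are just placeholders:

>> import java.util.concurrent.TimeUnit;
>> import com.datastax.driver.core.Host;
>> import com.datastax.driver.core.ResultSet;
>> import com.datastax.driver.core.Session;
>> import com.datastax.driver.core.Statement;
>> import org.slf4j.Logger;
>> import org.slf4j.LoggerFactory;
>>
>> public class SlowQueryLog {
>>     private static final Logger LOG = LoggerFactory.getLogger(SlowQueryLog.class);
>>     private static final long SLOW_THRESHOLD_MS = 1000; // same 1 s threshold as above, adjust as needed
>>
>>     // Executes the statement, measures the client-side latency and logs the
>>     // statement plus the coordinator that was queried when the call was slow.
>>     public static ResultSet executeTimed(Session session, Statement statement) {
>>         long start = System.nanoTime();
>>         ResultSet rs = session.execute(statement);
>>         long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
>>         if (elapsedMs >= SLOW_THRESHOLD_MS) {
>>             Host coordinator = rs.getExecutionInfo().getQueriedHost();
>>             LOG.warn("Slow query ({} ms) on coordinator {}: {}", elapsedMs, coordinator, statement);
>>         }
>>         return rs;
>>     }
>> }

>> This only measures what the application sees, but correlating the slow calls with the coordinator is often enough to tell a data-model problem from a single misbehaving node.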
>> *3)* And/or: you mentioned that you use DSE, so you can enable slow query logging in dse.yaml (cql_slow_log_options).

>> Best,

>> Romain

>> On Mon, Sep 5, 2016 at 8:05 PM, Joseph Tech <jaalex.t...@gmail.com> wrote:

>> Attached are the sstablemetadata outputs from 2 SSTables of size 28 MB and 52 MB (out2). The records are inserted with different TTLs based on their nature; test records with 1 day, typeA records with 6 months, typeB records with 1 year, etc. There are also explicit DELETEs from this table, though at a much lower rate than the inserts.

>> I am not sure how to interpret this output, or if the right SSTables were picked. Please advise. Is there a way to get the sstables corresponding to the keys that timed out, even though they are accessible later?

>> On Mon, Sep 5, 2016 at 10:58 PM, Anshu Vajpayee <anshu.vajpa...@gmail.com> wrote:

>> We have seen read timeout issues in Cassandra due to a high droppable tombstone ratio for a repository.

>> Please check for a high droppable tombstone ratio for your repo.

>> On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <romainh...@yahoo.fr> wrote:

>> Yes, dclocal_read_repair_chance will reduce the cross-DC traffic and latency, so you can swap the values (https://issues.apache.org/jira/browse/CASSANDRA-7320). I guess sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the default size was way too small: 5 MB. So maybe someone in your company tried "10 * the default", i.e. 50 MB. Now the default is 160 MB. I'm not saying to change the value, but just keep in mind that you're using a small value here; it could help you someday.

>> Regarding the cells, the histograms show an *estimation* of the min, p50, ..., p99, max of cells based on SSTable metadata. On your screenshot, the max is 4768, so you have a partition key with ~4768 cells. The p99 is 1109, so 99% of your partition keys have less than (or equal to) 1109 cells. You can see these data for a given sstable with the tool sstablemetadata.

>> Best,

>> Romain

>> On Mon, Sep 5, 2016 at 3:17 PM, Joseph Tech <jaalex.t...@gmail.com> wrote:

>> Thanks, Romain. We will try to enable the DEBUG logging (assuming it won't clog the logs much). Regarding the table configs, read_repair_chance must be carried over from older versions - mostly defaults. I think sstable_size_in_mb was set to limit the max SSTable size, though I am not sure of the reason for the 50 MB value.

>> Does setting dclocal_read_repair_chance help in reducing cross-DC traffic (haven't looked into this parameter, just going by the name)?

>> By the cell count definition: is it incremented based on the number of writes for a given name (key?) and value? This table is heavy on reads and writes. If so, shouldn't the value be much higher?

>> On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <romainh...@yahoo.fr> wrote:

>> Hi,

>> Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level; it could help to find a regular pattern.
>> By the way, I see that you have set a global read repair chance:
>> read_repair_chance = 0.1
>> and not the local read repair:
>> dclocal_read_repair_chance = 0.0
>> Is there any reason to do that, or is it just the old (pre 2.0.9) default configuration?

>> The cell count is the number of triplets: (name, value, timestamp).

>> Also, I see that you have set sstable_size_in_mb to 50 MB. What is the rationale behind this? (Yes, I'm curious :-) ). Anyway, your "SSTables per read" are good.

>> Best,

>> Romain

>> On Mon, Sep 5, 2016 at 1:32 PM, Joseph Tech <jaalex.t...@gmail.com> wrote:

>> Hi Ryan,

>> Attached are the cfhistograms run within a few minutes of each other. On the surface, I don't see anything which indicates too much skewing (assuming skewing == keys spread across many SSTables). Please confirm. Related to this, what does the "cell count" metric indicate? I didn't find a clear explanation in the documents.

>> Thanks,
>> Joseph

>> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:

>> Have you looked at cfhistograms/tablehistograms? Your data may just be skewed (the most likely explanation is probably the correct one here).

>> Regards,

>> Ryan Svihla

>> _____________________________
>> From: Joseph Tech <jaalex.t...@gmail.com>
>> Sent: Wednesday, August 31, 2016 11:16 PM
>> Subject: Re: Read timeouts on primary key queries
>> To: <user@cassandra.apache.org>

>> Patrick,

>> The desc table is below (only col names changed):

>> CREATE TABLE db.tbl (
>>     id1 text,
>>     id2 text,
>>     id3 text,
>>     id4 text,
>>     f1 text,
>>     f2 map<text, text>,
>>     f3 map<text, text>,
>>     created timestamp,
>>     updated timestamp,
>>     PRIMARY KEY (id1, id2, id3, id4)
>> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>>     AND bloom_filter_fp_chance = 0.01
>>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>>     AND comment = ''
>>     AND compaction = {'sstable_size_in_mb': '50', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
>>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>     AND dclocal_read_repair_chance = 0.0
>>     AND default_time_to_live = 0
>>     AND gc_grace_seconds = 864000
>>     AND max_index_interval = 2048
>>     AND memtable_flush_period_in_ms = 0
>>     AND min_index_interval = 128
>>     AND read_repair_chance = 0.1
>>     AND speculative_retry = '99.0PERCENTILE';

>> and the query is: select * from tbl where id1=? and id2=? and id3=? and id4=?

>> The timeouts happen within ~2 s to ~5 s, while the successful calls have an avg of 8 ms and a p99 of 15 ms. These times are seen from the app side; the actual query times would be slightly lower.

>> Is there a way to capture traces only when queries take longer than a specified duration? We can't enable tracing in production given the volume of traffic. We see that the same query which timed out works fine later, so I am not sure if the trace of a successful run would help.
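>> One option might be to enable tracing for only a small random sample of requests and keep the result only when the traced query turns out slow; a rough, untested sketch with the Java driver (the sampling rate and threshold are placeholders):

>> import java.util.concurrent.ThreadLocalRandom;
>> import com.datastax.driver.core.QueryTrace;
>> import com.datastax.driver.core.ResultSet;
>> import com.datastax.driver.core.Session;
>> import com.datastax.driver.core.Statement;
>> import org.slf4j.Logger;
>> import org.slf4j.LoggerFactory;
>>
>> public class SampledTracing {
>>     private static final Logger LOG = LoggerFactory.getLogger(SampledTracing.class);
>>     private static final double TRACE_PROBABILITY = 0.001; // trace ~0.1% of requests
>>     private static final long SLOW_MICROS = 1_000_000;     // keep only traces slower than 1 s
>>
>>     public static ResultSet execute(Session session, Statement statement) {
>>         boolean sampled = ThreadLocalRandom.current().nextDouble() < TRACE_PROBABILITY;
>>         if (sampled) {
>>             statement.enableTracing(); // the server records the trace in system_traces
>>         }
>>         ResultSet rs = session.execute(statement);
>>         if (sampled) {
>>             // getQueryTrace() fetches the trace from system_traces, which costs extra
>>             // reads - hence the sampling. Log the id so the trace can be inspected later.
>>             QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
>>             if (trace != null && trace.getDurationMicros() > SLOW_MICROS) {
>>                 LOG.warn("Slow traced query {} took {} us, coordinator {}",
>>                         trace.getTraceId(), trace.getDurationMicros(), trace.getCoordinator());
>>             }
>>         }
>>         return rs;
>>     }
>> }

>> The per-query overhead is one random draw; the extra system_traces reads only happen for the sampled requests, which keeps the impact on production traffic small.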
>> Thanks,
>> Joseph

>> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:

>> If you are getting a timeout on one table, then a mismatch of RF and node count doesn't seem as likely.

>> Time to look at your query. You said it was a 'select * from table where key=?' type query. I would next use the trace facility in cqlsh to investigate further. That's a good way to find hard-to-find issues. You should be looking for a clear ledge where you go from single-digit ms to 4- or 5-digit ms times.

>> The other place to look is your data model for that table, if you want to post the output from a desc table.

>> Patrick

>> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <jaalex.t...@gmail.com> wrote:

>> On further analysis, this issue happens only on 1 table in the KS, which has the max reads.

>> @Atul, I will look at system health, but didn't see anything standing out from the GC logs (using JDK 1.8_92 with G1GC).

>> @Patrick, could you please elaborate on the "mismatch on node count + RF" part?

>> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <atul.sar...@snapdeal.com> wrote:

>> There could be many reasons for this if it is intermittent: CPU usage, I/O wait status. As reads are I/O intensive, your IOPS requirement should be met under that load. It could be a heap issue if the CPU is busy with GC only. Network health could also be the reason. So it's better to look at system health during the time when it happens.

>> ---------------------------------------------------------------------------------------------------------------
>> Atul Saroha
>> *Lead Software Engineer*
>> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>> Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA

>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <jaalex.t...@gmail.com> wrote:

>> Hi Patrick,

>> The nodetool status shows all nodes up and normal now. From the OpsCenter "Event Log", there are some nodes reported as being down/up etc. during the timeframe of the timeout, but these are Search workload nodes from the remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.

>> Thanks,
>> Joseph

>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:

>> You aren't achieving quorum on your reads, as the error explains. That means you either have some nodes down or your topology is not matching up. The fact that you are using LOCAL_QUORUM might point to a datacenter mismatch on node count + RF.

>> What does your nodetool status look like?

>> Patrick

>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <jaalex.t...@gmail.com> wrote:

>> Hi,

>> We recently started getting intermittent timeouts on primary key queries (select * from table where key=<key>).

>> The error is: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica responded)

>> The same query would work fine when tried directly from cqlsh. There are no indications in system.log for the table in question, though there were compactions in progress for tables in another keyspace which is more frequently accessed.

>> My understanding is that the chances of primary key queries timing out are very minimal. Please share the possible reasons / ways to debug this issue.

>> We are using Cassandra 2.1 (DSE 4.8.7).

>> Thanks,
>> Joseph

>> --
>> *Regards,*
>> *Anshu*