Apologies for the typo, Mike

On Wed, Sep 20, 2017 at 9:49 AM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
> Hello Nike,
>
> were you able to fix the issue? If so, what change helped you?
>
> On Wed, Feb 24, 2016 at 5:36 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
>> Great that you found a specific release that triggers the problem - 2.1.x has a huge number of changes.
>>
>> How many partitions and rows do you have? What's the largest row count for a single partition? And all of these CQL tables are COMPACT STORAGE, correct? Are you writing a large number of skinny partitions or a smaller number of very wide partitions? It wouldn't surprise me if behavior for large partitions varies between releases since they can be so memory-intensive.
>>
>> I see this change in 2.1.5 that could possibly introduce some memory usage:
>> Write partition size estimates into a system table (CASSANDRA-7688)
>>
>> At this stage it would probably help for you to try to produce a reasonably small repro test case that you could file as a Jira. And if you could run that repro test case on 3.x to verify that the problem still exists, that would be helpful as well.
>>
>> How long does it take to repro the timeout?
>>
>> Can you repro the timeout using a single node?
>>
>> What is the pattern of the timeouts - just random and occasional, or heavy and continuous once they start?
>>
>> Are they occurring uniformly on all three nodes?
>>
>> If you bounce the cluster and continue testing, do the timeouts commence immediately, fairly soon, or only after about as long as they take from a clean fresh start?
>>
>> -- Jack Krupansky
>>
>> On Wed, Feb 24, 2016 at 7:04 PM, Mike Heffner <m...@librato.com> wrote:
>>
>>> Nate,
>>>
>>> So we have run several install tests, bisecting the 2.1.x release line, and we believe that the regression was introduced in version 2.1.5. This is the first release that clearly hits the timeout for us.
>>>
>>> It looks like quite a large release, so our next step will likely be bisecting the major commits to see if we can narrow it down: https://github.com/apache/cassandra/blob/3c0a337ebc90b0d99349d0aa152c92b5b3494d8c/CHANGES.txt. Obviously, any suggestions on potential suspects are appreciated.
>>>
>>> These are the memtable settings we've configured differently from the defaults during our testing:
>>>
>>> memtable_allocation_type: offheap_objects
>>> memtable_flush_writers: 8
>>>
>>> Cheers,
>>>
>>> Mike
>>>
>>> On Fri, Feb 19, 2016 at 1:46 PM, Nate McCall <n...@thelastpickle.com> wrote:
>>>
>>>> The biggest change which *might* explain your behavior has to do with the changes in memtable flushing between 2.0 and 2.1: https://issues.apache.org/jira/browse/CASSANDRA-5549
>>>>
>>>> However, the tpstats you posted shows no dropped mutations, which would otherwise have made me more certain of this as the cause.
>>>>
>>>> What values do you have right now for each of these (my recommendations for each on a c4.2xl with the stock cassandra-env.sh are in parentheses):
>>>>
>>>> - memtable_flush_writers (2)
>>>> - memtable_heap_space_in_mb (2048)
>>>> - memtable_offheap_space_in_mb (2048)
>>>> - memtable_cleanup_threshold (0.11)
>>>> - memtable_allocation_type (offheap_objects)
>>>>
>>>> The biggest win IMO will be moving to offheap_objects. By default, everything is on heap. Regardless, spending some time tuning these for your workload will pay off.
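For reference, Nate's suggested values above translate into a cassandra.yaml fragment roughly like the following. These are his c4.2xl numbers quoted from the list above, not general-purpose defaults, so re-check them against your own heap size and workload:

    # Memtable tuning discussed above; values are Nate's c4.2xl suggestions
    memtable_allocation_type: offheap_objects
    memtable_flush_writers: 2
    memtable_heap_space_in_mb: 2048
    memtable_offheap_space_in_mb: 2048
    memtable_cleanup_threshold: 0.11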
>>>> You may also want to be explicit about
>>>>
>>>> - native_transport_max_concurrent_connections
>>>> - native_transport_max_concurrent_connections_per_ip
>>>>
>>>> Depending on the driver, these may now be allowing 32k streams per connection(!) as detailed in v3 of the native protocol: https://github.com/apache/cassandra/blob/cassandra-2.1/doc/native_protocol_v3.spec#L130-L152
>>>>
>>>> On Fri, Feb 19, 2016 at 8:48 AM, Mike Heffner <m...@librato.com> wrote:
>>>>
>>>>> Anuj,
>>>>>
>>>>> So we originally started testing with Java 8 + G1; however, we were able to reproduce the same results with the default CMS settings that ship in the cassandra-env.sh from the Deb pkg. We didn't detect any large GC pauses during the runs.
>>>>>
>>>>> The query pattern during our testing was 100% writes, batching (via Thrift mostly) to 5 tables, between 6-1500 rows per batch.
>>>>>
>>>>> Mike
>>>>>
>>>>> On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:
>>>>>
>>>>>> What's the GC overhead? Can you share your GC collector and settings?
>>>>>>
>>>>>> What's your query pattern? Do you use secondary indexes, batches, IN clause, etc.?
>>>>>>
>>>>>> Anuj
>>>>>>
>>>>>> Sent from Yahoo Mail on Android <https://overview.mail.yahoo.com/mobile/?.src=Android>
>>>>>>
>>>>>> On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner <m...@librato.com> wrote:
>>>>>> Alain,
>>>>>>
>>>>>> Thanks for the suggestions.
>>>>>>
>>>>>> Sure, tpstats are here: https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the metrics across the ring, there were no blocked tasks nor dropped messages.
>>>>>>
>>>>>> Iowait metrics look fine, so it doesn't appear to be blocking on disk. Similarly, there are no long GC pauses.
>>>>>>
>>>>>> We haven't noticed latency on any particular table higher than others, or correlated with the occurrence of a timeout. We have noticed with further testing that running cassandra-stress against the ring, while our workload is writing to the same ring, will incur similar 10 second timeouts. If our workload is not writing to the ring, cassandra-stress will run without hitting timeouts. This seems to imply that our workload pattern is causing something to block cluster-wide, since the stress tool writes to a different keyspace than our workload.
>>>>>>
>>>>>> I mentioned in another reply that we've tracked it to something between 2.0.x and 2.1.x, so we are focusing on narrowing down which point release it was introduced in.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Mike,
>>>>>>>
>>>>>>> What about the output of tpstats? I imagine you have dropped messages there. Any blocked threads? Could you paste this output here?
>>>>>>>
>>>>>>> Might this be due to some network hiccup to access the disks, as they are EBS? Can you think of any way of checking this? Do you have a lot of GC logs, and how long are the pauses (use something like: grep -i 'GCInspector' /var/log/cassandra/system.log)?
>>>>>>>
>>>>>>> Something else you could check are the local_writes stats, to see if only one table is affected or whether this is keyspace / cluster wide.
>>>>>>> You can use metrics exposed by Cassandra, or if you have no dashboards I believe a 'nodetool cfstats <myks> | grep -e 'Table:' -e 'Local'' should give you a rough idea of local latencies.
>>>>>>>
>>>>>>> Those are just things I would check; I don't have a clue about what is happening here, but I hope this will help.
>>>>>>>
>>>>>>> C*heers,
>>>>>>> -----------------
>>>>>>> Alain Rodriguez
>>>>>>> France
>>>>>>>
>>>>>>> The Last Pickle
>>>>>>> http://www.thelastpickle.com
>>>>>>>
>>>>>>> 2016-02-18 5:13 GMT+01:00 Mike Heffner <m...@librato.com>:
>>>>>>>
>>>>>>>> Jaydeep,
>>>>>>>>
>>>>>>>> No, we don't use any lightweight transactions.
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Are you guys using lightweight transactions in your write path?
>>>>>>>>>
>>>>>>>>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat <fabrice.faco...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Are your commitlog and data on the same disk? If yes, you should put the commitlogs on a separate disk which doesn't have a lot of IO.
>>>>>>>>>>
>>>>>>>>>> Other IO may have a great impact on your commitlog writing, and it may even block.
>>>>>>>>>>
>>>>>>>>>> An example of the impact IO may have, even for async writes: https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
>>>>>>>>>>
>>>>>>>>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner <m...@librato.com>:
>>>>>>>>>> > Jeff,
>>>>>>>>>> >
>>>>>>>>>> > We have both commitlog and data on a 4TB EBS with 10k IOPS.
>>>>>>>>>> >
>>>>>>>>>> > Mike
>>>>>>>>>> >
>>>>>>>>>> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> What disk size are you using?
>>>>>>>>>> >>
>>>>>>>>>> >> From: Mike Heffner
>>>>>>>>>> >> Reply-To: "user@cassandra.apache.org"
>>>>>>>>>> >> Date: Wednesday, February 10, 2016 at 2:24 PM
>>>>>>>>>> >> To: "user@cassandra.apache.org"
>>>>>>>>>> >> Cc: Peter Norton
>>>>>>>>>> >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5
>>>>>>>>>> >>
>>>>>>>>>> >> Paulo,
>>>>>>>>>> >>
>>>>>>>>>> >> Thanks for the suggestion, we ran some tests against CMS and saw the same timeouts. On that note though, we are going to try doubling the instance sizes and testing with double the heap (even though current usage is low).
>>>>>>>>>> >>
>>>>>>>>>> >> Mike
>>>>>>>>>> >>
>>>>>>>>>> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <pauloricard...@gmail.com> wrote:
>>>>>>>>>> >>>
>>>>>>>>>> >>> Are you using the same GC settings as the staging 2.0 cluster? If not, could you try using the default GC settings (CMS) and see if that changes anything? This is just a wild guess, but there were reports before of G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403 for more context). Please ignore if you already tried reverting back to CMS.
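As an aside, Fabrice's separate-disk suggestion above comes down to pointing commitlog_directory at a different volume than data_file_directories in cassandra.yaml. A minimal sketch, with made-up mount points:

    # Hypothetical layout: commitlog isolated from sstable IO
    data_file_directories:
        - /mnt/ebs-data/cassandra/data            # gp2 volume for sstables
    commitlog_directory: /mnt/ebs-commitlog/cassandra/commitlog   # separate, quieter volume

Mike mentions above that commitlog and data currently share one 4TB volume, which is exactly the contention scenario Fabrice describes.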
>>>>>>>>>> >>>
>>>>>>>>>> >>> 2016-02-10 16:51 GMT-03:00 Mike Heffner <m...@librato.com>:
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Hi all,
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> We've recently embarked on a project to update our Cassandra infrastructure running on EC2. We are long time users of 2.0.x and are testing out a move to version 2.2.5 running on VPC with EBS. Our test setup is a 3 node, RF=3 cluster supporting a small write load (a mirror of our staging load).
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> We are writing at QUORUM, and while p95s look good compared to our staging 2.0.x cluster, we are seeing frequent write operations that time out at the max write_request_timeout_in_ms (10 seconds). CPU across the cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running with the Oracle JDK 8u60, we're using G1GC, and any GC pauses are less than 500ms.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> We run on c4.2xl instances with GP2 EBS attached storage for data and commitlog directories. The nodes are using EC2 enhanced networking and have the latest Intel network driver module. We are running on HVM instances using Ubuntu 14.04.2.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar to the definition here: https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> This is our cassandra.yaml: https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Like I mentioned, we use 8u60 with G1GC and have used many of the GC settings in Al Tobey's tuning guide. This is our upstart config with JVM and other CPU settings: https://gist.github.com/mheffner/dc44613620b25c4fa46d
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> We've used several of the sysctl settings from Al's guide as well: https://gist.github.com/mheffner/ea40d58f58a517028152
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Our client application is able to write using either Thrift batches via the Astyanax driver or CQL async INSERTs using the DataStax Java driver.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> For testing against Thrift (our legacy infra uses this), we write batches of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is around 45ms, and our maximum (p100) sits below 150ms except when it periodically spikes to the full 10 seconds.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Testing the same write path using CQL writes instead demonstrates similar behavior: low p99s except for periodic full timeouts.
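For readers skimming the thread: the 10-second ceiling Mike describes is the coordinator-side write timeout from his cassandra.yaml, so latencies pinned at exactly that value suggest requests are stalling until the coordinator gives up rather than failing fast. From the figure quoted in the message, that setting would look like:

    # Write timeout referenced above (10 seconds); well above the stock 2000 ms default
    write_request_timeout_in_ms: 10000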
>>>>>>>>>> >>>> We enabled tracing for several operations but were unable to get a trace that completed successfully -- Cassandra started logging many messages like:
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross node timeout
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> And all the traces contained rows with a "null" source_elapsed row: https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> We've exhausted as many configuration option permutations as we can think of. This cluster does not appear to be under any significant load, and latencies seem to largely fall in two bands: low normal or max timeout. This seems to imply that something is getting stuck and timing out at the max write timeout.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Any suggestions on what to look for? We had debug enabled for a while but we didn't see any message that pointed to something obvious. Happy to provide any more information that may help.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> We are pretty much at the point of sprinkling debug around the code to track down what could be blocking.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Thanks,
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Mike
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> --
>>>>>>>>>> >>>> Mike Heffner <m...@librato.com>
>>>>>>>>>> >>>> Librato, Inc.
>>>>>>>>>> >>
>>>>>>>>>> >> --
>>>>>>>>>> >> Mike Heffner <m...@librato.com>
>>>>>>>>>> >> Librato, Inc.
>>>>>>>>>> >
>>>>>>>>>> > --
>>>>>>>>>> > Mike Heffner <m...@librato.com>
>>>>>>>>>> > Librato, Inc.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Close the World, Open the Net
>>>>>>>>>> http://www.linux-wizard.net
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mike Heffner <m...@librato.com>
>>>>>>>> Librato, Inc.
>>>>>>
>>>>>> --
>>>>>> Mike Heffner <m...@librato.com>
>>>>>> Librato, Inc.
>>>>>
>>>>> --
>>>>> Mike Heffner <m...@librato.com>
>>>>> Librato, Inc.
>>>>
>>>> --
>>>> -----------------
>>>> Nate McCall
>>>> Austin, TX
>>>> @zznate
>>>>
>>>> Co-Founder & Sr. Technical Consultant
>>>> Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
>>>
>>> --
>>> Mike Heffner <m...@librato.com>
>>> Librato, Inc.