Garo,

No, we didn't notice any change in system load, just the expected spike in
packet counts.
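In case it's useful while you're tracing yours: the change on our side was a
single line in cassandra.yaml, which is read at startup (so plan on a rolling
restart for it to take effect). A minimal sketch of the fragment is below;
the comment lines are illustrative, not copied from the shipped file:

    # cassandra.yaml (2.2.x)
    # Coalescing strategies: TIMEHORIZON (the default we had been on),
    # MOVINGAVERAGE, FIXED, DISABLED.
    # (otc_coalescing_window_us is a separate knob and isn't touched here.)
    otc_coalescing_strategy: DISABLED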
Mike

On Wed, Jul 20, 2016 at 3:49 PM, Juho Mäkinen <juho.maki...@gmail.com> wrote:

> Just to pick this up: did you see any system load spikes? I'm tracing a
> problem on 2.2.7 where my cluster sees load spikes up to 20-30, when the
> normal average load is around 3-4. So far I haven't found any good reason,
> but I'm going to try otc_coalescing_strategy: disabled tomorrow.
>
> - Garo
>
> On Fri, Jul 15, 2016 at 6:16 PM, Mike Heffner <m...@librato.com> wrote:
>
>> Just to follow up on this post with a couple more data points:
>>
>> 1) We upgraded to 2.2.7 and did not see any change in behavior.
>>
>> 2) However, what *has* fixed this issue for us was disabling message
>> coalescing by setting:
>>
>> otc_coalescing_strategy: DISABLED
>>
>> We were using the default setting before (TIMEHORIZON, I believe).
>>
>> We still see periodic timeouts on the ring (once every few hours), but
>> they are brief and don't impact latency. With message coalescing turned
>> on, we would see these timeouts persist consistently after an initial
>> spike. My guess is that something in the coalescing logic is disturbed by
>> the initial timeout spike, which leads to dropping all, or a high
>> percentage of, subsequent traffic.
>>
>> We are planning to continue production use with message coalescing
>> disabled for now, and may run tests in our staging environments to
>> identify where the coalescing breaks down.
>>
>> Mike
>>
>> On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner <m...@librato.com> wrote:
>>
>>> Jeff,
>>>
>>> Thanks, yeah, we updated to the 2.16.4 driver version from source. I
>>> don't believe we've hit the bugs mentioned in earlier driver versions.
>>>
>>> Mike
>>>
>>> On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
>>> wrote:
>>>
>>>> The AWS Ubuntu 14.04 AMI ships with a buggy enhanced networking
>>>> driver – depending on your instance types / hypervisor choice, you may
>>>> want to ensure you're not seeing that bug.
>>>>
>>>> From: Mike Heffner <m...@librato.com>
>>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>> Date: Friday, July 1, 2016 at 1:10 PM
>>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>> Cc: Peter Norton <p...@librato.com>
>>>> Subject: Re: Ring connection timeouts with 2.2.6
>>>>
>>>> Jens,
>>>>
>>>> We haven't noticed any particularly large GC operations or even
>>>> persistently high GC times.
>>>>
>>>> Mike
>>>>
>>>> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <jens.ran...@tink.se>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Could it be garbage collection occurring on nodes that are more heavily
>>>> loaded?
>>>>
>>>> Cheers,
>>>> Jens
>>>>
>>>> On Sun, Jun 26, 2016 at 05:22, Mike Heffner <m...@librato.com> wrote:
>>>>
>>>> One thing to add: if we do a rolling restart of the ring, the timeouts
>>>> disappear entirely for several hours and performance returns to normal.
>>>> It's as if something is leaking over time, but we haven't seen any
>>>> noticeable change in heap.
>>>>
>>>> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <m...@librato.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We have a 12-node 2.2.6 ring running in AWS, single DC with RF=3, that
>>>> is sitting at <25% CPU, doing mostly writes, and not showing any
>>>> particularly long GC times/pauses. By all observed metrics the ring is
>>>> healthy and performing well.
>>>>
>>>> However, we are noticing a pretty consistent number of connection
>>>> timeouts coming from the messaging service between various pairs of
>>>> nodes in the ring. The "Connection.TotalTimeouts" meter metric shows
>>>> hundreds of thousands of timeouts per minute, usually between two pairs
>>>> of nodes for several hours at a time; it then may stop or move to other
>>>> pairs of nodes in the ring. The metric
>>>> "Connection.SmallMessageDroppedTasks.<ip>" will also grow for one pair
>>>> of the nodes in the TotalTimeouts metric.
>>>>
>>>> Looking at the debug log typically shows a large number of messages
>>>> like the following on one of the nodes:
>>>>
>>>> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)
>>>>
>>>> We have cross-node timeouts enabled, but ntp is running on all nodes
>>>> and no node appears to have time drift.
>>>>
>>>> The network appears to be fine between nodes, with iperf tests showing
>>>> that we have a lot of headroom.
>>>>
>>>> Any thoughts on what to look for? Can we increase thread count/pool
>>>> sizes for the messaging service?
>>>>
>>>> Thanks,
>>>>
>>>> Mike
>>>>
>>>> --
>>>> Mike Heffner <m...@librato.com>
>>>> Librato, Inc.
>>>>
>>>> --
>>>> Mike Heffner <m...@librato.com>
>>>> Librato, Inc.
>>>>
>>>> --
>>>> Jens Rantil
>>>> Backend Developer @ Tink
>>>>
>>>> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
>>>> For urgent matters you can reach me at +46-708-84 18 32.
>>>>
>>>> --
>>>> Mike Heffner <m...@librato.com>
>>>> Librato, Inc.
>>>
>>> --
>>> Mike Heffner <m...@librato.com>
>>> Librato, Inc.
>>
>> --
>> Mike Heffner <m...@librato.com>
>> Librato, Inc.

--
Mike Heffner <m...@librato.com>
Librato, Inc.
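A note for anyone watching the same counters: the Connection.TotalTimeouts
meter referenced earlier in the thread is exposed over JMX (port 7199 by
default, and localhost-only unless remote JMX has been enabled), so it can be
spot-checked without a full metrics pipeline. Below is a minimal sketch; the
class name is made up for this example and the MBean name is assumed from the
2.2-era metric naming, so verify both against your build with jconsole first.

    // PollTotalTimeouts.java -- minimal JMX spot-check for the
    // Connection.TotalTimeouts meter. Run it on (or tunneled to) the node.
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class PollTotalTimeouts {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
                // Assumed MBean name; confirm it in jconsole on your version.
                ObjectName timeouts = new ObjectName(
                        "org.apache.cassandra.metrics:type=Connection,name=TotalTimeouts");
                // A meter exposes a cumulative Count plus decaying rates.
                System.out.println(host
                        + " TotalTimeouts count=" + mbsc.getAttribute(timeouts, "Count")
                        + " 1m-rate=" + mbsc.getAttribute(timeouts, "OneMinuteRate"));
            }
        }
    }

Sampling this once a minute and diffing the cumulative count gives the same
picture as the per-minute rate described above. The per-peer metrics mentioned
in the thread (e.g. Connection.SmallMessageDroppedTasks.<ip>) carry the peer
address in the MBean name, so the same client can be pointed at those once the
exact names are confirmed in jconsole.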