Jeff,

Thanks, yeah, we updated to the 2.16.4 driver version from source. I don't
believe we've hit the bugs mentioned in earlier driver versions.
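For reference, a quick way to confirm which enhanced networking (ixgbevf) driver version is actually loaded on a node is something like the sketch below. It's only a rough check: it assumes a Linux host with modinfo and ethtool available, and that the primary interface is eth0.

```python
#!/usr/bin/env python
"""Report the ixgbevf (AWS enhanced networking) driver version on this host.

Minimal sketch: assumes a Linux host with modinfo and ethtool installed,
and that the primary interface is eth0.
"""
import subprocess


def run(cmd):
    # Return command output, or an empty string if the command fails.
    try:
        return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode()
    except (OSError, subprocess.CalledProcessError):
        return ""


def main():
    # Version of the ixgbevf module installed on disk.
    for line in run(["modinfo", "ixgbevf"]).splitlines():
        if line.startswith("version:"):
            print("modinfo ixgbevf " + line.strip())

    # Driver and version actually bound to the interface (assumed eth0).
    for line in run(["ethtool", "-i", "eth0"]).splitlines():
        if line.startswith(("driver:", "version:")):
            print("ethtool eth0 " + line.strip())


if __name__ == "__main__":
    main()
```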
Mike

On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:

> The AWS Ubuntu 14.04 AMI ships with a buggy enhanced networking driver –
> depending on your instance types / hypervisor choice, you may want to
> ensure you're not seeing that bug.
>
> *From: *Mike Heffner <m...@librato.com>
> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Date: *Friday, July 1, 2016 at 1:10 PM
> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Cc: *Peter Norton <p...@librato.com>
> *Subject: *Re: Ring connection timeouts with 2.2.6
>
> Jens,
>
> We haven't noticed any particularly large GC operations or even
> persistently high GC times.
>
> Mike
>
> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <jens.ran...@tink.se> wrote:
>
> Hi,
>
> Could it be garbage collection occurring on nodes that are more heavily
> loaded?
>
> Cheers,
> Jens
>
> On Sun, Jun 26, 2016 at 05:22, Mike Heffner <m...@librato.com> wrote:
>
> One thing to add: if we do a rolling restart of the ring, the timeouts
> disappear entirely for several hours and performance returns to normal.
> It's as if something is leaking over time, but we haven't seen any
> noticeable change in heap.
>
> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <m...@librato.com> wrote:
>
> Hi,
>
> We have a 12-node 2.2.6 ring running in AWS, single DC with RF=3, that is
> sitting at <25% CPU, doing mostly writes, and not showing any particularly
> long GC times/pauses. By all observed metrics the ring is healthy and
> performing well.
>
> However, we are noticing a fairly consistent number of connection timeouts
> coming from the messaging service between various pairs of nodes in the
> ring. The "Connection.TotalTimeouts" meter metric shows 100k's of timeouts
> per minute, usually between two pairs of nodes. It seems to occur for
> several hours at a time, then may stop or move to other pairs of nodes in
> the ring. The metric "Connection.SmallMessageDroppedTasks.<ip>" will also
> grow for one pair of the nodes in the TotalTimeouts metric.
>
> Looking at the debug log typically shows a large number of messages like
> the following on one of the nodes:
>
> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)
>
> We have cross-node timeouts enabled, but ntp is running on all nodes and
> no node appears to have time drift.
>
> The network appears to be fine between the nodes, with iperf tests showing
> that we have a lot of headroom.
>
> Any thoughts on what to look for? Can we increase thread count/pool sizes
> for the messaging service?
>
> Thanks,
>
> Mike
>
> --
> Mike Heffner <m...@librato.com>
> Librato, Inc.
>
> --
> Mike Heffner <m...@librato.com>
> Librato, Inc.
>
> --
> Jens Rantil
> Backend Developer @ Tink
>
> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
> For urgent matters you can reach me at +46-708-84 18 32.
>
> --
> Mike Heffner <m...@librato.com>
> Librato, Inc.

--
Mike Heffner <m...@librato.com>
Librato, Inc.
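(For anyone hitting similar symptoms: a quick way to see which peers are accumulating the "Skipped writing hint" messages quoted above is to tally them out of the debug log. A rough sketch follows; the path /var/log/cassandra/debug.log is an assumed default location, and the match pattern is based only on the log line quoted in this thread.)

```python
#!/usr/bin/env python
"""Tally 'Skipped writing hint for /<ip>' debug-log lines per peer.

Rough sketch for spotting which peers a node is dropping hints for;
the log path and message format are assumptions based on the log line
quoted in this thread.
"""
import collections
import re
import sys

LOG_PATH = "/var/log/cassandra/debug.log"  # assumed default location
HINT_RE = re.compile(r"Skipped writing hint for /(\d+\.\d+\.\d+\.\d+)")


def main(path):
    counts = collections.Counter()
    with open(path) as log:
        for line in log:
            match = HINT_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    # Print peers with the most skipped hints first.
    for peer, count in counts.most_common():
        print("%-15s %d" % (peer, count))


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else LOG_PATH)
```

Saved under any name (e.g. hint_skips.py), it can be pointed at a specific log file: `python hint_skips.py /var/log/cassandra/debug.log`.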