Hi Anuj,

Did you mean streaming_socket_timeout_in_ms? If not, then you definitely want that set. Even the best network connections will break occasionally, and in Cassandra < 2.1.10 (I believe) this would leave those connections hanging indefinitely on one end.
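
To illustrate why a socket timeout matters here, the following is a minimal Python sketch, not Cassandra code. A peer accepts a connection and then goes silent, and the reader only gets control back because it set a read timeout. The function names (silent_peer, read_with_timeout, demo) and the 0.5 second value are made up for this example; as far as I understand it, streaming_socket_timeout_in_ms plays the analogous role for Cassandra's streaming sockets.

```python
import socket
import threading


def silent_peer(listener: socket.socket, release: threading.Event) -> None:
    """Accept one connection and never send anything, like a peer that went away."""
    conn, _ = listener.accept()
    release.wait()  # hold the connection open without replying
    conn.close()


def read_with_timeout(port: int, timeout_s: float) -> str:
    """Try to read from the peer; give up after timeout_s instead of blocking forever."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.settimeout(timeout_s)  # without this, recv() would block indefinitely
        try:
            data = sock.recv(1024)
            return "data" if data else "closed"
        except socket.timeout:
            return "timed out"


def demo(timeout_s: float = 0.5) -> str:
    """Start a silent peer on an ephemeral local port and read from it with a timeout."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    release = threading.Event()
    worker = threading.Thread(target=silent_peer, args=(listener, release), daemon=True)
    worker.start()
    try:
        return read_with_timeout(port, timeout_s)
    finally:
        release.set()  # let the silent peer close its side
        worker.join(timeout=2)
        listener.close()


if __name__ == "__main__":
    print(demo())
```

Removing the settimeout() call makes the reader wait for as long as the peer stays silent, which is the "hanging indefinitely on one end" behaviour described above.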

How far apart are your two DCs from a network perspective, out of curiosity? You'll almost certainly want different TCP stack tuning for cross-DC traffic, notably buffer sizes and window parameters, plus Cassandra-specific settings such as otc_coalescing_strategy and inter_dc_tcp_nodelay.

On Sat, Nov 14, 2015 at 10:35 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

> One more observation. We observed that there are a few TCP connections which
> one node shows as Established, but when we check the node at the other end,
> the connection is not there. They are called "phantom" connections, I guess.
> Can this be a possible cause?
>
> Thanks
> Anuj
>
> ------------------------------
> From: "Anuj Wadehra" <anujw_2...@yahoo.co.in>
> Date: Sat, 14 Nov, 2015 at 11:59 pm
> Subject: Re: Repair Hangs while requesting Merkle Trees
>
> Thanks Daemeon !!
>
> I will capture the output of netstat and share it in the next few days. We
> were also thinking of taking tcp dumps. If it's a network issue and
> increasing the request timeout worked, I am not sure how Cassandra is
> dropping messages based on the timeout. Repair messages are non-droppable
> and not supposed to be timed out.
>
> 2 of the 3 nodes in the DC are able to complete repair without any issue.
> Just one node is problematic.
>
> I also observed frequent messages in the logs of other nodes which say that
> hints replay timed out, and the node where hints were being replayed is
> always a remote DC node. Is it related somehow?
>
> Thanks
> Anuj
>
> ------------------------------
> From: "daemeon reiydelle" <daeme...@gmail.com>
> Date: Thu, 12 Nov, 2015 at 10:34 am
> Subject: Re: Repair Hangs while requesting Merkle Trees
>
> Have you checked the network statistics on that machine (netstat -tas)
> while attempting to repair? If netstat shows ANY issues, you have a
> problem. Can you put the command in a loop running every 60 seconds for
> maybe 15 minutes and post back?
>
> Out of curiosity, how many remote DC nodes are getting successfully
> repaired?
>
> On Wed, Nov 11, 2015 at 1:06 PM, Anuj Wadehra <anujw_2...@yahoo.co.in>
> wrote:
>
>> Hi,
>>
>> We are using 2.0.14. We have 2 DCs at remote locations with 10GBps
>> connectivity. We are able to complete repair (-par -pr) on 5 nodes. On only
>> one node in DC2, we are unable to complete repair, as it always hangs. The
>> node sends Merkle Tree requests, but one or more nodes in DC1 (remote) never
>> show that they sent the Merkle Tree reply to the requesting node.
>> Repair hangs indefinitely.
>>
>> After increasing request_timeout_in_ms on the affected node, we were able to
>> successfully run repair on one of two occasions.
>>
>> Any comments on why this is happening on just one node? In
>> OutboundTcpConnection.java, the isTimeOut method always returns false for a
>> non-droppable verb such as a Merkle Tree request (verb=REPAIR_MESSAGE), so
>> why did increasing the request timeout solve the problem on one occasion?
>>
>> Thanks
>> Anuj Wadehra
>>
>> On Thursday, 12 November 2015 2:35 AM, Anuj Wadehra <
>> anujw_2...@yahoo.co.in> wrote:
>>
>> Hi,
>>
>> We have 2 DCs at remote locations with 10GBps connectivity. We are able to
>> complete repair (-par -pr) on 5 nodes. On only one node in DC2, we are
>> unable to complete repair, as it always hangs. The node sends Merkle Tree
>> requests, but one or more nodes in DC1 (remote) never show that they sent
>> the Merkle Tree reply to the requesting node.
>> Repair hangs indefinitely.
>>
>> After increasing request_timeout_in_ms on the affected node, we were able to
>> successfully run repair on one of two occasions.
>>
>> Any comments on why this is happening on just one node? In
>> OutboundTcpConnection.java, the isTimeOut method always returns false for a
>> non-droppable verb such as a Merkle Tree request (verb=REPAIR_MESSAGE), so
>> why did increasing the request timeout solve the problem on one occasion?
>>
>> Thanks
>> Anuj Wadehra