RE: Trying to find the problem with a broken pipe

Tim Snyder Tue, 09 Aug 2011 05:29:41 -0700

Anthony,

All I can say is that the steps I outlined below fixed my problem and
allowed me to process 60 million rows of data.

Tim



-------- Original Message --------
Subject: Re: Trying to find the problem with a broken pipe
From: Anthony Ikeda <anthony.ikeda....@gmail.com>
Date: Tue, August 09, 2011 3:11 am
To: user@cassandra.apache.org

Tim do you know if this is the actual reason that is causing the broken
pipe? I'm having a hard time convincing my team that modifying this
value will fix the issue.

Jonathan, do you know if there is a valid explanation on why Tim no
longer has the problem based on this change?


Anthony Ikeda




On Thu, Aug 4, 2011 at 11:13 PM, Tim Snyder <t...@proinnovations.com>
wrote:
 I no longer get the error on the loader program. The steps I took to
fix
 it are increasing the thrift_max_message_length_in_mb msg length,
 stopping cassandra, blowing away the prior data store, and then
 restarting cassandra.
 
 Tim
 
 
 -------- Original Message --------
 Subject: Re: Trying to find the problem with a broken pipe
 

From: aaron morton <aa...@thelastpickle.com>
 Date: Fri, August 05, 2011 12:58 am
 To: user@cassandra.apache.org
 
 It's probably a network thing.
 
 The only thing I can think of in cassandra is
 thrift_max_message_length_in_mb in the config. That config setting will
 result in a TException thrown on the server side (i think), not sure if
 that makes the server kill the socket. I would hope the error returns
to
 the client.
 
 Perhaps check the server log.
 
 Cheers
 
 -----------------
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 4 Aug 2011, at 23:05, Tim Snyder wrote:
 
 > I am getting the same problem (Broken Pipe) on a loader program,
after
 > about 8 million read, write pairs. I am pushing serialized objects
into
 > a column with the program, the object it seems to be doing it on is
much
 > larger than the prior objects, so I am wondering if it is possibly a
 > column size streaming issue through the thrift api? I am using
Cassandra
 > 0.8.0 and Hector 0.8.0-1
 >
 > Tim
 >
 > -------- Original Message --------
 > Subject: Re: Trying to find the problem with a broken pipe
 > From: Anthony Ikeda <anthony.ikeda....@gmail.com>
 > Date: Tue, August 02, 2011 10:43 pm
 > To: user@cassandra.apache.org
 >
 >> Very interesting. After the second host goes down do you see
 >> "me.prettyprint.hector.api.exceptions.HectorException: All host
pools
 >> marked down. Retry burden pushed out to client"?
 >
 > No, the last message is:
 > 2011-08-02 08:43:06,561 INFO
 > [me.prettyprint.cassandra.connection.HConnectionManager] - Client
 > CassandraClient<cassandradevrk2:9393-49> released to inactive or dead
 > pool. Closing.> Does your client recover after a period of time?
 >
 >
 >
 > The application seems to be fine for now but my concern is the
 > connection pooling as well - I mean do we have one pool or multiple?
 > I'll post to the Hector user group about the pooling because the
 > incident seems so isolated. We also have our infrastructure team
looking
 > into the communication between the application server and the
cassandra
 > nodes.
 >
 >
 > So far it's still a mystery.
 >
 >
 >
 >
 >
 > On Tue, Aug 2, 2011 at 1:25 PM, Jim Ancona <j...@anconafamily.com>
wrote:
 > On Tue, Aug 2, 2011 at 6:13 PM, Anthony Ikeda
 > <anthony.ikeda....@gmail.com> wrote:
 >
 >> The link (which I may be misreading)
 >> is
 >
http://groups.google.com/group/hector-users/browse_thread/thread/8d7004b6f85a0f2e
 >
 >
 > I hadn't found that one, but I doubt that our issue is related to
that.
 >
 >
 >> It's only started happening today and happened on 2 occassions (8:43
 > and
 >> 10:21) performing the same function (querying a column family).
 >> It seems to be trying to access a connection on one of the servers
 >> The client accesses the first node:
 >>
 >> 2011-08-02 08:43:06,541 ERROR
 >> [me.prettyprint.cassandra.connection.HThriftClient] - Could not
flush
 >> transport (to be expected if the pool is shutting down) in close for
 > client:
 >> CassandraClient<cassandradevrk1:9393-33>
 >> org.apache.thrift.transport.TTransportException:
 > java.net.SocketException:
 >> Broken pipe
 >>
 >> ...
 >> 2011-08-02 08:43:06,544 WARN
 >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not
 >> fullfill request on this host
 > CassandraClient<cassandradevrk1:9393-33>
 >>
 >> ...
 >>
 >> 2011-08-02 08:43:06,543 ERROR
 >> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST
 > AS DOWN
 >> TRIGGERED for host cassandradevrk1(10.130.202.34):9393
 >> 2011-08-02 08:43:06,543 ERROR
 >> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool
state
 > on
 >> shutdown:
 >>
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393};
 >> IsActive?: true; Active: 1; Blocked: 0; Idle: 15;
NumBeforeExhausted:
 > 49
 >> 2011-08-02 08:43:06,543 ERROR
 >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
 > Shutdown
 >> triggered on
 >>
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
 >> 2011-08-02 08:43:06,544 ERROR
 >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
 > Shutdown
 >> complete on
 >>
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
 >> 2011-08-02 08:43:06,544 INFO
 >> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
 > Host
 >> detected as down was added to retry queue:
 >> cassandradevrk1(10.130.202.34):9393
 >> 2011-08-02 08:43:06,544 WARN
 >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not
 >> fullfill request on this host CassandraClient<
 >> cassandradevrk1:9393-33>
 >> 2011-08-02 08:43:06,544 WARN
 >> [me.prettyprint.cassandra.connection.HConnectionManager] -
Exception:
 >> me.prettyprint.hector.api.exceptions.HectorTransportException:
 >> org.apache.thrift.transport.TTransportException:
 > java.net.SocketException:
 >> Connection reset
 >>
 >>
 >> Then it appears to try the second node and fails:
 >>
 >> 2011-08-02 08:43:06,556 INFO
 >> [me.prettyprint.cassandra.connection.HConnectionManager] - Client
 >> CassandraClient<cassandradevrk1:9393-33> released to inactive or
dead
 > pool.
 >> Closing.
 >> 2011-08-02 08:43:06,557 ERROR
 >> [me.prettyprint.cassandra.connection.HThriftClient] - Could not
flush
 >> transport (to be expected if the pool is shutting down) in close for
 > client:
 >> CassandraClient<cassandradevrk2:9393-49>
 >>
 >> org.apache.thrift.transport.TTransportException:
 > java.net.SocketException:
 >> Broken pipe
 >>
 >> 2011-08-02 08:43:06,558 ERROR
 >> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST
 > AS DOWN
 >> TRIGGERED for host cassandradevrk2(10.130.202.35):9393
 >> 2011-08-02 08:43:06,559 ERROR
 >> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool
state
 > on
 >> shutdown:
 >>
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393};
 >> IsActive?: true; Active: 1; Blocked: 0; Idle: 15;
NumBeforeExhausted:
 > 49
 >> 2011-08-02 08:43:06,559 ERROR
 >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
 > Shutdown
 >> triggered on
 >>
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393}
 >> 2011-08-02 08:43:06,559 ERROR
 >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
 > Shutdown
 >> complete on
 >>
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393}
 >> 2011-08-02 08:43:06,559 INFO
 >> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
 > Host
 >> detected as down was added to retry queue:
 >> cassandradevrk2(10.130.202.35):9393
 >> 2011-08-02 08:43:06,560 WARN
 >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not
 >> fullfill request on this host
 > CassandraClient<cassandradevrk2:9393-49>
 >> 2011-08-02 08:43:06,560 WARN
 >> [me.prettyprint.cassandra.connection.HConnectionManager] -
Exception:
 >> me.prettyprint.hector.api.exceptions.HectorTransportException:
 >> org.apache.thrift.transport.TTransportException:
 > java.net.SocketException:
 >> Connection reset
 >
 >
 >
 > Very interesting. After the second host goes down do you see
 > "me.prettyprint.hector.api.exceptions.HectorException: All host pools
 > marked down. Retry burden pushed out to client"?
 >
 > Does your client recover after a period of time?
 >
 >>
 >> The process is the same at 10:21.
 >> Are the exceptions related to any external events (e.g. node
 >> restarts, network issues...)?
 >> Not that I'm aware, unless there are firewall timeouts between the
 >> application and the node servers. Let me find out. The cassandra log
 > files
 >> have no errors reported.
 >> What versions of Hector and Cassandra are you running?
 >> Cassandra 0.8.1, Hector 0.8.0-1
 >
 >
 > Our issue is occurring with Cassandra 0.7.8 and Hector 0.7-30. We
plan
 > to deploy Hector 0.7-31 this week and to turn on useSocketKeepalive.
 > Are you using that? We're also using tcpdump to capture packets when
 > failures occur to see if there are anomalies in the network traffic.
 >
 > Jim
 >
 >
 >
 >>
 >>
 >>
 >> On Tue, Aug 2, 2011 at 10:37 AM, Jim Ancona <j...@anconafamily.com>
 > wrote:
 >>>
 >>> On Tue, Aug 2, 2011 at 4:36 PM, Anthony Ikeda
 >>> <anthony.ikeda....@gmail.com> wrote:
 >>>> I'm not sure if this is a problem with Hector or with Cassandra.
 >>>> We seem to be seeing broken pipe issues with our connections on
 > the
 >>>> client
 >>>> side (Exception below). A bit of googling finds possibly a problem
 > with
 >>>> the
 >>>> amount of data we are trying to store, although I'm certain our
 > datasets
 >>>> are
 >>>> not all that large.
 >>>
 >>> I'm not sure what you're referring to here. Large requests could
 > lead
 >>> to timeouts, but that's not what you're seeing here. Could you link
 > to
 >>> the page you're referencing?
 >>>
 >>>> A nodetool ring command doesn't seem to present any downed nodes:
 >>>> Address DC Rack Status State Load
 >>>> Owns
 >>>> Token
 >>>>
 >>>> 153951716904446304929228999025275230571
 >>>> 10.130.202.34 datacenter1 rack1 Up Normal 470.74 KB
 >>>> 79.19% 118538200848404459763384037192174096102
 >>>> 10.130.202.35 datacenter1 rack1 Up Normal 483.63 KB
 >>>> 20.81% 153951716904446304929228999025275230571
 >>>>
 >>>> There are no errors in the cassandra server logs.
 >>>>
 >>>> Are there any particular timeouts on connections that we need to
 > be
 >>>> aware
 >>>> of? Or perhaps configure on the Cassandra nodes? Is this purely
 > and
 >>>> issue
 >>>> with the Hector API configuration?
 >>>
 >>> There is a server side timeout (rpc_timeout_in_ms in
cassandra.yaml)
 >>> and a Hector client-side timeout
 >>> (CassandraHostConfigurator.cassandraThriftSocketTimeout). But
again,
 >>> the "Broken pipe" error is not a timeout, it indicates that
 > something
 >>> happened to the underlying network socket. For example you will see
 >>> those when a server node is restarted.
 >>>
 >>> Some questions that might help troubleshoot this:
 >>> How often are these occurring?
 >>> Does this affect both nodes in the cluster or just one?
 >>> Are the exceptions related to any external events (e.g. node
 > restarts,
 >>> network issues...)?
 >>> What versions of Hector and Cassandra are you running?
 >>>
 >>> Keep in mind that failures like this will normally be retried by
 >>> Hector, resulting in no loss of data. For that reason, I think that
 >>> exception is logged as a warning in the newest Hector versions.
 >>>
 >>> We've seen something similar, but more catastrophic because it
 > affects
 >>> connectivity to the entire cluster, not just a single node. See
this
 >>> post for more details: http://goo.gl/hrgkw So far we haven't
 >>> identified the cause.
 >>>
 >>> Jim
 >>>
 >>>> Anthony
 >>>>
 >>>> 2011-08-02 08:43:06,541 ERROR
 >>>> [me.prettyprint.cassandra.connection.HThriftClient] - Could not
 > flush
 >>>> transport (to be expected if the pool is shutting down) in close
 > for
 >>>> client:
 >>>> CassandraClient<cassandradevrk1:9393-33>
 >>>> org.apache.thrift.transport.TTransportException:
 >>>> java.net.SocketException:
 >>>> Broken pipe
 >>>> at
 >>>>
 >>>>
 >
org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
 >>>> at
 >>>>
 >>>>
 >
org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
 >>>> at
 >>>>
 >>>>
 >
me.prettyprint.cassandra.connection.HThriftClient.close(HThriftClient.java:85)
 >>>> at
 >>>>
 >>>>
 >
me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
 >>>> at
 >>>>
 >>>>
 >
me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
 >>>> at
 >>>>
 >>>>
 >
me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:289)
 >>>> at
 >>>>
 >>>>
 >
me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53)
 >>>> at
 >>>>
 >>>>
 >
me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49)
 >>>> at
 >>>>
 >>>>
 >
me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
 >>>> at
 >>>>
 >>>>
 >
me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
 >>>> at
 >>>>
 >>>>
 >
me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48)
 >>>> at
 >>>>
 >>>>
 >
com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.findRegistryByProfileId(StandardFindRegistryPersistenceStrategy.java:237)
 >>>> at
 >>>>
 >>>>
 >
com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.execute(StandardFindRegistryPersistenceStrategy.java:277)
 >>>> at
 >>>>
 >>>>
 >
com.wsgc.services.registry.registryservice.impl.service.StandardRegistryService.getRegistriesByProfileId(StandardRegistryService.java:327)
 >>>> at
 >>>>
 >>>>
 >
com.wsgc.services.registry.webapp.impl.RegistryServicesController.getRegistriesByProfileId(RegistryServicesController.java:247)
 >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
 > Method)
 >>>> at
 >>>>
 >>>>
 >
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 >>>> at
 >>>>
 >>>>
 >
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 >>>> at java.lang.reflect.Method.invoke(Method.java:597)
 >>>> at
 >>>>
 >>>>
 >
org.springframework.web.bind.annotation.support.HandlerMethodInvoker.invokeHandlerMethod(HandlerMethodInvoker.java:175)
 >>>> at
 >>>>
 >>>>
 >
org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.invokeHandlerMethod(AnnotationMethodHandlerAdapter.java:421)
 >>>> at
 >>>>
 >>>>
 >
org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.handle(AnnotationMethodHandlerAdapter.java:409)
 >>>> at
 >>>>
 >>>>
 >
org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:774)
 >>>> at
 >>>>
 >>>>
 >
org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:719)
 >>>> at
 >>>>
 >>>>
 >
org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:644)
 >>>> at
 >>>>
 >>>>
 >
org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:549)
 >>>> at
 > javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
 >>>> at
 > javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
 >>>> at
 >>>>
 >>>>
 >
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
 >>>> at
 >>>>
 >>>>
 >
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 >>>> at
 >>>>
 >>>>
 >
org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:77)
 >>>> at
 >>>>
 >>>>
 >
org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:76)
 >>>> at
 >>>>
 >>>>
 >
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 >>>> at
 >>>>
 >>>>
 >
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 >>>> at
 >>>>
 >>>>
 >
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 >>>> at
 >>>>
 >>>>
 >
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 >>>> at
 >>>>
 >>>>
 >
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:563)
 >>>> at
 >>>>
 >>>>
 >
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 >>>> at
 >>>>
 >>>>
 >
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 >>>> at
 >>>>
 >>>>
 >
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 >>>> at
 >>>>
 >>>>
 >
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
 >>>> at
 >>>>
 > org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190)
 >>>> at
 >>>>
 > org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
 >>>> at
 >>>> org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:774)
 >>>> at
 >>>>
 >>>>
 >
org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:703)
 >>>> at
 >>>>
 >>>>
 >
org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:896)
 >>>> at
 >>>>
 >>>>
 >
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
 >>>> at java.lang.Thread.run(Thread.java:662)
 >>>> Caused by: java.net.SocketException: Broken pipe
 >>>> at java.net.SocketOutputStream.socketWrite0(Native Method)
 >>>> at
 >>>>
 > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
 >>>> at
 > java.net.SocketOutputStream.write(SocketOutputStream.java:136)
 >>>> at
 >>>>
 >>>>
 >
org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
 >>>> ... 47 more
 >>>> 2011-08-02 08:43:06,543 ERROR
 >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK
 > HOST AS
 >>>> DOWN
 >>>> TRIGGERED for host cassandradevrk1(10.130.202.34):9393
 >>>> 2011-08-02 08:43:06,543 ERROR
 >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool
 > state on
 >>>> shutdown:
 >>>>
 >>>>
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393};
 >>>> IsActive?: true; Active: 1; Blocked: 0; Idle: 15;
 > NumBeforeExhausted: 49
 >>>> 2011-08-02 08:43:06,543 ERROR
 >>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
 > Shutdown
 >>>> triggered on
 >>>>
 >>>>
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
 >>>> 2011-08-02 08:43:06,544 ERROR
 >>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
 > Shutdown
 >>>> complete on
 >>>>
 >>>>
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
 >>>> 2011-08-02 08:43:06,544 INFO
 >>>> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
 > Host
 >>>> detected as down was added to retry queue:
 >>>> cassandradevrk1(10.130.202.34):9393
 >>>> 2011-08-02 08:43:06,544 WARN
 >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - Could
 > not
 >>>> fullfill request on this host
 > CassandraClient<cassandradevrk1:9393-33>
 >>>>
 >>
 >>
 >

RE: Trying to find the problem with a broken pipe

Reply via email to