RE: Trying to find the problem with a broken pipe

Tim Snyder Thu, 04 Aug 2011 06:06:33 -0700

I am getting the same problem (Broken Pipe) on a loader program, after
about 8 million read, write pairs. I am pushing serialized objects into
a column with the program, the object it seems to be doing it on is much
larger than the prior objects, so I am wondering if it is possibly a
column size streaming issue through the thrift api? I am using Cassandra
0.8.0 and Hector 0.8.0-1


Tim

-------- Original Message --------
Subject: Re: Trying to find the problem with a broken pipe
From: Anthony Ikeda <anthony.ikeda....@gmail.com>
Date: Tue, August 02, 2011 10:43 pm
To: user@cassandra.apache.org

> Very interesting. After the second host goes down do you see
> "me.prettyprint.hector.api.exceptions.HectorException: All host pools
 > marked down. Retry burden pushed out to client"?

No, the last message is:
2011-08-02 08:43:06,561 INFO
[me.prettyprint.cassandra.connection.HConnectionManager] - Client
CassandraClient<cassandradevrk2:9393-49> released to inactive or dead
pool. Closing.> Does your client recover after a period of time?


 
The application seems to be fine for now but my concern is the
connection pooling as well - I mean do we have one pool or multiple?
I'll post to the Hector user group about the pooling because the
incident seems so isolated. We also have our infrastructure team looking
into the communication between the application server and the cassandra
nodes.


So far it's still a mystery.





On Tue, Aug 2, 2011 at 1:25 PM, Jim Ancona <j...@anconafamily.com> wrote:
 On Tue, Aug 2, 2011 at 6:13 PM, Anthony Ikeda
 <anthony.ikeda....@gmail.com> wrote:
 
> The link (which I may be misreading)
 > is
http://groups.google.com/group/hector-users/browse_thread/thread/8d7004b6f85a0f2e
 
 
I hadn't found that one, but I doubt that our issue is related to that.
 

 > It's only started happening today and happened on 2 occassions (8:43
and
 > 10:21) performing the same function (querying a column family).
 > It seems to be trying to access a connection on one of the servers
 > The client accesses the first node:
 >
 > 2011-08-02 08:43:06,541 ERROR
 > [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush
 > transport (to be expected if the pool is shutting down) in close for
client:
 > CassandraClient<cassandradevrk1:9393-33>
 > org.apache.thrift.transport.TTransportException:
java.net.SocketException:
 > Broken pipe
 >
 > ...
 > 2011-08-02 08:43:06,544 WARN
 > [me.prettyprint.cassandra.connection.HConnectionManager] - Could not
 > fullfill request on this host
CassandraClient<cassandradevrk1:9393-33>
 >
 > ...
 >
 > 2011-08-02 08:43:06,543 ERROR
 > [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST
AS DOWN
 > TRIGGERED for host cassandradevrk1(10.130.202.34):9393
 > 2011-08-02 08:43:06,543 ERROR
 > [me.prettyprint.cassandra.connection.HConnectionManager] - Pool state
on
 > shutdown:
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393};
 > IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted:
49
 > 2011-08-02 08:43:06,543 ERROR
 > [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
Shutdown
 > triggered on
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
 > 2011-08-02 08:43:06,544 ERROR
 > [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
Shutdown
 > complete on
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
 > 2011-08-02 08:43:06,544 INFO
 > [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
Host
 > detected as down was added to retry queue:
 > cassandradevrk1(10.130.202.34):9393
 > 2011-08-02 08:43:06,544 WARN
 > [me.prettyprint.cassandra.connection.HConnectionManager] - Could not
 > fullfill request on this host CassandraClient<
 > cassandradevrk1:9393-33>
 > 2011-08-02 08:43:06,544 WARN
 > [me.prettyprint.cassandra.connection.HConnectionManager] - Exception:
 > me.prettyprint.hector.api.exceptions.HectorTransportException:
 > org.apache.thrift.transport.TTransportException:
java.net.SocketException:
 > Connection reset
 >
 >
 > Then it appears to try the second node and fails:
 >
 > 2011-08-02 08:43:06,556 INFO
 > [me.prettyprint.cassandra.connection.HConnectionManager] - Client
 > CassandraClient<cassandradevrk1:9393-33> released to inactive or dead
pool.
 > Closing.
 > 2011-08-02 08:43:06,557 ERROR
 > [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush
 > transport (to be expected if the pool is shutting down) in close for
client:
 > CassandraClient<cassandradevrk2:9393-49>
 >
 > org.apache.thrift.transport.TTransportException:
java.net.SocketException:
 > Broken pipe
 >
 > 2011-08-02 08:43:06,558 ERROR
 > [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST
AS DOWN
 > TRIGGERED for host cassandradevrk2(10.130.202.35):9393
 > 2011-08-02 08:43:06,559 ERROR
 > [me.prettyprint.cassandra.connection.HConnectionManager] - Pool state
on
 > shutdown:
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393};
 > IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted:
49
 > 2011-08-02 08:43:06,559 ERROR
 > [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
Shutdown
 > triggered on
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393}
 > 2011-08-02 08:43:06,559 ERROR
 > [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
Shutdown
 > complete on
 >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393}
 > 2011-08-02 08:43:06,559 INFO
 > [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
Host
 > detected as down was added to retry queue:
 > cassandradevrk2(10.130.202.35):9393
 > 2011-08-02 08:43:06,560 WARN
 > [me.prettyprint.cassandra.connection.HConnectionManager] - Could not
 > fullfill request on this host
CassandraClient<cassandradevrk2:9393-49>
 > 2011-08-02 08:43:06,560 WARN
 > [me.prettyprint.cassandra.connection.HConnectionManager] - Exception:
 > me.prettyprint.hector.api.exceptions.HectorTransportException:
 > org.apache.thrift.transport.TTransportException:
java.net.SocketException:
 > Connection reset
 
 

Very interesting. After the second host goes down do you see
 "me.prettyprint.hector.api.exceptions.HectorException: All host pools
 marked down. Retry burden pushed out to client"?
 
 Does your client recover after a period of time?
 
 >
 > The process is the same at 10:21.
 > Are the exceptions related to any external events (e.g. node
 > restarts, network issues...)?
 > Not that I'm aware, unless there are firewall timeouts between the
 > application and the node servers. Let me find out. The cassandra log
files
 > have no errors reported.
 > What versions of Hector and Cassandra are you running?
 > Cassandra 0.8.1, Hector 0.8.0-1
 
 
Our issue is occurring with Cassandra 0.7.8 and Hector 0.7-30. We plan
 to deploy Hector 0.7-31 this week and to turn on useSocketKeepalive.
 Are you using that? We're also using tcpdump to capture packets when
 failures occur to see if there are anomalies in the network traffic.
 
 Jim
 

 
 >
 >
 >
 > On Tue, Aug 2, 2011 at 10:37 AM, Jim Ancona <j...@anconafamily.com>
wrote:
 >>
 >> On Tue, Aug 2, 2011 at 4:36 PM, Anthony Ikeda
 >> <anthony.ikeda....@gmail.com> wrote:
 >> > I'm not sure if this is a problem with Hector or with Cassandra.
 >> > We seem to be seeing broken pipe issues with our connections on
the
 >> > client
 >> > side (Exception below). A bit of googling finds possibly a problem
with
 >> > the
 >> > amount of data we are trying to store, although I'm certain our
datasets
 >> > are
 >> > not all that large.
 >>
 >> I'm not sure what you're referring to here. Large requests could
lead
 >> to timeouts, but that's not what you're seeing here. Could you link
to
 >> the page you're referencing?
 >>
 >> > A nodetool ring command doesn't seem to present any downed nodes:
 >> > Address         DC          Rack        Status State   Load
 >> >  Owns
 >> >    Token
 >> >
 >> >   153951716904446304929228999025275230571
 >> > 10.130.202.34   datacenter1 rack1       Up     Normal  470.74 KB
 >> > 79.19%  118538200848404459763384037192174096102
 >> > 10.130.202.35   datacenter1 rack1       Up     Normal  483.63 KB
 >> > 20.81%  153951716904446304929228999025275230571
 >> >
 >> > There are no errors in the cassandra server logs.
 >> >
 >> > Are there any particular timeouts on connections that we need to
be
 >> > aware
 >> > of? Or perhaps configure on the Cassandra nodes? Is this purely
and
 >> > issue
 >> > with the Hector API configuration?
 >>
 >> There is a server side timeout (rpc_timeout_in_ms in cassandra.yaml)
 >> and a Hector client-side timeout
 >> (CassandraHostConfigurator.cassandraThriftSocketTimeout). But again,
 >> the "Broken pipe" error is not a timeout, it indicates that
something
 >> happened to the underlying network socket. For example you will see
 >> those when a server node is restarted.
 >>
 >> Some questions that might help troubleshoot this:
 >> How often are these occurring?
 >> Does this affect both nodes in the cluster or just one?
 >> Are the exceptions related to any external events (e.g. node
restarts,
 >> network issues...)?
 >> What versions of Hector and Cassandra are you running?
 >>
 >> Keep in mind that failures like this will normally be retried by
 >> Hector, resulting in no loss of data. For that reason, I think that
 >> exception is logged as a warning in the newest Hector versions.
 >>
 >> We've seen something similar, but more catastrophic because it
affects
 >> connectivity to the entire cluster, not just a single node. See this
 >> post for more details: http://goo.gl/hrgkw So far we haven't
 >> identified the cause.
 >>
 >> Jim
 >>
 >> > Anthony
 >> >
 >> > 2011-08-02 08:43:06,541 ERROR
 >> > [me.prettyprint.cassandra.connection.HThriftClient] - Could not
flush
 >> > transport (to be expected if the pool is shutting down) in close
for
 >> > client:
 >> > CassandraClient<cassandradevrk1:9393-33>
 >> > org.apache.thrift.transport.TTransportException:
 >> > java.net.SocketException:
 >> > Broken pipe
 >> >       at
 >> >
 >> >
org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
 >> >       at
 >> >
 >> >
org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
 >> >       at
 >> >
 >> >
me.prettyprint.cassandra.connection.HThriftClient.close(HThriftClient.java:85)
 >> >       at
 >> >
 >> >
me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
 >> >       at
 >> >
 >> >
me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
 >> >       at
 >> >
 >> >
me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:289)
 >> >       at
 >> >
 >> >
me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53)
 >> >       at
 >> >
 >> >
me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49)
 >> >       at
 >> >
 >> >
me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
 >> >       at
 >> >
 >> >
me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
 >> >       at
 >> >
 >> >
me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48)
 >> >       at
 >> >
 >> >
com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.findRegistryByProfileId(StandardFindRegistryPersistenceStrategy.java:237)
 >> >       at
 >> >
 >> >
com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.execute(StandardFindRegistryPersistenceStrategy.java:277)
 >> >       at
 >> >
 >> >
com.wsgc.services.registry.registryservice.impl.service.StandardRegistryService.getRegistriesByProfileId(StandardRegistryService.java:327)
 >> >       at
 >> >
 >> >
com.wsgc.services.registry.webapp.impl.RegistryServicesController.getRegistriesByProfileId(RegistryServicesController.java:247)
 >> >       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
 >> >       at
 >> >
 >> >
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 >> >       at
 >> >
 >> >
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 >> >       at java.lang.reflect.Method.invoke(Method.java:597)
 >> >       at
 >> >
 >> >
org.springframework.web.bind.annotation.support.HandlerMethodInvoker.invokeHandlerMethod(HandlerMethodInvoker.java:175)
 >> >       at
 >> >
 >> >
org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.invokeHandlerMethod(AnnotationMethodHandlerAdapter.java:421)
 >> >       at
 >> >
 >> >
org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.handle(AnnotationMethodHandlerAdapter.java:409)
 >> >       at
 >> >
 >> >
org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:774)
 >> >       at
 >> >
 >> >
org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:719)
 >> >       at
 >> >
 >> >
org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:644)
 >> >       at
 >> >
 >> >
org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:549)
 >> >       at
javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
 >> >       at
javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
 >> >       at
 >> >
 >> >
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
 >> >       at
 >> >
 >> >
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 >> >       at
 >> >
 >> >
org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:77)
 >> >       at
 >> >
 >> >
org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:76)
 >> >       at
 >> >
 >> >
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 >> >       at
 >> >
 >> >
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 >> >       at
 >> >
 >> >
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 >> >       at
 >> >
 >> >
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 >> >       at
 >> >
 >> >
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:563)
 >> >       at
 >> >
 >> >
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 >> >       at
 >> >
 >> >
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 >> >       at
 >> >
 >> >
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 >> >       at
 >> >
 >> >
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
 >> >       at
 >> >
org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190)
 >> >       at
 >> >
org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
 >> >       at
 >> > org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:774)
 >> >       at
 >> >
 >> >
org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:703)
 >> >       at
 >> >
 >> >
org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:896)
 >> >       at
 >> >
 >> >
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
 >> >       at java.lang.Thread.run(Thread.java:662)
 >> > Caused by: java.net.SocketException: Broken pipe
 >> >       at java.net.SocketOutputStream.socketWrite0(Native Method)
 >> >       at
 >> >
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
 >> >       at
java.net.SocketOutputStream.write(SocketOutputStream.java:136)
 >> >       at
 >> >
 >> >
org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
 >> >       ... 47 more
 >> > 2011-08-02 08:43:06,543 ERROR
 >> > [me.prettyprint.cassandra.connection.HConnectionManager] - MARK
HOST AS
 >> > DOWN
 >> > TRIGGERED for host cassandradevrk1(10.130.202.34):9393
 >> > 2011-08-02 08:43:06,543 ERROR
 >> > [me.prettyprint.cassandra.connection.HConnectionManager] - Pool
state on
 >> > shutdown:
 >> >
 >> >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393};
 >> > IsActive?: true; Active: 1; Blocked: 0; Idle: 15;
NumBeforeExhausted: 49
 >> > 2011-08-02 08:43:06,543 ERROR
 >> > [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
Shutdown
 >> > triggered on
 >> >
 >> >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
 >> > 2011-08-02 08:43:06,544 ERROR
 >> > [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
Shutdown
 >> > complete on
 >> >
 >> >
<ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
 >> > 2011-08-02 08:43:06,544 INFO
 >> > [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
Host
 >> > detected as down was added to retry queue:
 >> > cassandradevrk1(10.130.202.34):9393
 >> > 2011-08-02 08:43:06,544 WARN
 >> > [me.prettyprint.cassandra.connection.HConnectionManager] - Could
not
 >> > fullfill request on this host
CassandraClient<cassandradevrk1:9393-33>
 >> >
 >
 >

RE: Trying to find the problem with a broken pipe

Reply via email to