Re: Trying to find the problem with a broken pipe

aaron morton Thu, 04 Aug 2011 15:58:46 -0700

It's probably a network thing. 

The only thing I can think of in cassandra is thrift_max_message_length_in_mb 
in the config. That config setting will result in a TException thrown on the 
server side (i think), not sure if that makes the server kill the socket. I 
would hope the error returns to the client.


Perhaps check the server log. 

Cheers
 
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 4 Aug 2011, at 23:05, Tim Snyder wrote:

> I am getting the same problem (Broken Pipe) on a loader program, after
> about 8 million read, write pairs. I am pushing serialized objects into
> a column with the program, the object it seems to be doing it on is much
> larger than the prior objects, so I am wondering if it is possibly a
> column size streaming issue through the thrift api? I am using Cassandra
> 0.8.0 and Hector 0.8.0-1
> 
> Tim
> 
> -------- Original Message --------
> Subject: Re: Trying to find the problem with a broken pipe
> From: Anthony Ikeda <anthony.ikeda....@gmail.com>
> Date: Tue, August 02, 2011 10:43 pm
> To: user@cassandra.apache.org
> 
>> Very interesting. After the second host goes down do you see
>> "me.prettyprint.hector.api.exceptions.HectorException: All host pools
>> marked down. Retry burden pushed out to client"?
> 
> No, the last message is:
> 2011-08-02 08:43:06,561 INFO
> [me.prettyprint.cassandra.connection.HConnectionManager] - Client
> CassandraClient<cassandradevrk2:9393-49> released to inactive or dead
> pool. Closing.> Does your client recover after a period of time?
> 
> 
> 
> The application seems to be fine for now but my concern is the
> connection pooling as well - I mean do we have one pool or multiple?
> I'll post to the Hector user group about the pooling because the
> incident seems so isolated. We also have our infrastructure team looking
> into the communication between the application server and the cassandra
> nodes.
> 
> 
> So far it's still a mystery.
> 
> 
> 
> 
> 
> On Tue, Aug 2, 2011 at 1:25 PM, Jim Ancona <j...@anconafamily.com> wrote:
> On Tue, Aug 2, 2011 at 6:13 PM, Anthony Ikeda
> <anthony.ikeda....@gmail.com> wrote:
> 
>> The link (which I may be misreading)
>> is
> http://groups.google.com/group/hector-users/browse_thread/thread/8d7004b6f85a0f2e
> 
> 
> I hadn't found that one, but I doubt that our issue is related to that.
> 
> 
>> It's only started happening today and happened on 2 occassions (8:43
> and
>> 10:21) performing the same function (querying a column family).
>> It seems to be trying to access a connection on one of the servers
>> The client accesses the first node:
>> 
>> 2011-08-02 08:43:06,541 ERROR
>> [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush
>> transport (to be expected if the pool is shutting down) in close for
> client:
>> CassandraClient<cassandradevrk1:9393-33>
>> org.apache.thrift.transport.TTransportException:
> java.net.SocketException:
>> Broken pipe
>> 
>> ...
>> 2011-08-02 08:43:06,544 WARN
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not
>> fullfill request on this host
> CassandraClient<cassandradevrk1:9393-33>
>> 
>> ...
>> 
>> 2011-08-02 08:43:06,543 ERROR
>> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST
> AS DOWN
>> TRIGGERED for host cassandradevrk1(10.130.202.34):9393
>> 2011-08-02 08:43:06,543 ERROR
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool state
> on
>> shutdown:
>> 
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393};
>> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted:
> 49
>> 2011-08-02 08:43:06,543 ERROR
>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> Shutdown
>> triggered on
>> 
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
>> 2011-08-02 08:43:06,544 ERROR
>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> Shutdown
>> complete on
>> 
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
>> 2011-08-02 08:43:06,544 INFO
>> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
> Host
>> detected as down was added to retry queue:
>> cassandradevrk1(10.130.202.34):9393
>> 2011-08-02 08:43:06,544 WARN
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not
>> fullfill request on this host CassandraClient<
>> cassandradevrk1:9393-33>
>> 2011-08-02 08:43:06,544 WARN
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Exception:
>> me.prettyprint.hector.api.exceptions.HectorTransportException:
>> org.apache.thrift.transport.TTransportException:
> java.net.SocketException:
>> Connection reset
>> 
>> 
>> Then it appears to try the second node and fails:
>> 
>> 2011-08-02 08:43:06,556 INFO
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Client
>> CassandraClient<cassandradevrk1:9393-33> released to inactive or dead
> pool.
>> Closing.
>> 2011-08-02 08:43:06,557 ERROR
>> [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush
>> transport (to be expected if the pool is shutting down) in close for
> client:
>> CassandraClient<cassandradevrk2:9393-49>
>> 
>> org.apache.thrift.transport.TTransportException:
> java.net.SocketException:
>> Broken pipe
>> 
>> 2011-08-02 08:43:06,558 ERROR
>> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST
> AS DOWN
>> TRIGGERED for host cassandradevrk2(10.130.202.35):9393
>> 2011-08-02 08:43:06,559 ERROR
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool state
> on
>> shutdown:
>> 
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393};
>> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted:
> 49
>> 2011-08-02 08:43:06,559 ERROR
>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> Shutdown
>> triggered on
>> 
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393}
>> 2011-08-02 08:43:06,559 ERROR
>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> Shutdown
>> complete on
>> 
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393}
>> 2011-08-02 08:43:06,559 INFO
>> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
> Host
>> detected as down was added to retry queue:
>> cassandradevrk2(10.130.202.35):9393
>> 2011-08-02 08:43:06,560 WARN
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not
>> fullfill request on this host
> CassandraClient<cassandradevrk2:9393-49>
>> 2011-08-02 08:43:06,560 WARN
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Exception:
>> me.prettyprint.hector.api.exceptions.HectorTransportException:
>> org.apache.thrift.transport.TTransportException:
> java.net.SocketException:
>> Connection reset
> 
> 
> 
> Very interesting. After the second host goes down do you see
> "me.prettyprint.hector.api.exceptions.HectorException: All host pools
> marked down. Retry burden pushed out to client"?
> 
> Does your client recover after a period of time?
> 
>> 
>> The process is the same at 10:21.
>> Are the exceptions related to any external events (e.g. node
>> restarts, network issues...)?
>> Not that I'm aware, unless there are firewall timeouts between the
>> application and the node servers. Let me find out. The cassandra log
> files
>> have no errors reported.
>> What versions of Hector and Cassandra are you running?
>> Cassandra 0.8.1, Hector 0.8.0-1
> 
> 
> Our issue is occurring with Cassandra 0.7.8 and Hector 0.7-30. We plan
> to deploy Hector 0.7-31 this week and to turn on useSocketKeepalive.
> Are you using that? We're also using tcpdump to capture packets when
> failures occur to see if there are anomalies in the network traffic.
> 
> Jim
> 
> 
> 
>> 
>> 
>> 
>> On Tue, Aug 2, 2011 at 10:37 AM, Jim Ancona <j...@anconafamily.com>
> wrote:
>>> 
>>> On Tue, Aug 2, 2011 at 4:36 PM, Anthony Ikeda
>>> <anthony.ikeda....@gmail.com> wrote:
>>>> I'm not sure if this is a problem with Hector or with Cassandra.
>>>> We seem to be seeing broken pipe issues with our connections on
> the
>>>> client
>>>> side (Exception below). A bit of googling finds possibly a problem
> with
>>>> the
>>>> amount of data we are trying to store, although I'm certain our
> datasets
>>>> are
>>>> not all that large.
>>> 
>>> I'm not sure what you're referring to here. Large requests could
> lead
>>> to timeouts, but that's not what you're seeing here. Could you link
> to
>>> the page you're referencing?
>>> 
>>>> A nodetool ring command doesn't seem to present any downed nodes:
>>>> Address         DC          Rack        Status State   Load
>>>> Owns
>>>>   Token
>>>> 
>>>>  153951716904446304929228999025275230571
>>>> 10.130.202.34   datacenter1 rack1       Up     Normal  470.74 KB
>>>> 79.19%  118538200848404459763384037192174096102
>>>> 10.130.202.35   datacenter1 rack1       Up     Normal  483.63 KB
>>>> 20.81%  153951716904446304929228999025275230571
>>>> 
>>>> There are no errors in the cassandra server logs.
>>>> 
>>>> Are there any particular timeouts on connections that we need to
> be
>>>> aware
>>>> of? Or perhaps configure on the Cassandra nodes? Is this purely
> and
>>>> issue
>>>> with the Hector API configuration?
>>> 
>>> There is a server side timeout (rpc_timeout_in_ms in cassandra.yaml)
>>> and a Hector client-side timeout
>>> (CassandraHostConfigurator.cassandraThriftSocketTimeout). But again,
>>> the "Broken pipe" error is not a timeout, it indicates that
> something
>>> happened to the underlying network socket. For example you will see
>>> those when a server node is restarted.
>>> 
>>> Some questions that might help troubleshoot this:
>>> How often are these occurring?
>>> Does this affect both nodes in the cluster or just one?
>>> Are the exceptions related to any external events (e.g. node
> restarts,
>>> network issues...)?
>>> What versions of Hector and Cassandra are you running?
>>> 
>>> Keep in mind that failures like this will normally be retried by
>>> Hector, resulting in no loss of data. For that reason, I think that
>>> exception is logged as a warning in the newest Hector versions.
>>> 
>>> We've seen something similar, but more catastrophic because it
> affects
>>> connectivity to the entire cluster, not just a single node. See this
>>> post for more details: http://goo.gl/hrgkw So far we haven't
>>> identified the cause.
>>> 
>>> Jim
>>> 
>>>> Anthony
>>>> 
>>>> 2011-08-02 08:43:06,541 ERROR
>>>> [me.prettyprint.cassandra.connection.HThriftClient] - Could not
> flush
>>>> transport (to be expected if the pool is shutting down) in close
> for
>>>> client:
>>>> CassandraClient<cassandradevrk1:9393-33>
>>>> org.apache.thrift.transport.TTransportException:
>>>> java.net.SocketException:
>>>> Broken pipe
>>>>      at
>>>> 
>>>> 
> org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
>>>>      at
>>>> 
>>>> 
> org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
>>>>      at
>>>> 
>>>> 
> me.prettyprint.cassandra.connection.HThriftClient.close(HThriftClient.java:85)
>>>>      at
>>>> 
>>>> 
> me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
>>>>      at
>>>> 
>>>> 
> me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
>>>>      at
>>>> 
>>>> 
> me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:289)
>>>>      at
>>>> 
>>>> 
> me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53)
>>>>      at
>>>> 
>>>> 
> me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49)
>>>>      at
>>>> 
>>>> 
> me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
>>>>      at
>>>> 
>>>> 
> me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
>>>>      at
>>>> 
>>>> 
> me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48)
>>>>      at
>>>> 
>>>> 
> com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.findRegistryByProfileId(StandardFindRegistryPersistenceStrategy.java:237)
>>>>      at
>>>> 
>>>> 
> com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.execute(StandardFindRegistryPersistenceStrategy.java:277)
>>>>      at
>>>> 
>>>> 
> com.wsgc.services.registry.registryservice.impl.service.StandardRegistryService.getRegistriesByProfileId(StandardRegistryService.java:327)
>>>>      at
>>>> 
>>>> 
> com.wsgc.services.registry.webapp.impl.RegistryServicesController.getRegistriesByProfileId(RegistryServicesController.java:247)
>>>>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
>>>>      at
>>>> 
>>>> 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>      at
>>>> 
>>>> 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>>      at
>>>> 
>>>> 
> org.springframework.web.bind.annotation.support.HandlerMethodInvoker.invokeHandlerMethod(HandlerMethodInvoker.java:175)
>>>>      at
>>>> 
>>>> 
> org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.invokeHandlerMethod(AnnotationMethodHandlerAdapter.java:421)
>>>>      at
>>>> 
>>>> 
> org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.handle(AnnotationMethodHandlerAdapter.java:409)
>>>>      at
>>>> 
>>>> 
> org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:774)
>>>>      at
>>>> 
>>>> 
> org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:719)
>>>>      at
>>>> 
>>>> 
> org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:644)
>>>>      at
>>>> 
>>>> 
> org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:549)
>>>>      at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
>>>>      at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
>>>>      at
>>>> 
>>>> 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
>>>>      at
>>>> 
>>>> 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>>>>      at
>>>> 
>>>> 
> org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:77)
>>>>      at
>>>> 
>>>> 
> org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:76)
>>>>      at
>>>> 
>>>> 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>>>>      at
>>>> 
>>>> 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>>>>      at
>>>> 
>>>> 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>>>>      at
>>>> 
>>>> 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>>>>      at
>>>> 
>>>> 
> org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:563)
>>>>      at
>>>> 
>>>> 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>>>>      at
>>>> 
>>>> 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>>>>      at
>>>> 
>>>> 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>>>>      at
>>>> 
>>>> 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
>>>>      at
>>>> 
> org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190)
>>>>      at
>>>> 
> org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
>>>>      at
>>>> org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:774)
>>>>      at
>>>> 
>>>> 
> org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:703)
>>>>      at
>>>> 
>>>> 
> org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:896)
>>>>      at
>>>> 
>>>> 
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
>>>>      at java.lang.Thread.run(Thread.java:662)
>>>> Caused by: java.net.SocketException: Broken pipe
>>>>      at java.net.SocketOutputStream.socketWrite0(Native Method)
>>>>      at
>>>> 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>>>>      at
> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>>>>      at
>>>> 
>>>> 
> org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
>>>>      ... 47 more
>>>> 2011-08-02 08:43:06,543 ERROR
>>>> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK
> HOST AS
>>>> DOWN
>>>> TRIGGERED for host cassandradevrk1(10.130.202.34):9393
>>>> 2011-08-02 08:43:06,543 ERROR
>>>> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool
> state on
>>>> shutdown:
>>>> 
>>>> 
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393};
>>>> IsActive?: true; Active: 1; Blocked: 0; Idle: 15;
> NumBeforeExhausted: 49
>>>> 2011-08-02 08:43:06,543 ERROR
>>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> Shutdown
>>>> triggered on
>>>> 
>>>> 
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
>>>> 2011-08-02 08:43:06,544 ERROR
>>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> Shutdown
>>>> complete on
>>>> 
>>>> 
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
>>>> 2011-08-02 08:43:06,544 INFO
>>>> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
> Host
>>>> detected as down was added to retry queue:
>>>> cassandradevrk1(10.130.202.34):9393
>>>> 2011-08-02 08:43:06,544 WARN
>>>> [me.prettyprint.cassandra.connection.HConnectionManager] - Could
> not
>>>> fullfill request on this host
> CassandraClient<cassandradevrk1:9393-33>
>>>> 
>> 
>> 
>

Re: Trying to find the problem with a broken pipe

Reply via email to