Anthony, All I can say is that the steps I outlined below fixed my problem and allowed me to process 60 million rows of data.
Tim -------- Original Message -------- Subject: Re: Trying to find the problem with a broken pipe From: Anthony Ikeda <anthony.ikeda....@gmail.com> Date: Tue, August 09, 2011 3:11 am To: user@cassandra.apache.org Tim do you know if this is the actual reason that is causing the broken pipe? I'm having a hard time convincing my team that modifying this value will fix the issue. Jonathan, do you know if there is a valid explanation on why Tim no longer has the problem based on this change? Anthony Ikeda On Thu, Aug 4, 2011 at 11:13 PM, Tim Snyder <t...@proinnovations.com> wrote: I no longer get the error on the loader program. The steps I took to fix it are increasing the thrift_max_message_length_in_mb msg length, stopping cassandra, blowing away the prior data store, and then restarting cassandra. Tim -------- Original Message -------- Subject: Re: Trying to find the problem with a broken pipe From: aaron morton <aa...@thelastpickle.com> Date: Fri, August 05, 2011 12:58 am To: user@cassandra.apache.org It's probably a network thing. The only thing I can think of in cassandra is thrift_max_message_length_in_mb in the config. That config setting will result in a TException thrown on the server side (i think), not sure if that makes the server kill the socket. I would hope the error returns to the client. Perhaps check the server log. Cheers ----------------- Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 4 Aug 2011, at 23:05, Tim Snyder wrote: > I am getting the same problem (Broken Pipe) on a loader program, after > about 8 million read, write pairs. I am pushing serialized objects into > a column with the program, the object it seems to be doing it on is much > larger than the prior objects, so I am wondering if it is possibly a > column size streaming issue through the thrift api? I am using Cassandra > 0.8.0 and Hector 0.8.0-1 > > Tim > > -------- Original Message -------- > Subject: Re: Trying to find the problem with a broken pipe > From: Anthony Ikeda <anthony.ikeda....@gmail.com> > Date: Tue, August 02, 2011 10:43 pm > To: user@cassandra.apache.org > >> Very interesting. After the second host goes down do you see >> "me.prettyprint.hector.api.exceptions.HectorException: All host pools >> marked down. Retry burden pushed out to client"? > > No, the last message is: > 2011-08-02 08:43:06,561 INFO > [me.prettyprint.cassandra.connection.HConnectionManager] - Client > CassandraClient<cassandradevrk2:9393-49> released to inactive or dead > pool. Closing.> Does your client recover after a period of time? > > > > The application seems to be fine for now but my concern is the > connection pooling as well - I mean do we have one pool or multiple? > I'll post to the Hector user group about the pooling because the > incident seems so isolated. We also have our infrastructure team looking > into the communication between the application server and the cassandra > nodes. > > > So far it's still a mystery. > > > > > > On Tue, Aug 2, 2011 at 1:25 PM, Jim Ancona <j...@anconafamily.com> wrote: > On Tue, Aug 2, 2011 at 6:13 PM, Anthony Ikeda > <anthony.ikeda....@gmail.com> wrote: > >> The link (which I may be misreading) >> is > http://groups.google.com/group/hector-users/browse_thread/thread/8d7004b6f85a0f2e > > > I hadn't found that one, but I doubt that our issue is related to that. > > >> It's only started happening today and happened on 2 occassions (8:43 > and >> 10:21) performing the same function (querying a column family). >> It seems to be trying to access a connection on one of the servers >> The client accesses the first node: >> >> 2011-08-02 08:43:06,541 ERROR >> [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush >> transport (to be expected if the pool is shutting down) in close for > client: >> CassandraClient<cassandradevrk1:9393-33> >> org.apache.thrift.transport.TTransportException: > java.net.SocketException: >> Broken pipe >> >> ... >> 2011-08-02 08:43:06,544 WARN >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not >> fullfill request on this host > CassandraClient<cassandradevrk1:9393-33> >> >> ... >> >> 2011-08-02 08:43:06,543 ERROR >> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST > AS DOWN >> TRIGGERED for host cassandradevrk1(10.130.202.34):9393 >> 2011-08-02 08:43:06,543 ERROR >> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool state > on >> shutdown: >> > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}; >> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted: > 49 >> 2011-08-02 08:43:06,543 ERROR >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > Shutdown >> triggered on >> > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393} >> 2011-08-02 08:43:06,544 ERROR >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > Shutdown >> complete on >> > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393} >> 2011-08-02 08:43:06,544 INFO >> [me.prettyprint.cassandra.connection.CassandraHostRetryService] - > Host >> detected as down was added to retry queue: >> cassandradevrk1(10.130.202.34):9393 >> 2011-08-02 08:43:06,544 WARN >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not >> fullfill request on this host CassandraClient< >> cassandradevrk1:9393-33> >> 2011-08-02 08:43:06,544 WARN >> [me.prettyprint.cassandra.connection.HConnectionManager] - Exception: >> me.prettyprint.hector.api.exceptions.HectorTransportException: >> org.apache.thrift.transport.TTransportException: > java.net.SocketException: >> Connection reset >> >> >> Then it appears to try the second node and fails: >> >> 2011-08-02 08:43:06,556 INFO >> [me.prettyprint.cassandra.connection.HConnectionManager] - Client >> CassandraClient<cassandradevrk1:9393-33> released to inactive or dead > pool. >> Closing. >> 2011-08-02 08:43:06,557 ERROR >> [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush >> transport (to be expected if the pool is shutting down) in close for > client: >> CassandraClient<cassandradevrk2:9393-49> >> >> org.apache.thrift.transport.TTransportException: > java.net.SocketException: >> Broken pipe >> >> 2011-08-02 08:43:06,558 ERROR >> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST > AS DOWN >> TRIGGERED for host cassandradevrk2(10.130.202.35):9393 >> 2011-08-02 08:43:06,559 ERROR >> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool state > on >> shutdown: >> > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393}; >> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted: > 49 >> 2011-08-02 08:43:06,559 ERROR >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > Shutdown >> triggered on >> > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393} >> 2011-08-02 08:43:06,559 ERROR >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > Shutdown >> complete on >> > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393} >> 2011-08-02 08:43:06,559 INFO >> [me.prettyprint.cassandra.connection.CassandraHostRetryService] - > Host >> detected as down was added to retry queue: >> cassandradevrk2(10.130.202.35):9393 >> 2011-08-02 08:43:06,560 WARN >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not >> fullfill request on this host > CassandraClient<cassandradevrk2:9393-49> >> 2011-08-02 08:43:06,560 WARN >> [me.prettyprint.cassandra.connection.HConnectionManager] - Exception: >> me.prettyprint.hector.api.exceptions.HectorTransportException: >> org.apache.thrift.transport.TTransportException: > java.net.SocketException: >> Connection reset > > > > Very interesting. After the second host goes down do you see > "me.prettyprint.hector.api.exceptions.HectorException: All host pools > marked down. Retry burden pushed out to client"? > > Does your client recover after a period of time? > >> >> The process is the same at 10:21. >> Are the exceptions related to any external events (e.g. node >> restarts, network issues...)? >> Not that I'm aware, unless there are firewall timeouts between the >> application and the node servers. Let me find out. The cassandra log > files >> have no errors reported. >> What versions of Hector and Cassandra are you running? >> Cassandra 0.8.1, Hector 0.8.0-1 > > > Our issue is occurring with Cassandra 0.7.8 and Hector 0.7-30. We plan > to deploy Hector 0.7-31 this week and to turn on useSocketKeepalive. > Are you using that? We're also using tcpdump to capture packets when > failures occur to see if there are anomalies in the network traffic. > > Jim > > > >> >> >> >> On Tue, Aug 2, 2011 at 10:37 AM, Jim Ancona <j...@anconafamily.com> > wrote: >>> >>> On Tue, Aug 2, 2011 at 4:36 PM, Anthony Ikeda >>> <anthony.ikeda....@gmail.com> wrote: >>>> I'm not sure if this is a problem with Hector or with Cassandra. >>>> We seem to be seeing broken pipe issues with our connections on > the >>>> client >>>> side (Exception below). A bit of googling finds possibly a problem > with >>>> the >>>> amount of data we are trying to store, although I'm certain our > datasets >>>> are >>>> not all that large. >>> >>> I'm not sure what you're referring to here. Large requests could > lead >>> to timeouts, but that's not what you're seeing here. Could you link > to >>> the page you're referencing? >>> >>>> A nodetool ring command doesn't seem to present any downed nodes: >>>> Address DC Rack Status State Load >>>> Owns >>>> Token >>>> >>>> 153951716904446304929228999025275230571 >>>> 10.130.202.34 datacenter1 rack1 Up Normal 470.74 KB >>>> 79.19% 118538200848404459763384037192174096102 >>>> 10.130.202.35 datacenter1 rack1 Up Normal 483.63 KB >>>> 20.81% 153951716904446304929228999025275230571 >>>> >>>> There are no errors in the cassandra server logs. >>>> >>>> Are there any particular timeouts on connections that we need to > be >>>> aware >>>> of? Or perhaps configure on the Cassandra nodes? Is this purely > and >>>> issue >>>> with the Hector API configuration? >>> >>> There is a server side timeout (rpc_timeout_in_ms in cassandra.yaml) >>> and a Hector client-side timeout >>> (CassandraHostConfigurator.cassandraThriftSocketTimeout). But again, >>> the "Broken pipe" error is not a timeout, it indicates that > something >>> happened to the underlying network socket. For example you will see >>> those when a server node is restarted. >>> >>> Some questions that might help troubleshoot this: >>> How often are these occurring? >>> Does this affect both nodes in the cluster or just one? >>> Are the exceptions related to any external events (e.g. node > restarts, >>> network issues...)? >>> What versions of Hector and Cassandra are you running? >>> >>> Keep in mind that failures like this will normally be retried by >>> Hector, resulting in no loss of data. For that reason, I think that >>> exception is logged as a warning in the newest Hector versions. >>> >>> We've seen something similar, but more catastrophic because it > affects >>> connectivity to the entire cluster, not just a single node. See this >>> post for more details: http://goo.gl/hrgkw So far we haven't >>> identified the cause. >>> >>> Jim >>> >>>> Anthony >>>> >>>> 2011-08-02 08:43:06,541 ERROR >>>> [me.prettyprint.cassandra.connection.HThriftClient] - Could not > flush >>>> transport (to be expected if the pool is shutting down) in close > for >>>> client: >>>> CassandraClient<cassandradevrk1:9393-33> >>>> org.apache.thrift.transport.TTransportException: >>>> java.net.SocketException: >>>> Broken pipe >>>> at >>>> >>>> > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147) >>>> at >>>> >>>> > org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156) >>>> at >>>> >>>> > me.prettyprint.cassandra.connection.HThriftClient.close(HThriftClient.java:85) >>>> at >>>> >>>> > me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232) >>>> at >>>> >>>> > me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131) >>>> at >>>> >>>> > me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:289) >>>> at >>>> >>>> > me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53) >>>> at >>>> >>>> > me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49) >>>> at >>>> >>>> > me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20) >>>> at >>>> >>>> > me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85) >>>> at >>>> >>>> > me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48) >>>> at >>>> >>>> > com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.findRegistryByProfileId(StandardFindRegistryPersistenceStrategy.java:237) >>>> at >>>> >>>> > com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.execute(StandardFindRegistryPersistenceStrategy.java:277) >>>> at >>>> >>>> > com.wsgc.services.registry.registryservice.impl.service.StandardRegistryService.getRegistriesByProfileId(StandardRegistryService.java:327) >>>> at >>>> >>>> > com.wsgc.services.registry.webapp.impl.RegistryServicesController.getRegistriesByProfileId(RegistryServicesController.java:247) >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) >>>> at >>>> >>>> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >>>> at >>>> >>>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>> at >>>> >>>> > org.springframework.web.bind.annotation.support.HandlerMethodInvoker.invokeHandlerMethod(HandlerMethodInvoker.java:175) >>>> at >>>> >>>> > org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.invokeHandlerMethod(AnnotationMethodHandlerAdapter.java:421) >>>> at >>>> >>>> > org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.handle(AnnotationMethodHandlerAdapter.java:409) >>>> at >>>> >>>> > org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:774) >>>> at >>>> >>>> > org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:719) >>>> at >>>> >>>> > org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:644) >>>> at >>>> >>>> > org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:549) >>>> at > javax.servlet.http.HttpServlet.service(HttpServlet.java:617) >>>> at > javax.servlet.http.HttpServlet.service(HttpServlet.java:717) >>>> at >>>> >>>> > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290) >>>> at >>>> >>>> > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) >>>> at >>>> >>>> > org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:77) >>>> at >>>> >>>> > org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:76) >>>> at >>>> >>>> > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) >>>> at >>>> >>>> > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) >>>> at >>>> >>>> > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) >>>> at >>>> >>>> > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) >>>> at >>>> >>>> > org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:563) >>>> at >>>> >>>> > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) >>>> at >>>> >>>> > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) >>>> at >>>> >>>> > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) >>>> at >>>> >>>> > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) >>>> at >>>> > org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190) >>>> at >>>> > org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291) >>>> at >>>> org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:774) >>>> at >>>> >>>> > org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:703) >>>> at >>>> >>>> > org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:896) >>>> at >>>> >>>> > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690) >>>> at java.lang.Thread.run(Thread.java:662) >>>> Caused by: java.net.SocketException: Broken pipe >>>> at java.net.SocketOutputStream.socketWrite0(Native Method) >>>> at >>>> > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) >>>> at > java.net.SocketOutputStream.write(SocketOutputStream.java:136) >>>> at >>>> >>>> > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145) >>>> ... 47 more >>>> 2011-08-02 08:43:06,543 ERROR >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK > HOST AS >>>> DOWN >>>> TRIGGERED for host cassandradevrk1(10.130.202.34):9393 >>>> 2011-08-02 08:43:06,543 ERROR >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool > state on >>>> shutdown: >>>> >>>> > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}; >>>> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; > NumBeforeExhausted: 49 >>>> 2011-08-02 08:43:06,543 ERROR >>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > Shutdown >>>> triggered on >>>> >>>> > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393} >>>> 2011-08-02 08:43:06,544 ERROR >>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > Shutdown >>>> complete on >>>> >>>> > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393} >>>> 2011-08-02 08:43:06,544 INFO >>>> [me.prettyprint.cassandra.connection.CassandraHostRetryService] - > Host >>>> detected as down was added to retry queue: >>>> cassandradevrk1(10.130.202.34):9393 >>>> 2011-08-02 08:43:06,544 WARN >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - Could > not >>>> fullfill request on this host > CassandraClient<cassandradevrk1:9393-33> >>>> >> >> >