Fwiw, similar to another stuck-compaction issue that was on the list several days ago: if I cleared out the hints, either by removing the files while the node was down or by running a scrub on system.hints during node startup, I was able to get these compactions cleared, and the nodes are starting to catch up on tasks that had been blocked.
Nate, there are definitely a number of things that could be hitting the 9160 port... but I was seeing the transport size error even between nodes (and there was nothing running on any node other than C*)... since switching back to sync I no longer get that error.

On Wed, Aug 7, 2013 at 2:58 PM, Nate McCall <zznat...@gmail.com> wrote:
> Is there anything else on the network that could be attempting to
> connect to 9160?
>
> That is the exact error you would get when someone initiates a
> connection and sends a null byte. You can reproduce it thusly:
> echo -n 'm' | nc localhost 9160
>
>
> On Wed, Aug 7, 2013 at 11:11 AM, David McNelis <dmcne...@gmail.com> wrote:
> > Nate,
> >
> > We had a node that was flaking on us last week and had a lot of handoffs
> > fail to that node. We ended up decommissioning that node entirely. I can't
> > find the actual error we were getting at the time (logs have been rotated
> > out), but currently we're not seeing any errors there.
> >
> > We haven't had any schema updates recently and we are using the sync rpc
> > server. We had hsha turned on for a while, but we were getting a bunch of
> > transport frame size errors.
> >
> >
> > On Wed, Aug 7, 2013 at 1:55 PM, Nate McCall <zznat...@gmail.com> wrote:
> >>
> >> Thrift and ClientState are both unrelated to hints.
> >>
> >> What do you see in the logs after "Started hinted handoff for
> >> host:..." from HintedHandoffManager?
> >>
> >> It should either have an error message or something along the lines of
> >> "Finished hinted handoff of:..."
> >>
> >> Were there any schema updates that preceded this happening?
> >>
> >> As for the thrift stuff, which rpc_server_type are you using?
> >>
> >>
> >>
> >> On Wed, Aug 7, 2013 at 6:14 AM, David McNelis <dmcne...@gmail.com> wrote:
> >> > Morning folks,
> >> >
> >> > For the last couple of days all of my nodes (17, all running 1.2.8) have
> >> > been stuck at various percentages of completion for compacting
> >> > system.hints.
> >> > I've tried restarting the nodes (including a full rolling restart of the
> >> > cluster) to no avail.
> >> >
> >> > When I turn on Debugging I am seeing this error on all of the nodes
> >> > constantly:
> >> >
> >> > DEBUG 09:03:21,999 Thrift transport error occurred during processing of message.
> >> > org.apache.thrift.transport.TTransportException
> >> >     at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> >> >     at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> >> >     at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
> >> >     at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
> >> >     at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> >> >     at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
> >> >     at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
> >> >     at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
> >> >     at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22)
> >> >     at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)
> >> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >     at java.lang.Thread.run(Thread.java:724)
> >> >
> >> >
> >> > When I turn on tracing, I see that shortly after this error there is a
> >> > message similar to:
> >> > TRACE 09:03:22,000 ClientState removed for socket addr
> >> > /10.55.56.211:35431
> >> >
> >> > The IP in this message is sometimes a client machine, sometimes another
> >> > cassandra node with no processes other than C* running on it (which I
> >> > think rules out an issue with a particular client library doing something
> >> > funny with Thrift).
> >> >
> >> > While I wouldn't expect a Thrift issue to cause problems with compaction,
> >> > I'm out of other ideas at the moment. Anyone have any thoughts they could
> >> > share?
> >> >
> >> > Thanks,
> >> > David
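
For anyone reading this thread later: the quoted trace fails inside TFramedTransport.readFrame, which is what you would expect if a peer closes the connection, or sends only a byte or two, before a full 4-byte frame-length header has arrived. Below is a rough Python sketch of that failure mode. It is not Thrift's or Cassandra's actual code; `TransportError`, `read_all`, `read_frame` and `demo` are hypothetical names made up for illustration, and it uses a local socket pair instead of a real server on port 9160.

```python
import socket
import struct


class TransportError(Exception):
    """Stand-in for Thrift's TTransportException (hypothetical name)."""


def read_all(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:  # empty read means the peer closed the connection
            raise TransportError(f"EOF after {len(buf)} of {n} bytes")
        buf += chunk
    return buf


def read_frame(sock):
    """Read one frame: a 4-byte big-endian length, then that many bytes."""
    (size,) = struct.unpack(">i", read_all(sock, 4))
    return read_all(sock, size)


def demo(raw_bytes):
    """Have a 'client' send raw bytes and hang up, then read one frame."""
    client, server = socket.socketpair()
    try:
        client.sendall(raw_bytes)
        client.close()
        return read_frame(server)
    finally:
        client.close()
        server.close()
```

Calling `demo(struct.pack(">i", 5) + b"hello")` returns the payload, while `demo(b"m")` (the same single byte as the nc one-liner in the quoted message) raises `TransportError` while still reading the length header. The same thing happens with `demo(b"")`, so a peer that merely connects and disconnects would look the same from the server side in this sketch.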