Fwiw, similar to another issue of stuck compaction that was on the list
several days ago, if I cleared out the hints, either by removing the files
while the node was down, or by running a scrub on system.hints during node
startup, I was able to get these compactions cleared, and the nodes are
starting to catch up on tasks that had been blocked.
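For anyone following along, the two approaches above would look roughly like this. This is just a sketch; the data path assumes a default 1.2 install (check data_file_directories in your cassandra.yaml before deleting anything), and your service name may differ:

```shell
# Option 1: remove the hint SSTables while the node is down.
# Path assumes the default data directory -- verify yours first.
sudo service cassandra stop
rm /var/lib/cassandra/data/system/hints/*.db
sudo service cassandra start

# Option 2: scrub the hints column family on the running node.
nodetool -h localhost scrub system hints
```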

Nate, there are definitely a number of things that could be hitting the
9160 port... but I was seeing the transport frame size error even between
nodes (and there was nothing running on any node other than C*). After
switching back to sync, we no longer get that error.
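For reference, the switch back is a one-line change in cassandra.yaml (in 1.2 the valid values are sync, hsha, and async), followed by a node restart:

```yaml
# cassandra.yaml -- use the thread-per-connection Thrift server
# instead of hsha
rpc_server_type: sync
```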


On Wed, Aug 7, 2013 at 2:58 PM, Nate McCall <zznat...@gmail.com> wrote:

> Is there anything else on the network that could be attempting to
> connect to 9160?
>
> That is the exact error you would get when someone initiates a
> connection and sends a null byte. You can reproduce it thusly:
> echo -n 'm' | nc localhost 9160
>
>
> On Wed, Aug 7, 2013 at 11:11 AM, David McNelis <dmcne...@gmail.com> wrote:
> > Nate,
> >
> > We had a node that was flaking on us last week and had a lot of handoffs
> > fail to that node.  We ended up decommissioning that node entirely.  I
> > can't find the actual error we were getting at the time (logs have been
> > rotated out), but currently we're not seeing any errors there.
> >
> > We haven't had any schema updates recently and we are using the sync rpc
> > server.  We had hsha turned on for a while, but we were getting a bunch
> > of transport frame size errors.
> >
> >
> > On Wed, Aug 7, 2013 at 1:55 PM, Nate McCall <zznat...@gmail.com> wrote:
> >>
> >> Thrift and ClientState are both unrelated to hints.
> >>
> >> What do you see in the logs after "Started hinted handoff for
> >> host:..." from HintedHandoffManager?
> >>
> >> It should either have an error message or something along the lines of
> >> "Finished hinted handoff of:..."
> >>
> >> Were there any schema updates that preceded this happening?
> >>
> >> As for the thrift stuff, which rpc_server_type are you using?
> >>
> >>
> >>
> >> On Wed, Aug 7, 2013 at 6:14 AM, David McNelis <dmcne...@gmail.com> wrote:
> >> > Morning folks,
> >> >
> >> > For the last couple of days all of my nodes (17, all running 1.2.8)
> >> > have been stuck at various percentages of completion for compacting
> >> > system.hints.
> >> > I've tried restarting the nodes (including a full rolling restart of
> >> > the cluster) to no avail.
> >> >
> >> > When I turn on Debugging I am seeing this error on all of the nodes
> >> > constantly:
> >> >
> >> > DEBUG 09:03:21,999 Thrift transport error occurred during processing of message.
> >> > org.apache.thrift.transport.TTransportException
> >> >         at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> >> >         at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> >> >         at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
> >> >         at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
> >> >         at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> >> >         at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
> >> >         at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
> >> >         at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
> >> >         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22)
> >> >         at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)
> >> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >         at java.lang.Thread.run(Thread.java:724)
> >> >
> >> >
> >> > When I turn on tracing, I see that shortly after this error there is
> >> > a message similar to:
> >> > TRACE 09:03:22,000 ClientState removed for socket addr /10.55.56.211:35431
> >> >
> >> > The IP in this message is sometimes a client machine, sometimes
> >> > another cassandra node with no processes other than C* running on it
> >> > (which I think rules out an issue with a particular client library
> >> > doing something funny with Thrift).
> >> >
> >> > While I wouldn't expect a Thrift issue to cause problems with
> >> > compaction, I'm out of other ideas at the moment.  Anyone have any
> >> > thoughts they could share?
> >> >
> >> > Thanks,
> >> > David
> >
> >
>
