Hi,

Lately my Cassandra cluster has been behaving strangely and running much slower than before. I spent a lot of time investigating, and today I found the cause.
Some bad guy kept writing large amounts of data day and night, which produced a very large row mutation of about 140 MB. During this period I restarted some Cassandra nodes, and when those nodes came back up they received hinted handoff messages. HintedHandOffManager.sendMessage() sends the row mutations to these nodes, but the mutation is too big to finish transferring within 8 seconds (defined in DatabaseDescriptor.getRpcTimeout()), so sendMessage() returns false when it gets a TimeoutException.

Every hour HintedHandOffManager checks the hinted handoff ColumnFamily and sends the big row mutation to the live nodes again. It fails again with the same TimeoutException, so the task never finishes and the big row mutation is sent over and over.

In a multi-datacenter deployment, a big row mutation cannot be transferred in a few seconds, so a big row mutation is a potential risk whenever one occurs.

Luke
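
The retry loop described above can be illustrated with a small model. This is only a sketch under stated assumptions, not Cassandra's actual code: the class HintReplaySketch and its methods are hypothetical, the 10 MB/s link speed is an assumed figure for a cross-datacenter connection, and a delivery attempt is reduced to comparing the estimated transfer time with the RPC timeout.

```java
// Hypothetical model of hint replay with a fixed RPC timeout.
// None of these names come from Cassandra itself.
class HintReplaySketch {

    /** Estimated time, in milliseconds, to push a mutation over a link. */
    public static long transferMillis(long mutationBytes, long bytesPerSecond) {
        return mutationBytes * 1000L / bytesPerSecond;
    }

    /** One delivery attempt: succeeds only if the transfer fits in the timeout. */
    public static boolean attemptDelivery(long mutationBytes, long bytesPerSecond,
                                          long rpcTimeoutMillis) {
        return transferMillis(mutationBytes, bytesPerSecond) <= rpcTimeoutMillis;
    }

    /**
     * Simulates up to maxRounds hourly replays of the same hint.
     * Returns the round in which delivery succeeded, or -1 if it never did.
     */
    public static int replay(long mutationBytes, long bytesPerSecond,
                             long rpcTimeoutMillis, int maxRounds) {
        for (int round = 1; round <= maxRounds; round++) {
            if (attemptDelivery(mutationBytes, bytesPerSecond, rpcTimeoutMillis)) {
                return round;
            }
            // The hint is kept and the same mutation is resent next hour.
        }
        return -1;
    }

    public static void main(String[] args) {
        long mutation = 140L * 1024 * 1024; // about 140 MB, as in the report
        long timeout = 8_000L;              // 8 second RPC timeout, as in the report
        long link = 10L * 1024 * 1024;      // ASSUMPTION: 10 MB/s link

        System.out.println(transferMillis(mutation, link));      // 14000
        System.out.println(replay(mutation, link, timeout, 24)); // -1
    }
}
```

In this model nothing changes between rounds, so a mutation that misses the timeout once misses it every hour. Only a faster link, a longer timeout, or a smaller mutation lets the replay finish.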