On 11/12/2010 6:46 PM, Jonathan Ellis wrote:
> On Fri, Nov 12, 2010 at 3:19 PM, Chip Salzenberg <rev.c...@gmail.com> wrote:
>> After I rebooted my 0.7.0beta3+ cluster to increase threads (read=100
>> write=200 ... they're beefy machines) and put them under load again, I
>> find gossip reporting yo-yo up-down-up-down status for the other nodes.
>> Anyone know what this is a symptom of, and/or how to avoid it?
> It means "the system is too overloaded to process gossip data in a
> timely manner." Usually this means GC storming, but that does not look
> like the problem here. Swapping is a less frequent offender.
The system is not overloaded in the sense of load average, but disk I/O was and is heavy (write load then, repair now). Two nodes are streaming (because one is repairing), and there are some compactions, but the cluster is otherwise almost idle. Swapping could conceivably be a factor: the JVM heap is 32G out of 72G, yet the machine is 2.5G into swap anyway. I'm going to disable swap and see whether the gossip issues resolve. Perhaps 200 is a bit too high on the threads, despite the presence of eight fast true cores plus hyperthreading?

> Since you are seeing this after bumping to extremely high thread counts
> I would guess context switching might be a factor.
>
> What are tpstats?

I ran the thread count up because the pending mutation count was very high -- that, I assumed, was what led to the dropped MUTATEs. It did help; the tpstats are staying low now. For example, the node that's repairing has this:

Pool Name                 Active   Pending   Completed
ReadStage                      0         0           4
Request_responseStage          0         0   394392313
MutationStage                  0         0   422750725
ReadRepair                     0         0           0
GossipStage                    0         0      291951
AntientropyStage               0         0           5
MigrationStage                 0         0           0
MemtablePostFlusher            0         0          61
StreamStage                    0         0           0
Internal_responseStage         0         0           0
FlushWriter                    0         0          61
FILEUTILS-DELETE-POOL          0         0         728
MiscStage                      0         0          14
FlushSorter                    0         0           0
HintedHandoff                  1         1          18

(The HintedHandoff numbers are nonzero on at least two nodes and are not resolving, even though all nodes are up. Odd, but probably harmless?)

>> I haven't seen any particular symptoms other than the log messages; and
>> I suppose I'm also dropping replication MUTATEs, which had been
>> happening already, anyway.
> I don't see any WARN lines about that, did you elide them?

No; this part of my message was badly written, sorry. The dropped MUTATEs were the motivation for increasing the thread count, and they are gone AFAICT.
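P.S. For anyone following along, this is roughly what I plan to run to take swap out of the picture. It's a sketch for a typical Linux box, not something I've verified on this cluster yet; the sysctl value and the JNA note are assumptions, and the destructive commands are commented out so you can inspect first.

```shell
# Inspect current memory and swap usage before changing anything
free -m
swapon -s

# Disable all swap devices (requires root) -- uncomment once you're sure:
# swapoff -a
# (Also remove/comment the swap entries in /etc/fstab to make it stick
#  across reboots.)

# Less drastic alternative: tell the kernel to avoid swapping
# application pages without disabling swap outright:
# sysctl -w vm.swappiness=0

# Note: with the JNA jar on its classpath, Cassandra can mlockall() its
# memory so the heap can't be swapped -- worth checking before going
# the swapoff route.
```

If disabling swap fixes the gossip flapping, that would confirm swap rather than context switching as the culprit.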