On 11/12/2010 6:46 PM, Jonathan Ellis wrote:
> On Fri, Nov 12, 2010 at 3:19 PM, Chip Salzenberg <rev.c...@gmail.com> wrote:
>> After I rebooted my 0.7.0beta3+ cluster to increase threads (read=100
>> write=200 ... they're beefy machines), and put them under load again, I
>> find gossip reporting yoyo up-down-up-down status for the other nodes.
>>  Anyone know what this is a symptom of, and/or how to avoid it?
> It means "the system is too overloaded to process gossip data in a
> timely manner."  Usually this means GC storming but that does not look like
> the problem here.  Swapping is a less frequent offender.

The system is not overloaded in the sense of load average; but disk I/O
was and is heavy (write load then, repair now).  Two nodes are streaming
(because one is repairing), and there are some compactions, but the
cluster is almost idle otherwise.  Swapping could conceivably be a
factor; the JVM is 32G out of 72G, but the machine is 2.5G into swap
anyway.  I'm going to disable swap and see if the gossip issues resolve.
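
For anyone wanting to check the same thing on their own nodes, here is
a minimal sketch of reading swap usage out of /proc/meminfo-style text.
It assumes a Linux host; the function name and the sample values
(including the 8G swap size) are hypothetical, chosen only so the
result matches the 2.5G figure above.

```python
def swap_used_kib(meminfo_text):
    """Return swap in use, in KiB, from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


# Hypothetical sample: 72G of RAM, 8G of swap, 2.5G of it in use.
SAMPLE = """\
MemTotal:       75497472 kB
SwapTotal:       8388608 kB
SwapFree:        5767168 kB
"""

print(swap_used_kib(SAMPLE) / (1024 * 1024), "GiB of swap in use")
```

On a real machine the input would come from
open("/proc/meminfo").read() rather than the sample string.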

Perhaps 200 is a bit too high on the threads, despite the presence of
eight fast true cores plus hyperthreading?
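
One rough way to tell whether the thread count is costing anything is
to watch the kernel's context-switch counter before and after changing
it.  The sketch below assumes Linux, where /proc/stat carries a "ctxt"
line with the total number of context switches since boot; the function
name and the snapshot values are hypothetical.

```python
def context_switch_rate(stat_before, stat_after, seconds):
    """Context switches per second between two /proc/stat snapshots."""

    def ctxt(text):
        # The "ctxt" line holds the running total of context switches.
        for line in text.splitlines():
            if line.startswith("ctxt "):
                return int(line.split()[1])
        raise ValueError("no ctxt line found")

    return (ctxt(stat_after) - ctxt(stat_before)) / seconds


# Hypothetical snapshots taken 10 seconds apart.
BEFORE = "cpu  1 2 3 4\nctxt 1000\nbtime 1289600000\n"
AFTER = "cpu  5 6 7 8\nctxt 61000\nbtime 1289600000\n"

print(context_switch_rate(BEFORE, AFTER, 10))
```

Comparing the rate at the old and new thread settings under similar
load would show whether the extra threads make a measurable difference.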

>   Since you
> are seeing this after bumping to extremely high thread counts I would
> guess context switching might be a factor.
>
> What are tpstats?

I ran the thread count up because the number of pending mutate events
was very high -- that was what led to the dropped mutates, I assumed.
It did help; the tpstats are staying low now.  For example, the node
that's repairing has this:

Pool Name                    Active   Pending      Completed
ReadStage                         0         0              4
Request_responseStage             0         0      394392313
MutationStage                     0         0      422750725
ReadRepair                        0         0              0
GossipStage                       0         0         291951
AntientropyStage                  0         0              5
MigrationStage                    0         0              0
MemtablePostFlusher               0         0             61
StreamStage                       0         0              0
Internal_responseStage            0         0              0
FlushWriter                       0         0             61
FILEUTILS-DELETE-POOL             0         0            728
MiscStage                         0         0             14
FlushSorter                       0         0              0
HintedHandoff                     1         1             18

(The HintedHandoff numbers are nonzero on at least two nodes and are
not resolving, even though all nodes are up.  Odd, but probably harmless?)
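
To compare nodes without eyeballing each table, a small sketch like the
following could pick out the pools that still have active or pending
work.  It assumes the four-column layout shown above (name, active,
pending, completed); other Cassandra versions may print different
columns, and the function name is hypothetical.

```python
def busy_pools(tpstats_text):
    """Return {pool: (active, pending)} for pools with work in flight."""
    busy = {}
    for line in tpstats_text.splitlines():
        parts = line.split()
        # Data rows are: name, active, pending, completed.
        # The header row splits into five words, so it is skipped here.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        active, pending = int(parts[1]), int(parts[2])
        if active or pending:
            busy[parts[0]] = (active, pending)
    return busy


# A few rows copied from the output above.
SAMPLE = """\
Pool Name                    Active   Pending      Completed
MutationStage                     0         0      422750725
GossipStage                       0         0         291951
HintedHandoff                     1         1             18
"""

print(busy_pools(SAMPLE))
```

Run against each node's output, this would list only the stages worth
looking at, such as HintedHandoff here.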

>>  I haven't
>> seen any particular symptoms other than the log messages; and I suppose I'm
>> also dropping replication MUTATEs which had been happening already, anyway.
> I don't see any WARN lines about that, did you elide them?

No; this part of my message is badly written, sorry.  The dropped
MUTATES were the motivation for increasing the thread count, and are
gone AFAICT.
