Hello Aaron,

It's probably the over-optimistic number of concurrent compactors that was tripping the system.

I don't entirely understand the correlation here; maybe the compactors were overloading the neighboring nodes and causing timeouts. I tuned the concurrency down and after a while things seem
to have settled down. Thanks for the suggestion.

Maxim


On 4/19/2012 4:13 PM, aaron morton wrote:
1150 pending tasks, and is not making progress.
Not all pending tasks reported by nodetool compactionstats actually run. Once they get a chance to run the files they were going to work on may have already been compacted.

Given that repair tests at double the phi threshold, it may not make much difference.

Did other nodes notice it was dead? Was there anything in the log that showed it was under duress (GC or dropped message logs)?

Is the compaction a consequence of repair? (The streaming stage can result in compactions.) Or do you think the node is just behind on compactions?

If you feel compaction is hurting the node, consider setting concurrent_compactors in the yaml to 2.
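In case it helps, that change is just one line in cassandra.yaml (a sketch; the setting is commented out by default and Cassandra then picks a value based on the machine, so check your version's yaml for the exact default):

```yaml
# cassandra.yaml -- sketch of the change suggested above.
# Commented out by default; Cassandra then derives a value from the
# hardware. Pinning it low throttles how much compaction runs at once.
concurrent_compactors: 2
```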

You can also isolate the node from updates using nodetool disablegossip and disablethrift, and then turn off the IO limiter with nodetool setcompactionthroughput 0.
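For reference, that sequence might look like the sketch below. The NODETOOL dry-run default is my addition so the commands can be previewed without touching a live node; drop it (or set NODETOOL=nodetool) to run them for real on the affected node:

```shell
# Dry run by default: NODETOOL expands to "echo nodetool" unless overridden,
# so this prints the commands instead of executing them against a node.
NODETOOL="${NODETOOL:-echo nodetool}"

# Stop gossiping so the rest of the cluster marks this node down and stops
# sending it writes, and stop the Thrift interface so clients cannot connect.
$NODETOOL disablegossip
$NODETOOL disablethrift

# Turn off the IO limiter (0 = unthrottled) so pending compactions
# drain as fast as the disks allow.
$NODETOOL setcompactionthroughput 0
```

Once the node has caught up, nodetool enablegossip and enablethrift bring it back into service.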
Hope that helps.
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 20/04/2012, at 12:29 AM, Maxim Potekhin wrote:

Hello Aaron,

How should I go about fixing that? Also, after a repeated attempt to compact, it goes again into "building secondary index" with 1150 pending tasks and is not making progress. I suspected a disk system failure, but this needs to be confirmed.

So basically, do I need to tune the phi threshold up? The thing is, there was no heavy load
on the cluster at all.

Thanks

Maxim




On 4/19/2012 7:06 AM, aaron morton wrote:
At some point the gossip system on the node this log is from decided that 130.199.185.195 was DOWN. This was based on how often the node was gossiping to the cluster.

The active repair session was informed. To avoid failing the job unnecessarily, it tested whether the errant node's phi value was twice the configured phi_convict_threshold. It was, so the repair was killed.
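For anyone following along, phi_convict_threshold lives in cassandra.yaml; a sketch of raising it (8 is the shipped default, and the value 12 here is just an illustration, not a recommendation for this cluster):

```yaml
# cassandra.yaml -- failure detector sensitivity (sketch; default is 8).
# A higher value makes the cluster slower to convict a node as down,
# which can keep repair sessions alive through brief GC pauses or
# network hiccups, at the cost of reacting more slowly to real failures.
phi_convict_threshold: 12
```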

Take a look at the logs on 130.199.185.195 and see if anything was happening on the node at the same time. Could be GC or an overloaded node (it would log about dropped messages).

Perhaps other nodes also saw 130.199.185.195 as down? It only needed to be down for a few seconds.

Hope that helps.

-----------------
Aaron Morton

