As a side effect of the failed repair (so it seems), the disk usage on the
affected node prevents compaction from working. Compaction still works on
the remaining nodes (we have 3 total).
Is there a way to scrub the extraneous data?
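Is it something along these lines? (Just guessing here -- assuming
nodetool cleanup / compact / clearsnapshot are the right tools for
reclaiming the space; I'm not sure.)

  # Check what is actually eating the disk first.
  df -h
  nodetool -h localhost info

  # Drop data this node no longer owns, then force a major compaction
  # to merge/expire leftover SSTables (both can be slow and I/O heavy).
  nodetool -h localhost cleanup
  nodetool -h localhost compact

  # Clear any old snapshots that may be holding on to disk space.
  nodetool -h localhost clearsnapshot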
Thanks
Maxim
On 12/4/2011 4:29 PM, Peter Schuller wrote:
I will try to increase phi_convict -- I will just need to restart the
cluster after the edit, right?
You will need to restart the nodes for which you want the phi convict
threshold to be different. You might want to do it on, e.g., half of the
cluster to do A/B testing.
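For example, on each node you pick for the higher threshold (this assumes
Debian-style packaging, i.e. cassandra.yaml under /etc/cassandra and a
"cassandra" init service -- adjust for your install; 12 is just an example
value):

  # In /etc/cassandra/cassandra.yaml set (uncommenting the line if needed):
  #   phi_convict_threshold: 12
  grep phi_convict /etc/cassandra/cassandra.yaml

  # Flush and stop accepting traffic, then restart this node only so the
  # new value takes effect.
  nodetool -h localhost drain
  sudo service cassandra restart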
I do recall that I see nodes temporarily marked as down, only to pop up
later.
I recommend grepping through the logs on all the nodes (e.g., cat
/var/log/cassandra/cassandra.log | grep UP | wc -l). That should tell
you quickly whether they all seem to be seeing roughly as many node
flaps, or whether a particular node or set of nodes is over-represented.
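A quick way to compare counts across nodes, assuming you can ssh to each
one and the log lives at the path above (hostnames are placeholders):

  for h in node1 node2 node3; do
    echo -n "$h: "
    ssh "$h" "grep -c 'is now UP' /var/log/cassandra/cassandra.log"
  done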
Next, look at the actual nodes flapping (remove the wc -l) and see whether
all nodes are flapping, or whether it is a single node or a subset of the
nodes (e.g., nodes sharing a switch).
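Something like this gives a rough per-endpoint flap count (the awk field
assumes the usual 1.0-era Gossiper log lines of the form
"InetAddress /10.0.0.1 is now UP" / "... is now dead."):

  # Count UP/dead transitions per endpoint, most-flappy first.
  grep -E 'is now (UP|dead)' /var/log/cassandra/cassandra.log \
    | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn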
In the current situation, there is no load on the cluster at all, apart
from maintenance operations like the repair.
OK. So what I'm getting at is that there may be real, legitimate
connectivity problems that you aren't noticing in any other way, since
you don't have active traffic to the cluster.