Your ring is wildly unbalanced, and you are almost certainly out of I/O
on one or more nodes. You should be monitoring via JMX and common
system tools so that you know when you are starting to have issues. It
is going to take some effort to get out of this situation now.
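
If you don't have that monitoring in place yet, even a small JMX poller run
from cron gives you early warning before a node fills up or flaps. A rough
sketch -- the JMX port (8080) and the MBean/attribute names below are
assumptions, so check what your build actually exposes with jconsole:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RingLoadCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical defaults; point this at each node in turn.
        String host = args.length > 0 ? args[0] : "192.168.202.1";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // Assumed MBean/attribute names -- verify them for your version.
            ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
            System.out.println(host + " load: " + mbs.getAttribute(ss, "LoadString"));
            ObjectName cm = new ObjectName("org.apache.cassandra.db:type=CompactionManager");
            System.out.println(host + " pending compactions: " + mbs.getAttribute(cm, "PendingTasks"));
        } finally {
            jmxc.close();
        }
    }
}

Pair that with iostat/vmstat on each box and you will usually see the I/O
saturation coming well in advance.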


b

On Mon, Sep 27, 2010 at 2:55 PM, Rana Aich <aichr...@gmail.com> wrote:
> Hi Peter,
> Thanks for your detailed query...
> I have an 8-machine cluster: KVSHIGH1,2,3,4 and KVSLOW1,2,3,4. As the names
> suggest, the KVSLOW boxes have low disk space (~350 GB), whereas the
> KVSHIGH boxes have 1.5 terabytes.
> Yet my nodetool shows the following:
> Address          Status   Load        Token                                    Ring
> 192.168.202.202  Down     319.94 GB   7200044730783885730400843868815072654   |<--|
> 192.168.202.4    Up       382.39 GB   23719654286404067863958492664769598669  |   ^
> 192.168.202.2    Up       106.81 GB   36701505058375526444137310055285336988  v   |
> 192.168.202.3    Up       149.81 GB   65098486053779167479528707238121707074  |   ^
> 192.168.202.201  Up       154.72 GB   79420606800360567885560534277526521273  v   |
> 192.168.202.204  Up       72.91 GB    85219217446418416293334453572116009608  |   ^
> 192.168.202.1    Up       29.78 GB    87632302962564279114105239858760976120  v   |
> 192.168.202.203  Up       9.35 GB     87790520647700936489181912967436646309  |-->|
> As you can see, one of our KVSLOW boxes is already down; it is 100% full.
> Meanwhile, a box with 1.5 terabytes (192.168.202.1) holds only 29.78 GB! I'm
> using RandomPartitioner. When I run the client program, the Cassandra daemon
> takes around 85-130% CPU.
> Regards,
> Rana
>
>
> On Mon, Sep 27, 2010 at 2:31 PM, Peter Schuller
> <peter.schul...@infidyne.com> wrote:
>>
>> > How can I handle this kind of situation?
>>
>> In terms of surviving the problem, a re-try on the client side might
>> help assuming the problem is temporary.
>>
>> However, the fact that you're seeing an issue to begin with is
>> certainly interesting, and the way to avoid it would depend on what the
>> problem is. My understanding is that the UnavailableException
>> indicates that the node you are talking to was unable to read
>> from/write to a sufficient number of nodes to satisfy your consistency
>> level. Presumably either because individual requests failed to return
>> in time, or because the node considers other nodes to be flat-out
>> down.
>>
>> Can you correlate these issues with server-side activity on the nodes,
>> such as background compaction, commitlog rotation or memtable
>> flushing? Do you see your nodes saying that other nodes in the cluster
>> are "DOWN" and "UP" (flapping)?
>>
>> How large is the data set in total (in terms of sstable size on disk),
>> and how much memory do you have in your machines (going to page
>> cache)?
>>
>> Have you observed the behavior of your nodes during compaction, in
>> particular whether compaction is CPU-bound or I/O-bound? (That tends
>> to depend on the data; generally, the larger the individual values,
>> the more disk-bound you will be.)
>>
>> Just trying to zero in on what the likely root cause is in this case.
>>
>> --
>> / Peter Schuller
>
>
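
To actually even out the ring above: with RandomPartitioner, a balanced ring
needs its tokens spaced evenly across the 2^127 token space, which the tokens
in your ring output clearly are not. A minimal sketch for computing evenly
spaced tokens to feed to nodetool move, one node at a time, followed by
nodetool cleanup:

import java.math.BigInteger;

public class BalancedTokens {
    // RandomPartitioner hashes keys onto the range [0, 2^127), so an evenly
    // loaded ring needs its N tokens spaced 2^127 / N apart.
    public static void main(String[] args) {
        int nodes = args.length > 0 ? Integer.parseInt(args[0]) : 8;
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        for (int i = 0; i < nodes; i++) {
            BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(nodes));
            System.out.println("node " + (i + 1) + ": " + token);
        }
    }
}

Keep in mind that evenly spaced tokens mean roughly equal data per node; with
350 GB boxes mixed in with 1.5 TB boxes you may want to bias the spacing so
the small boxes own smaller ranges instead.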

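On Peter's point about retrying client-side: a simple wrapper that retries on
UnavailableException/TimedOutException buys you some slack while you
rebalance, though it is only a band-aid. A rough sketch, assuming the
Thrift-generated exception classes (the package name varies by version):

import java.util.concurrent.Callable;

// Assumed package for the Thrift-generated exceptions; adjust for your version.
import org.apache.cassandra.thrift.TimedOutException;
import org.apache.cassandra.thrift.UnavailableException;

public final class RetryingCall {
    /** Runs a Thrift operation, retrying with a linear backoff when the
     *  coordinator reports it could not reach enough replicas. */
    public static <T> T withRetry(Callable<T> op, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (UnavailableException e) {
                if (attempt >= maxAttempts) throw e;
            } catch (TimedOutException e) {
                if (attempt >= maxAttempts) throw e;
            }
            Thread.sleep(1000L * attempt);  // back off before trying again
        }
    }
}

Wrap each get()/insert() in a Callable and pass it in; if even the retries
keep timing out, that's your cue that it's the cluster, not the client, that
needs fixing.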