A couple of timeouts should have kicked in. 

First, the rpc_timeout on the server side should have kicked in and given the 
client a (thrift) TimedOutException. Second, a client side socket timeout 
should be set so that the client times out the socket. Did either of these 
appear in the client side logs?
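
To illustrate the second point, here is a minimal sketch of what a client side socket timeout does, using only the plain JDK. It is not Hector or Thrift code; the class name SocketTimeoutDemo and the method readTimesOut are made up for the example, and the local ServerSocket that never answers is just a stand-in for a node that has stopped responding.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of Hector or Cassandra.
class SocketTimeoutDemo {

    // Returns true if a blocking read gave up after timeoutMs milliseconds.
    static boolean readTimesOut(int timeoutMs) throws IOException {
        // Stand-in for a stuck node: the listener lets the TCP connection
        // complete (via the backlog) but never sends a single byte back.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {

            // Without this call the read below would block indefinitely.
            client.setSoTimeout(timeoutMs);

            InputStream in = client.getInputStream();
            try {
                in.read();
                return false; // the peer answered or closed; no timeout
            } catch (SocketTimeoutException e) {
                return true;  // the read was abandoned after timeoutMs
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(500));
    }
}
```

If no socket timeout is set at all, a read like this simply blocks, which would look like a stuck client from the outside.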

In response to either of those, my guess is that Hector would cycle the 
connection. (I've not checked this.)

How did the disk fail? Was there anything in the server logs?

Some background on handling disk failures:
https://issues.apache.org/jira/browse/CASSANDRA-809

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 1 Aug 2011, at 08:13, Lior Golan wrote:

> In one of our test clusters we had a damaged commit log disk in one of the 
> nodes.
>  
> We have replication factor = 2 in this cluster, and write with consistency 
> level = ONE. So we expected writes will not be affected by such an issue. But 
> what actually happened is that the client that was writing with CL.ONE got 
> stuck. The client could resume writing when we stopped the server with the 
> faulty disk (so this is another indication it's not a replication factor or 
> consistency level issue).
>  
> We are running Cassandra 0.7.6, and the client we're using is Hector.
>  
> Can anyone explain what happened here? Why did the client get stuck when the 
> commit log disk on one of the servers was damaged (and why could it resume 
> writing once we actually took that server down)?
