Thanks for your reply, Aaron. Please see my comments inline.

On 24 Aug 2012, at 11:04, aaron morton wrote:

>> - we are running on production linux VMs (not ideal but this is out of our 
>> hands)
> Is the VM doing anything wacky with the IO ?

Could be, but we thought we would ask here first. It is difficult to prove
because we don't have control over these VMs.

>  
> 
>> As part of a DR exercise, we killed all 6 nodes in DC1,
> Nice disaster. Out of interest, what was the shutdown process ?

Brutally. kill -9.


> 
>> We noticed that data that was written an hour before the exercise, around 
>> the last memtables being flushed, was not found in DC1. 
> To confirm, data was written to DC 1 at CL LOCAL_QUORUM before the DR 
> exercise. 
> 
> Was the missing data written before or after the memtable flush ? I'm trying 
> to understand if the data should have been in the commit log or the 
> memtables. 

The missing data was written after the last flush. It was retrievable before
the DR exercise.

> 
> Can you provide some more info on how you are detecting it is not found in DC 
> 1?
> 

We queried through Hector at consistency level LOCAL_QUORUM. Either individual
columns or whole rows were missing.

We also tried cassandra-cli on the DC1 nodes, with the same result.

However, once we ran the same query against DC2, C* must have done a read
repair, because that particular piece of data then appeared in DC1 again.


>> If we understand correctly, commit logs are being written first and then to 
>> disk every 10s. 
> Writes are put into a bounded queue and processed as fast as the IO can keep 
> up. Every 10s a sync message is added to the queue. Note that the commit log 
> segment may rotate at any time, which requires a sync. 
> 
> A loss of data across all nodes in a DC seems odd. If you can provide some 
> more information we may be able to help. 


We are wondering whether the fsync of the commit log was actually working, but
we saw no errors or warnings in the logs. Is there a way to verify this?
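
One rough probe we could run on the VMs, as a sketch only: time individual
fsync calls on the volume that holds the commit log. The function name
fsync_latencies and the record size are made up for illustration, and the
directory argument is assumed to be whatever commitlog_directory points at.
If the median comes back implausibly fast for the underlying storage, that
would hint (not prove) that something in the VM stack is acknowledging syncs
before the data is durable. It says nothing about what Cassandra itself does.

```python
import os
import sys
import tempfile
import time


def fsync_latencies(directory, writes=20, payload=b"x" * 4096):
    """Append `writes` records to a temp file in `directory`, calling
    fsync after each one. Returns the per-call fsync times in seconds."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # ask the OS to push the data to stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


if __name__ == "__main__":
    # Pass the commit log directory as the first argument;
    # falls back to the system temp directory if none is given.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    times = sorted(fsync_latencies(target))
    print("median fsync: %.3f ms" % (times[len(times) // 2] * 1000.0))
```

Fast results are ambiguous on their own, since SSDs and battery-backed
controllers can legitimately sync quickly, so this would only be a starting
point for a conversation with whoever runs the VMs.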


> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 24/08/2012, at 6:01 AM, rubbish me <rubbish...@googlemail.com> wrote:
> 
>> Hi all
>> 
>> First off, let's introduce the setup. 
>> 
>> - 6 x C* 1.1.2 in active DC (DC1), another 6 in another (DC2)
>> - keyspace's RF=3 in each DC
>> - Hector as client.
>> - client talks only to DC1 unless DC1 can't serve the request. In which case 
>> talks only to DC2
>> - commit log is synced periodically with the default setting of 10s. 
>> - consistency policy = LOCAL QUORUM for both read and write. 
>> - we are running on production linux VMs (not ideal but this is out of our 
>> hands)
>> -----
>> As part of a DR exercise, we killed all 6 nodes in DC1, hector starts 
>> talking to DC2, all the data was still there, everything continued to work 
>> perfectly. 
>> 
>> Then we brought all nodes, one by one, in DC1 up. We saw a message saying 
>> all the commit logs were replayed. No errors reported.  We didn't run repair 
>> at this time. 
>> 
>> We noticed that data that was written an hour before the exercise, around 
>> the last memtables being flushed, was not found in DC1. 
>> 
>> If we understand correctly, commit logs are being written first and then to 
>> disk every 10s. At worst we lost the last 10s of data. What could be the 
>> cause of this behaviour? 
>> 
>> With the blessing of C* we could recover all this data from DC2. But we 
>> would like to understand why. 
>> 
>> Many thanks in advance. 
>> 
>> Amy
>> 
>> 
> 
