Thanks, Aaron, for your reply. Please see my comments inline.

On 24 Aug 2012, at 11:04, aaron morton wrote:

>> - we are running on production linux VMs (not ideal but this is out of our
>> hands)
> Is the VM doing anything wacky with the IO ?

Could be, but we thought we would ask here first. This is a bit difficult to
prove because we don't have control over these VMs.

>> As part of a DR exercise, we killed all 6 nodes in DC1,
> Nice disaster. Out of interest, what was the shutdown process ?

Brutally: kill -9.

>> We noticed that data that was written an hour before the exercise, around
>> the last memtables being flushed, was not found in DC1.
> To confirm, data was written to DC 1 at CL LOCAL_QUORUM before the DR
> exercise.
>
> Was the missing data written before or after the memtable flush ? I'm trying
> to understand if the data should have been in the commit log or the
> memtables.

The missing data was written after the last flush. This data was retrievable
before the DR exercise.

> Can you provide some more info on how you are detecting it is not found in DC
> 1?

We tried Hector with consistency level LOCAL_QUORUM and got either missing
columns or a missing whole row. We tried cassandra-cli on the DC1 nodes, with
the same result. However, once we ran the same query on DC2, C* must have done
a read repair, because that particular piece of data then appeared in DC1
again.

>> If we understand correctly, commit logs are being written first and then to
>> disk every 10s.
> Writes are put into a bounded queue and processed as fast as the IO can keep
> up. Every 10s a sync message is added to the queue. Note that the commit log
> segment may rotate at any time which requires a sync.
>
> A loss of data across all nodes in a DC seems odd. If you can provide some
> more information we may be able to help.

We are wondering whether the fsync of the commit log was working. We saw no
errors or warnings in the logs. Is there a way to verify this?
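
To make our understanding of periodic sync concrete, here is a toy Python
model. It is only a sketch of the behaviour we assumed, not Cassandra code: the
class PeriodicCommitLog, its methods, and the 5s write interval are all made up
for illustration, and it assumes that anything not yet synced is lost in a hard
crash.

```python
class PeriodicCommitLog:
    """Toy model of a periodically synced commit log (hypothetical, not Cassandra code)."""

    def __init__(self, sync_period_s=10.0):
        self.sync_period_s = sync_period_s
        self.last_sync_s = 0.0
        self.unsynced = []  # appended but not yet synced to disk
        self.durable = []   # survived a sync, so available for replay

    def sync(self, now_s):
        # Move everything appended so far into the durable set.
        self.durable.extend(self.unsynced)
        self.unsynced.clear()
        self.last_sync_s = now_s

    def append(self, now_s, mutation):
        # Sync first if a full period has elapsed since the last sync.
        if now_s - self.last_sync_s >= self.sync_period_s:
            self.sync(now_s)
        self.unsynced.append((now_s, mutation))

    def crash_and_replay(self):
        # Assumption: a hard crash discards whatever was never synced.
        lost = self.unsynced
        self.unsynced = []
        return list(self.durable), lost


log = PeriodicCommitLog(sync_period_s=10.0)
for t in range(0, 36, 5):  # one write every 5s, the last one at t=35s
    log.append(float(t), f"write@{t}")

replayed, lost = log.crash_and_replay()
print("replayed:", [m for _, m in replayed])
print("lost:", [m for _, m in lost])
```

In this model a crash at t=35s loses only the writes made since the last sync,
which is why we expected to lose at most about 10s of data rather than data
from an hour earlier.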

> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 24/08/2012, at 6:01 AM, rubbish me <rubbish...@googlemail.com> wrote:
>
>> Hi all
>>
>> First off, let's introduce the setup.
>>
>> - 6 x C* 1.1.2 in active DC (DC1), another 6 in another (DC2)
>> - keyspace's RF=3 in each DC
>> - Hector as client.
>> - client talks only to DC1 unless DC1 can't serve the request. In which case
>> talks only to DC2
>> - commit log was periodically synced with the default setting of 10s.
>> - consistency policy = LOCAL QUORUM for both read and write.
>> - we are running on production linux VMs (not ideal but this is out of our
>> hands)
>> -----
>> As part of a DR exercise, we killed all 6 nodes in DC1, hector starts
>> talking to DC2, all the data was still there, everything continued to work
>> perfectly.
>>
>> Then we brought all nodes, one by one, in DC1 up. We saw a message saying
>> all the commit logs were replayed. No errors reported. We didn't run repair
>> at this time.
>>
>> We noticed that data that was written an hour before the exercise, around
>> the last memtables being flushed, was not found in DC1.
>>
>> If we understand correctly, commit logs are being written first and then to
>> disk every 10s. At worst we lost the last 10s of data. What could be the
>> cause of this behaviour?
>>
>> With the blessing of C* we could recover all this data from DC2. But we
>> would like to understand why.
>>
>> Many thanks in advance.
>>
>> Amy