I did consider doubling down and replicating both Kafka and Cassandra to the secondary DC. It seemed a bit complicated (term used relatively), and I didn't want to think about the unlikely scenario of Cassandra writes getting across before the Kafka ones. Inserting everything in Kafka into Cassandra after a failure is easy. Removing everything from Cassandra that -isn't- in Kafka is not a problem I want to take a swing at if I don't have to.
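The replay step above ("inserting everything in Kafka into Cassandra after a failure is easy") can be sketched roughly as below. To keep it self-contained, the Kafka consumer and Cassandra session are modeled as plain in-memory objects; `replay_events`, the event shapes, and the `table` dict are all invented for illustration and are not from this thread. In a real deployment they would be something like kafka-python's `KafkaConsumer` reading from the earliest offset and the DataStax driver's `Session` issuing upserts.

```python
# Sketch of the "replay Kafka into Cassandra" recovery step. Kafka and
# Cassandra are replaced by in-memory stand-ins so the ordering argument
# is visible without a running cluster.

def replay_events(events, table):
    """Apply an ordered event stream to a key/value store.

    Because Cassandra writes are idempotent upserts, replaying the whole
    topic from the earliest offset is safe even if some events already
    reached the standby DC before the failure.
    """
    for event in events:  # Kafka preserves order within a partition
        if event["type"] == "create_user":
            table[event["user"]] = {"permissions": set()}
        elif event["type"] == "grant_permission":
            # Causal ordering holds by construction: the create event
            # precedes this one in the log, so it has already been applied.
            table[event["user"]]["permissions"].add(event["permission"])
    return table

# Replaying the contrived user/permission stream from the thread below:
log = [
    {"type": "create_user", "user": "phil"},
    {"type": "grant_permission", "user": "phil", "permission": "admin"},
]
state = replay_events(log, {})
print(state["phil"]["permissions"])
```

The point of the sketch is the one the thread makes: replay only ever moves the standby forward through causally consistent states, whereas reconciling Cassandra rows that have no corresponding Kafka record would require deletes.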
On Mon, Dec 14, 2015 at 4:02 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:

> Emit a message to a new kafka topic once the first write is persisted into
> cassandra with LOCAL_QUORUM (gives you low latency), then consume off of
> that topic to get higher-latency-but-causally-correct writes to the
> subsequent (disconnected) DR DC.
>
>
> From: Philip Persad
> Reply-To: "user@cassandra.apache.org"
> Date: Monday, December 14, 2015 at 3:37 PM
> To: Cassandra Users
> Subject: Re: Replicating Data Between Separate Data Centres
>
> Hi Jeff,
>
> You're dead on with that article. That is a very good explanation of the
> problem I'm facing. You're also right that, fascinating though that
> research is, letting it anywhere near my production data is not something
> I'd think about.
>
> Basically, I want EACH_QUORUM, but I'm not willing to pay for it. My
> system needs to be reasonably close to a real-time system (let's say a
> soft real-time system). Waiting for each write to make its way across a
> continent is not something I can live with (to say nothing of what
> happens if the WAN temporarily fails).
>
> I guess what I'm hearing is that the best way to create a clone of a
> Cassandra cluster in another DC is to snapshot and restore.
>
> Thanks!
>
> -Phil
>
> On Mon, Dec 14, 2015 at 3:18 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
>
>> There is research into causal consistency and Cassandra
>> (http://da-data.blogspot.com/2013/02/caring-about-causality-now-in-cassandra.html,
>> for example), though you'll note that it uses a fork
>> (https://github.com/wlloyd/eiger), which is unlikely to be something
>> you'd ever want to consider in production. Let's pretend it doesn't
>> exist, and won't in the near future.
>>
>> The typical approach here is to have multiple active datacenters and
>> EACH_QUORUM writes, which gives you the ability to survive a full DC
>> failure without impact.
>> This also solves your fail-back problem, because when the primary DC is
>> restored, you simply run a repair. What part of EACH_QUORUM is
>> insufficient for your needs? The failure scenarios when the WAN link
>> breaks and it impacts local writes?
>>
>> Short of that, your 'occasional snapshots and restore in case of
>> emergency' is going to be your next-best thing.
>>
>>
>> From: Philip Persad
>> Reply-To: "user@cassandra.apache.org"
>> Date: Monday, December 14, 2015 at 3:11 PM
>> To: Cassandra Users
>> Subject: Re: Replicating Data Between Separate Data Centres
>>
>> Hi Jim,
>>
>> Thanks for taking the time to answer. By Causal Consistency, what I mean
>> is that I need strict ordering of all related events which might have a
>> causal relationship. For example (albeit slightly contrived), if we are
>> recording an event stream, it is very important that the event creating
>> a user be visible before the event which assigns a permission to that
>> user. However, I don't care at all about the ordering of the creation of
>> two different users. This is what I mean by Causal Consistency.
>>
>> The reason why LOCAL_QUORUM replication does not work for me is that,
>> while I can get guarantees about the order in which writes will become
>> visible in the Primary DC, I cannot get those guarantees about the
>> Secondary DC. As a result (to use another slightly contrived example),
>> if a user is created and then takes an action shortly before the failure
>> of the Primary DC, there are four possible situations with respect to
>> what will be visible in the Secondary DC:
>>
>> 1) Both events are visible in the Secondary DC
>> 2) Neither event is visible in the Secondary DC
>> 3) The creation event is visible in the Secondary DC, but the action
>> event is not
>> 4) The action event is visible in the Secondary DC, but the creation
>> event is not
>>
>> States 1, 2, and 3 are all acceptable. State 4 is not.
>> However, if I understand Cassandra's asynchronous DC replication
>> correctly, I do not believe I get any guarantee that situation 4 will
>> not happen. Eventual Consistency promises to "eventually" settle into
>> State 1, but "eventually" does me very little good if Godzilla steps on
>> my Primary DC. I'm willing to accept loss of data which was created
>> close to a disaster (States 2 and 3), but I cannot accept the
>> inconsistent history of events in State 4.
>>
>> I have a mechanism outside of normal Cassandra replication which can
>> give me the consistency I need. My problem is effectively with setting
>> up a new recovery DC after the failure of the primary. How do I go about
>> getting all of my data into a new cluster?
>>
>> Thanks,
>>
>> -Phil
>>
>> On Mon, Dec 14, 2015 at 1:06 PM, Jim Ancona <j...@anconafamily.com> wrote:
>>
>>> Could you define what you mean by Causal Consistency and explain why
>>> you think you won't have it when using LOCAL_QUORUM? I ask because
>>> LOCAL_QUORUM and multiple data centers are the way many of us handle
>>> DR, so I'd like to understand why it doesn't work for you.
>>>
>>> I'm afraid I don't understand your scenario. Are you planning on
>>> building out a new recovery DC *after* the primary has failed, or
>>> keeping two DCs in sync so that you can switch over after a failure?
>>>
>>> Jim
>>>
>>> On Mon, Dec 14, 2015 at 2:59 PM, Philip Persad <philip.per...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm currently looking at Cassandra in the context of Disaster
>>>> Recovery. I have 2 Data Centres; one is the Primary and the other
>>>> acts as a Standby. There is a Cassandra cluster in each Data Centre.
>>>> For the time being I'm running Cassandra 2.0.9. Unfortunately, due to
>>>> the nature of my data, the consistency levels that I would get from
>>>> LOCAL_QUORUM writes followed by asynchronous replication to the
>>>> secondary data centre are insufficient.
>>>> In the event of a failure, it is acceptable to lose some data, but I
>>>> need Causal Consistency to be maintained. Since I don't have the
>>>> luxury of performing nodetool repairs after Godzilla steps on my
>>>> primary data centre, I use a more strictly ordered means of
>>>> transporting events between the Data Centres (Kafka, for anyone who
>>>> cares about that detail).
>>>>
>>>> What I'm not sure about is how to go about copying all the data in
>>>> one Cassandra cluster to a new cluster, either to bring up a new
>>>> Standby Data Centre or as part of failing back to the Primary after I
>>>> pick up the pieces. I'm thinking that I should either:
>>>>
>>>> 1. Do a snapshot
>>>> (https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_takes_snapshot_t.html)
>>>> and then restore that snapshot on my new cluster
>>>> (https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_snapshot_restore_new_cluster.html)
>>>>
>>>> 2. Join the new data centre to the existing cluster
>>>> (https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html).
>>>> Then separate the two data centres into two individual clusters by
>>>> doing . . . something???
>>>>
>>>> Does anyone have any advice about how to tackle this problem?
>>>>
>>>> Many thanks,
>>>>
>>>> -Phil