It's great advice, but I'm still torn. I've never done multi-region work before, and I'd prefer to wait for 0.8 with built-in inter-node security, but I'm otherwise ready to roll (and need to roll) Cassandra out sooner than that.
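(A side note for anyone following along in the archives: the quorum arithmetic that comes up repeatedly below is just the following, sketched in Python for reference. The RF numbers are the ones from my proposed layout; nothing here is Cassandra code, just the math it uses.)

```python
# Quorum arithmetic from the thread below, sketched for reference.
# RF values match my proposed layout (us-east=2, us-west=1).

def quorum(rf):
    """Nodes that must ack a QUORUM read/write at replication factor rf."""
    return rf // 2 + 1

replication = {"us-east": 2, "us-west": 1}   # replicas per DC (my plan)

total_rf = sum(replication.values())          # 3 replicas cluster-wide
print(quorum(total_rf))                       # QUORUM across the cluster -> 2
print(quorum(replication["us-east"]))         # LOCAL_QUORUM in us-east -> 2
print(quorum(replication["us-west"]))         # LOCAL_QUORUM in us-west -> 1
```

Note that with only 2 local replicas the local quorum is still 2, which is Aaron's point below about a local RF of 2 reducing the utility of the DC: lose one node and LOCAL_QUORUM fails for that range.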
Given how well my system held up with a total single-AZ failure, I'm really leaning toward starting by treating AZs as DCs, and racks as... random? I don't think that part matters. My question for today is whether to just use the PropertyFileSnitch, or to roll my own version of Ec2Snitch that treats each AZ as a DC. Being single-region to start does increase my risk, so I was going to figure out how to push snapshots to S3. One question on that note: is it better to snapshot all nodes at roughly the same point in time, or is it better to do "rolling snapshots"?

will

On Wed, Apr 27, 2011 at 7:13 AM, aaron morton <aa...@thelastpickle.com> wrote:

> Using the EC2Snitch you could have one AZ in us-east-1 and one AZ in
> us-west-1, treat each AZ as a single rack and each region as a DC. The
> network topology is rack aware, so it will prefer requests that go to the
> same rack (not much of an issue when you have only one rack).
>
> If possible I would use the same RF in each DC if you want the fail-over
> to be as clean as possible (see the earlier comments about the number of
> failed nodes in a DC), i.e. 3 replicas in each DC/region.
>
> Until you find a reason otherwise, use LOCAL_QUORUM; if that proves to be
> too slow, or you get more experience and feel comfortable with the
> trade-offs, then change to a lower CL.
>
> Dropping the CL for write bursts does not make the cluster run any faster;
> it lets the client think the cluster is running faster, and can result in
> the client overloading (in a good "this is what it should do" way) the
> cluster. This can result in more "eventual consistency" work to be done
> later, during maintenance or read requests. If that is a reasonable
> trade-off, you can write at CL ONE and read at CL ALL to ensure you get
> consistent reads (QUORUM is not good enough in that case).
>
> Jump in and test it at QUORUM; you may find the write performance is good
> enough.
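(For my snapshot-to-S3 question above, the rough shape I had in mind is something like the sketch below. The host names, bucket, data path, and the use of `s3cmd` are all assumptions, and the commands are only echoed rather than executed here.)

```shell
#!/bin/sh
# Rough sketch of "rolling snapshots": snapshot one node at a time and push
# the result to S3. Host names, data path, bucket, and s3cmd are assumptions;
# the commands are echoed rather than run, so this is a dry-run outline only.
BUCKET="s3://my-cassandra-backups"     # hypothetical bucket name
NODES="cass1 cass2 cass3"              # hypothetical node host names
TAG="snap-$(date +%Y%m%d)"             # one snapshot name shared by all nodes

for NODE in $NODES; do
  # One node at a time ("rolling"), so only one node pays the
  # snapshot/upload I/O cost at any given moment.
  echo nodetool -h "$NODE" snapshot "$TAG"
  echo ssh "$NODE" "s3cmd sync /var/lib/cassandra/data/ $BUCKET/$NODE/$TAG/"
done
```

The trade-off versus snapshotting every node at once is that a rolling pass gives each node's snapshot a slightly different point in time, which is fine for an eventually consistent backup but not a globally consistent cut.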
> There are lots of dials to play with:
> http://wiki.apache.org/cassandra/MemtableThresholds
>
> Hope that helps.
> Aaron
>
> On 27 Apr 2011, at 09:31, William Oberman wrote:
>
> I see what you're saying. I was able to control write latency on MySQL
> using insert vs. insert delayed (what I feel is MySQL's poor man's eventual
> consistency option), plus the fact that replication was a background
> asynchronous process. In terms of read latency, I was able to do up to a
> few hundred well-indexed MySQL queries (across AZs) on a view while keeping
> the overall latency of the page at around a second or less.
>
> I'm basically replacing two use cases, the ones with write volumes I
> anticipate being difficult to scale. The first case was previously using
> insert delayed (which I'm doing in Cassandra as ONE), as I wasn't getting
> consistent write/read operations before anyway. The second case was using
> a traditional insert (which I was going to replace with some QUORUM-like
> level, I was assuming LOCAL_QUORUM). But the latter case uses a
> write-through memory cache (memcached), so I don't know how often it
> really reads data from the persistent store. I definitely need to make
> sure it is consistent, though.
>
> In any case, it sounds like I'd be best served treating AZs as DCs, but
> then I don't know what to make the racks. Or do racks not matter in a
> single AZ? That way I can get an ack from a LOCAL_QUORUM read/write before
> the (slightly) slower read/write to/from the other AZ (for redundancy).
> Then I'm only screwed if Amazon has a multi-AZ failure (so far, they've
> kept it to "only" one!) :-)
>
> will
>
> On Tue, Apr 26, 2011 at 5:01 PM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> One difference between Cassandra and MySQL replication may be when the
>> network IO happens. Was the MySQL replication synchronous on transaction
>> commit? I was only aware that it had async replication, which means the
>> client is not exposed to the network latency.
>> In Cassandra the network latency is exposed to the client, as it needs to
>> wait for the CL number of nodes to respond.
>>
>> If you use the PropertyFileSnitch with the NetworkTopologyStrategy you
>> can manually assign machines to racks/DCs based on IP; see the
>> conf/cassandra-topology.properties file. There is also an Ec2Snitch,
>> which (from the code):
>>
>> /**
>>  * A snitch that assumes an EC2 region is a DC and an EC2 availability_zone
>>  * is a rack. This information is available in the config for the node.
>>  */
>>
>> Recent discussion on DC-aware CL levels:
>> http://www.mail-archive.com/user@cassandra.apache.org/msg11414.html
>>
>> Hope that helps.
>> Aaron
>>
>> On 27 Apr 2011, at 01:18, William Oberman wrote:
>>
>> Thanks Aaron!
>>
>> Unless no one on this list uses EC2, there were a few minor troubles at
>> the end of last week through the weekend, which taught me a lot about
>> obscure failure modes in various applications I use :-) My original post
>> was trying to be more redundant than fast, which has been my overall goal
>> from even before moving to Cassandra (my downtime from the EC2 madness was
>> minimal, and due to my only single point of failure: the Amazon load
>> balancer). My secondary goal was to make moving to a second region easier,
>> but if that is causing problems I can drop the idea.
>>
>> I might be downplaying the cost of inter-AZ communication, but I've lived
>> with that for quite some time; for example, my current setup of MySQL in
>> master-master replication is split over zones, and my webservers live in
>> yet different zones. Maybe Cassandra is "chattier" than I'm used to?
>> (Again, I'm fairly new to Cassandra.)
>>
>> Based on that article, the discussion, and the recent EC2 issues, it
>> sounds like it would be better to start with:
>> - 6 nodes split across two AZs, 3/3
>> - Replication configured to put 2 replicas in one AZ and 1 in the other
>>   (NetworkTopology treats AZs as racks, so does RF=3, us-east=3 make this
>>   happen naturally?)
>> - What does LOCAL_QUORUM do in this case? Is there a "rack quorum"? Or do
>>   the natural latencies of AZs make LOCAL_QUORUM behave like a rack
>>   quorum?
>>
>> will
>>
>> On Tue, Apr 26, 2011 at 1:14 AM, aaron morton <aa...@thelastpickle.com> wrote:
>>
>>> For background, see this article:
>>> http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers
>>>
>>> And this recent discussion:
>>> http://www.mail-archive.com/user@cassandra.apache.org/msg12502.html
>>>
>>> Issues that may be a concern:
>>> - Lots of cross-AZ latency in us-east, e.g. LOCAL_QUORUM ops must wait
>>>   cross-AZ. Also consider it during maintenance tasks: how much of a
>>>   pain is it going to be to have latency between every node?
>>> - IMHO, not having sufficient (by which I mean 3) replicas in a
>>>   Cassandra DC to handle a single node failure when working at QUORUM
>>>   reduces the utility of the DC. E.g. with a local RF of 2 in the west,
>>>   the quorum is 2, and if you lose one node from the replica set you
>>>   will not be able to use LOCAL_QUORUM for keys in that range. Or
>>>   consider a failure mode where the west is disconnected from the east.
>>>
>>> Could you start simple, with 3 replicas in one AZ in us-east and 3
>>> replicas in an AZ+region? Then work through some failure scenarios.
>>>
>>> Hope that helps.
>>> Aaron
>>>
>>> On 22 Apr 2011, at 03:28, William Oberman wrote:
>>>
>>> Hi,
>>>
>>> My service is not yet ready to be fully multi-DC, due to how some of my
>>> legacy MySQL stuff works. But I wanted to get Cassandra going ASAP and
>>> work towards multi-DC. I have two main Cassandra use cases: one where I
>>> can handle eventual consistency (and all of the writes/reads are
>>> currently ONE), and one where I can't (writes/reads are currently
>>> QUORUM). My test cluster is currently 4 smalls, all in us-east, with
>>> RF=3 (more to prove I can do clustering than to have an exact production
>>> replica). All of my unit tests and "load tests" (again, not to prove
>>> true max load, but more to tease out concurrency issues) are passing now.
>>>
>>> For production, I was thinking of doing:
>>> - 4 Cassandra larges in us-east (where I am now), one in each AZ
>>> - 1 Cassandra large in us-west (where I have nothing)
>>>
>>> For now, my data can fit into a single large's 2-disk ephemeral storage
>>> using RAID0, and I was then thinking of doing RF=3 with us-east=2 and
>>> us-west=1. If I do eventual consistency at ONE, and consistency at
>>> LOCAL_QUORUM, I was hoping:
>>> - eventual-consistency ops would be really fast
>>> - consistent ops would be pretty fast (what does LOCAL_QUORUM do in this
>>>   case? return after 1 or 2 us-east nodes ack?)
>>> - us-west would contain a complete copy of my data, so it's a good,
>>>   eventually consistent, "close to real time" backup (assuming it can
>>>   keep up over long periods of time, but I think it should)
>>> - eventually, when I'm ready to roll out in us-west, I'll be able to
>>>   change the replication settings, and that server in us-west could help
>>>   seed new Cassandra instances faster than the ones in us-east
>>>
>>> Or am I missing something really fundamental about how Cassandra works,
>>> making this a terrible plan?
>>> I should have plenty of time to get my multi-DC working before the
>>> instance in us-west fills up (but even then, I should be able to add
>>> instances over there to stall that fairly trivially, right?).
>>>
>>> Thanks!
>>>
>>> will

--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue, First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com
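P.S. for anyone reading this in the archives: the DC-aware replication discussed above ends up as a keyspace definition along these lines in the 0.7-era cassandra-cli. The keyspace name is made up, and the exact strategy_options syntax varies by version, so check `help create keyspace;` in your CLI; this is a sketch, not a verified command.

```
create keyspace MyApp
    with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
    and strategy_options = [{us-east: 3, us-west: 3}];
```

With the Ec2Snitch the DC names are the region names, so `us-east: 3` spreads three replicas across the us-east AZs (racks); the 3/3 split reflects Aaron's suggestion of using the same RF in each DC so fail-over is clean.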