Re: Working backwards from production to staging/dev

ian douglas Fri, 25 Mar 2011 11:12:33 -0700

On 03/25/2011 10:12 AM, Jonathan Ellis wrote:

On Fri, Mar 25, 2011 at 11:59 AM, ian douglas<i...@armorgames.com>  wrote:

(we're running v0.60)

I don't know if you could hear that from where you are, but our whole
office just yelled, "WTF!" :)

Ah, that's what that noise was... And yeah, we know we're way behind.
Our initial delay in upgrading was waiting for 0.7 to come out and then
we learned we needed a whole new Thrift client for our PHP code base,
and then we got busy on other things, but we're at a point where we have
some time to take care of Cassandra and get it upgraded.


 Our planned path, now, is:

(our nodes' tokens are numbered using the python code (0, 1/3 and 2/3
times 2^127), and called node 1 through 3, respectively; our RF is set
to 2 right now)
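For reference, the token arithmetic mentioned above can be sketched as follows. This is only an illustration, assuming the RandomPartitioner token space of 0 to 2^127; the function name `balanced_tokens` is made up for this example and is not part of Cassandra.

```python
# Sketch: evenly spaced initial tokens, assuming a 0..2**127 token space.
# `balanced_tokens` is a hypothetical helper name, not a Cassandra API.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count):
    """Return evenly spaced tokens for a ring of `node_count` nodes."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    # The current ring has three nodes; the planned ring has two.
    for count in (3, 2):
        print(f"{count}-node ring:")
        for number, token in enumerate(balanced_tokens(count), start=1):
            print(f"  node {number}: {token}")
```

For the planned two-node ring this gives 0 and 1/2 * 2^127, which matches the token values used in steps 4 and 9.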


1. remove node 1 from our software
2. bring node 1 offline after a flush/repair/cleanup

3. run a cleanup on node 2 and then on node 3 so they have a full copy
of all data from the old node 1 and each other.

4. bring up a new Large 64-bit instance, install 0.6.12, assign a Token
value of 0 (node 1), RF:2, on a new gossip ring, and copy all data from
the 32-bit nodes 2 and 3 and run a repair/cleanup to remove any
duplicated data

5. remove node 3 from our software
6. point our code to the new 64-bit node 1

7. bring node 3 offline after a flush/repair/cleanup so node 2 has the
last fresh copy of everything

8. bring node 2 offline after a flush/repair/cleanup

9. bring up another Large instance, get a copy of all data from our old
node 2, assign a Token value of (1/2 * 2^127), RF:2, on the new gossip
ring, run a repair to remove duplicate data, and then a cleanup so it
gets replicated data from the new node 1

10. add the new node 2 to our software

11. run a final cleanup on the new node 1 and then on node 2 to make
sure all data is replicated evenly on both nodes

... at this point, we should have two 64-bit Large instances, with RF:2,
on a new gossip ring, replacing three 32-bit systems, with minimal
downtime and no data loss (just a data delay between steps 6 and 10
above).
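To illustrate the end state, here is a rough sketch of which nodes would hold a given key on the old and new rings. It assumes simple ring-order placement (the node whose token is the first one at or after the key's token holds it, and the remaining replicas go to the next nodes clockwise); a rack-aware strategy would behave differently. The helper `replicas_for` and the node labels are made up for this example.

```python
# Sketch: replica placement on a token ring, assuming simple ring-order
# placement. `replicas_for` is a hypothetical helper, not a Cassandra API.
from bisect import bisect_left

RING_SIZE = 2 ** 127


def replicas_for(key_token, ring, rf):
    """Return the names of the nodes that would store `key_token`.

    `ring` maps a node name to its token; `rf` is the replication factor.
    """
    nodes = sorted(ring.items(), key=lambda item: item[1])
    tokens = [token for _, token in nodes]
    # First node whose token is >= the key's token, wrapping past the end.
    start = bisect_left(tokens, key_token) % len(nodes)
    count = min(rf, len(nodes))
    return [nodes[(start + i) % len(nodes)][0] for i in range(count)]


old_ring = {
    "node 1": 0,
    "node 2": RING_SIZE // 3,
    "node 3": 2 * RING_SIZE // 3,
}
new_ring = {"new node 1": 0, "new node 2": RING_SIZE // 2}

if __name__ == "__main__":
    sample_token = RING_SIZE // 4
    print("old ring:", replicas_for(sample_token, old_ring, rf=2))
    print("new ring:", replicas_for(sample_token, new_ring, rf=2))
```

Under that assumption, two nodes with RF:2 means every key lands on both nodes, so each new node ends up holding a full copy of the data.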


Questions:

1. Does it appear that we've missed any steps, or are doing something
out of order?

2. Is the flush/repair/cleanup overkill when bringing the old nodes
offline, or is that the correct sequence to follow?

3. Will the difference in compute units (lower on Large instances than
Medium instances) make any noticeable difference, or will the fact that
the machine is 64-bit handle things efficiently enough such that a Large
instance works harder than a Medium instance? (never did figure out how
their compute units work)

4. Can we follow similar steps when we're ready to upgrade to 0.7.x and
have our new Thrift client for PHP all squared away?



Thanks again for the help!!!
