On 03/25/2011 10:12 AM, Jonathan Ellis wrote:
> On Fri, Mar 25, 2011 at 11:59 AM, ian douglas <i...@armorgames.com> wrote:
>> (we're running v0.60)
> I don't know if you could hear that from where you are, but our whole
> office just yelled, "WTF!" :)

Ah, that's what that noise was... And yeah, we know we're way behind. Our initial delay in upgrading was waiting for 0.7 to come out. Then we learned we needed a whole new Thrift client for our PHP code base, and then we got busy with other things. We're now at a point where we have some time to take care of Cassandra and get it upgraded.

Our planned path now is:

(our nodes' tokens were computed with the Python code as 0, 1/3, and 2/3 times 2^127, and the nodes are called node 1 through 3, respectively; our RF is currently set to 2)

1. remove node 1 from our software
2. bring node 1 offline after a flush/repair/cleanup
3. run a cleanup on node 2 and then on node 3 so they have a full copy of all data from the old node 1 and each other.
4. bring up a new Large 64-bit instance, install 0.6.12, assign a Token value of 0 (node 1), RF:2, on a new gossip ring, and copy all data from the 32-bit nodes 2 and 3 and run a repair/cleanup to remove any duplicated data
5. remove node 3 from our software
6. point our code to the new 64-bit node 1
7. bring node 3 offline after a flush/repair/cleanup so node 2 has the last fresh copy of everything
8. bring node 2 offline after a flush/repair/cleanup
9. bring up another Large instance, get a copy of all data from our old node 2, assign a Token value of (1/2 * 2^127), RF:2, on the new gossip ring, run a repair to remove duplicate data, and then a cleanup so it gets replicated data from the new node 1
10. add the new node 2 to our software
11. run a final cleanup on the new node 1 and then on node 2 to make sure all data is replicated evenly on both nodes

... at this point, we should have two 64-bit Large instances, with RF:2, on a new gossip ring, replacing three 32-bit systems, with minimal down time and no data loss (just a data delay between steps 6 and 10 above).
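
For reference, here is a minimal sketch of how the token values mentioned above could be computed, for both the current three-node ring and the planned two-node ring. It assumes RandomPartitioner's token space of 2^127, which matches the numbers in the plan; the names RING_SIZE and balanced_tokens are mine and purely illustrative, not part of Cassandra.

```python
# Assumption: the cluster uses RandomPartitioner, whose token space is 0 .. 2**127.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count):
    """Return evenly spaced initial tokens for a ring of node_count nodes."""
    # Integer arithmetic avoids floating-point rounding on these very large values.
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    # 3 = the current 32-bit ring, 2 = the planned 64-bit ring.
    for node_count in (3, 2):
        print(f"{node_count} nodes:")
        for index, token in enumerate(balanced_tokens(node_count), start=1):
            print(f"  node {index}: {token}")
```

With two nodes this yields 0 for the new node 1 and 1/2 * 2^127 for the new node 2, the same values as in steps 4 and 9.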

Questions:
1. Does it appear that we've missed any steps, or are we doing something out of order?
2. Is the flush/repair/cleanup overkill when bringing the old nodes offline, or is that the correct sequence to follow?
3. Will the difference in compute units (lower on Large instances than Medium instances) make any noticeable difference, or will the fact that the machine is 64-bit handle things efficiently enough such that a Large instance works harder than a Medium instance? (never did figure out how their compute units work)
4. Can we follow similar steps when we're ready to upgrade to 0.7x and have our new Thrift client for PHP all squared away?


Thanks again for the help!!!
