On 03/25/2011 10:12 AM, Jonathan Ellis wrote:
> On Fri, Mar 25, 2011 at 11:59 AM, ian douglas <i...@armorgames.com> wrote:
>> (we're running v0.60)
> I don't know if you could hear that from where you are, but our whole
> office just yelled, "WTF!" :)
Ah, that's what that noise was... And yeah, we know we're way behind.
Our initial delay in upgrading was waiting for 0.7 to come out and then
we learned we needed a whole new Thrift client for our PHP code base,
and then we got busy on other things, but we're at a point where we have
some time to take care of Cassandra and get it upgraded.
Our planned path, now, is:
(our nodes' tokens are numbered using the python code (0, 1/3 and 2/3
times 2^127), and called node 1 through 3, respectively; our RF is set
to 2 right now)
1. remove node 1 from our software
2. bring node 1 offline after a flush/repair/cleanup
3. run a cleanup on node 2 and then on node 3 so they have a full copy
of all data from the old node 1 and each other.
4. bring up a new Large 64-bit instance, install 0.6.12, assign a Token
value of 0 (node 1), RF:2, on a new gossip ring, and copy all data from
the 32-bit nodes 2 and 3 and run a repair/cleanup to remove any
duplicated data
5. remove node 3 from our software
6. point our code to the new 64-bit node 1
7. bring node 3 offline after a flush/repair/cleanup so node 2 has the
last fresh copy of everything
8. bring node 2 offline after a flush/repair/cleanup
9. bring up another Large instance, get a copy of all data from our old
node 2, assign a Token value of (1/2 * 2^127), RF:2, on the new gossip
ring, run a repair to remove duplicate data, and then a cleanup so it
gets replicated data from the new node 1
10. add the new node 2 to our software
11. run a final cleanup on the new node 1 and then on node 2 to make
sure all data is replicated evenly on both nodes
... at this point, we should have two 64-bit Large instances, with RF:2,
on a new gossip ring, replacing three 32-bit systems, with minimal down
time and no data loss (just a data delay between steps 6 and 10 above).
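For reference, here is a minimal sketch of the token arithmetic mentioned above, assuming the 2^127 token space we used for the existing ring. The function name initial_tokens is only illustrative and is not part of Cassandra. It spaces the tokens evenly, which gives 0, 1/3 and 2/3 times 2^127 for the old three-node ring, and 0 and 1/2 times 2^127 for the planned two-node ring.

```python
# Assumption: tokens live in a 2^127 space, as stated for our current ring.
RING_SIZE = 2 ** 127


def initial_tokens(node_count):
    """Return evenly spaced tokens: i * 2^127 / node_count for node i."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


old_ring = initial_tokens(3)  # the three 32-bit nodes
new_ring = initial_tokens(2)  # the two planned 64-bit Large instances

for name, ring in (("old 3-node ring", old_ring), ("new 2-node ring", new_ring)):
    print(name)
    for number, token in enumerate(ring, start=1):
        print("  node %d: %d" % (number, token))
```

Integer division is used so each token is a whole number that can be pasted into the node configuration as-is.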
Questions:
1. Does it appear that we've missed any steps, or are doing something out
of order?
2. Is the flush/repair/cleanup overkill when bringing the old nodes
offline, or is that the correct sequence to follow?
3. Will the difference in compute units (lower on Large instances than
Medium instances) make any noticeable difference, or will the fact that
the machine is 64-bit handle things efficiently enough such that a Large
instance works harder than a Medium instance? (never did figure out
how their compute units work)
4. Can we follow similar steps when we're ready to upgrade to 0.7.x and
have our new Thrift client for PHP all squared away?
Thanks again for the help!!!