Hi Jedd,

I'm using Cassandra on EC2 as well - so I'm quite interested.
Just to clarify your post - it sounds like you have 4 questions/issues:

1. Writes have slowed down significantly. What's the logical explanation, and what are the logical solutions/options?

2. You grew from 2 nodes to 4, but the original 2 nodes have 200 GB and the 2 new ones have 40 GB. What's the recommended practice for rebalancing (i.e., when should you do it), what's the actual procedure, and what's its expected impact?

3. Cassandra nodes "disappear". (I'm not quite clear what this means.)

4. You took a machine offline without decommissioning it from the cluster. Now the machine is gone, but the other nodes (in their Gossip logs) report that they are still looking for it. How do you stop nodes from looking for a removed node?

I'm not trying to put words in your mouth - but I want to make sure that I understand what you're asking about (because I have similar EC2-related thoughts). Let me know if this is an accurate summary.

Dave Viner

On Fri, Sep 17, 2010 at 7:41 AM, Jedd Rashbrooke <jedd.rashbro...@imagini.net> wrote:

> Howdi,
>
> I've just landed in an experiment to get Cassandra going, fed by PHP via Thrift via Hadoop, all running on EC2. I've been lurking a bit on the list for a couple of weeks, mostly reading any threads with the word 'performance' in them. Few people have anything polite to say about EC2, but I want to just throw out some observations and get some feedback on whether what I'm seeing is even approaching any kind of normal.
>
> My background is mostly *nix and networking, with a half-way decent understanding of DBs - but Cassandra, Hadoop, Thrift and EC2 are all fairly new to me.
>
> We're using a four-node, decently-specced (m2.2xlarge, if you're EC2-aware) cluster - 32GB, 4-core, if you're not :) I'm using Ubuntu with the Deb packages for Cassandra and Hadoop, and some fairly conservative tweaks to things like JVM memory (bumping it up to 4GB, then 16GB).
>
> One of our insert jobs - a mapper-only process - was running pretty fast a few days ago: somewhere around a million lines of input, split into a dozen files, inserted via a Hadoop job in about half an hour. Happy times. This was when the cluster was modestly sized - 20-50GB. It's now about 200GB, and performance has dropped by an order of magnitude - perhaps 5-6 hours to do the same amount of work, using the same codebase and the same input data.
>
> I've read that reads slow down as the DB grows, but had an expectation that writes would be consistently snappy. How surprising is this performance drop given the DB growth?
>
> My 4-node cluster started off as a 2-node one - and now nodetool ring suggests the two original nodes hold 200GB each and the newer two 40GB. Is this normal? Would a rebalance likely improve performance substantially? My feeling is that it would be expensive to perform.
>
> EC2 seems to get a bad rap, and we're feeling quite a bit of pain, which is sad given the (on paper) spec of the machines and the cost - over US$3k/month for the cluster. I've split the Cassandra commitlog, Cassandra data, Hadoop (HDFS) and tmp onto separate 'spindles'. Observations so far suggest late-'90s disk IO speed (15MB/s max sustained writes, one machine, one disk to another) and consistently inconsistent performance (an identical machine next to it, running the same task at the same time, was getting 28MB/s) over several hours.
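(An aside on the 'spindle' numbers above: a quick way to get a comparable sustained-write figure on each node, independently of Cassandra, is a plain dd against the data volume, watched with iostat from the sysstat package. The mount point below is just a guess - point it at whatever volume your data directory actually lives on.)

    # Write ~2GB and flush it to disk, so the figure reflects the volume
    # rather than the page cache; dd prints an MB/s number when it finishes.
    # /mnt/cassandra-data is a placeholder for your real data mount.
    dd if=/dev/zero of=/mnt/cassandra-data/ddtest bs=1M count=2048 conv=fdatasync
    rm /mnt/cassandra-data/ddtest

    # In another terminal, watch per-device throughput and utilisation.
    iostat -x 5

Running that on two 'identical' instances at the same time should show whether the 15 vs 28 MB/s gap comes from the instances themselves rather than anything in the Cassandra or Hadoop stack.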
>
> Cassandra nodes seem to disappear too easily - even with just one core (out of four) maxed out by a jsvc task and minimal disk or network activity, the machine feels very sluggish. Tailing the Cassandra logs hints that it's doing hinted handoffs and the occasional compaction. I've never seen this kind of behaviour before - and suspect it's more a feature of EC2.
>
> Gossip now seems to be lamenting the loss of an older machine (one that I stupidly took offline briefly - EC2 gave it a new IP address when it came back). There's nothing in storage-conf referring to the old address, and all 4 Cassandra daemons have been restarted several times since, but gossip occasionally (a day later) says that it is looking for it - and, more worryingly, that it is 'now part of the cluster'. I'm unsure if this is just an irritation or part of the underlying problem.
>
> What I'm going to do next is try importing some data into a local machine - it's just time-consuming to pull in our S3 data - and see if I can fake up to around the same capacity and watch for performance degradation.
>
> I'm also toying with the idea of going from 4 to 8 nodes, but I'm clueless on whether / how much this would help.
>
> As I say, though, I'm keen on anyone else's observations on my observations - I'm painfully aware that I'm juggling a lot of unknown factors at the moment.
>
> cheers,
> Jedd.
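For what it's worth, on the rebalancing and ghost-node questions (2 and 4 in my summary above), this is roughly the procedure I'd expect - a sketch only, assuming you're on the 0.6 series (a guess, based on the Debian packages), so please check `nodetool` with no arguments for the exact commands on your version. Host names and tokens below are placeholders.

    # Rebalancing: with RandomPartitioner, evenly spaced tokens for a
    # 4-node ring are i * (2**127 / 4) for i = 0..3. Move one node at a
    # time to its target token, then clear out the data it no longer owns.
    nodetool -h node1 ring
    nodetool -h node3 move 85070591730234615865843651857942052864
    nodetool -h node3 cleanup

    # Ghost node: ask any live node to drop the dead machine's token,
    # so gossip stops advertising it as part of the cluster.
    nodetool -h node1 ring                      # note the old node's token
    nodetool -h node1 removetoken <old-token>

Moving tokens streams a lot of data around, so your feeling that a rebalance would be expensive on 200GB nodes is probably right - doing it one node at a time, off-peak, seems prudent.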