I have a system where we're currently using Postgres for all our data
storage needs, but on a large table the primary-key index checks are
really slowing down our inserts.  Cassandra sounds like a good
alternative (not saying Postgres and Cassandra are equivalent, just
that I think they're both reasonable fits for our particular
product), so I tried running the py_stress tool against a recent
checkout of the repository.  The code is recent enough that it no
longer pays attention to the keyspace definitions in cassandra.yaml,
so the caching settings are whatever py_stress defined when it
created the keyspace it uses.  I didn't change anything in
cassandra.yaml, but I did change cassandra.in.sh to use 2G of RAM
rather than 1G.  I then ran "python stress.py -o insert -n 1000000000"
(that's one billion).  I left it for a day, and when I came back
Cassandra had run out of RAM and stress.py had crashed somewhere
around 120,000,000 inserts.  This brings up a few questions:

- Is Cassandra's RAM use proportional to the number of values it's
storing?  I know it uses bloom filters to avoid lookups of
non-existent keys, but since bloom filters are designed to give an
accuracy/space tradeoff, Cassandra should sacrifice accuracy rather
than crash, if it really is the bloom filters that are using all the
RAM (a back-of-the-envelope size estimate is sketched after these
questions).

- When I start Cassandra again, it appears to go into an endless
read/write loop, using between 45% and 90% of my CPU.  It says it's
compacting tables, but it's been doing that for hours, and it only
has 70GB of data stored.  How can Cassandra be run on huge datasets
when 70GB appears to take forever to compact?
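
For what it's worth, here's the back-of-the-envelope calculation
behind the bloom filter question.  It just uses the textbook
optimal-sizing formula m = -n*ln(p)/(ln 2)^2, not whatever Cassandra
actually does internally, so treat it as a rough sense of scale
rather than a measurement:

    # Rough bloom filter size estimate using the standard sizing
    # formula -- NOT Cassandra's actual bookkeeping, just a ballpark.
    import math

    def bloom_filter_mib(n_keys, false_positive_rate):
        # Optimal number of bits: m = -n * ln(p) / (ln 2)^2
        bits = -n_keys * math.log(false_positive_rate) / (math.log(2) ** 2)
        return bits / 8.0 / (1024 ** 2)

    # ~120 million keys (roughly where my run died) at a 1% false
    # positive rate:
    print("%.0f MiB" % bloom_filter_mib(120000000, 0.01))  # ~137 MiB

If that formula is anywhere close to what Cassandra uses, the filters
for this run should only account for a couple hundred MB, so
presumably the bloom filters aren't the whole story.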

I assume I'm doing something wrong, but I don't see a ton of tunables
to play with.  Can anybody give me advice on how to make Cassandra
keep running under a high insert load?
