I have a system that currently uses Postgres for all of its data storage, but on one large table the primary-key index checks are really slowing down inserts. Cassandra sounds like a good alternative (not that Postgres and Cassandra are equivalent, just that both seem like reasonable fits for our particular product), so I tried running the py_stress tool against a recent checkout of the repository. The code is recent enough that it ignores the keyspace definitions in cassandra.yaml, so the cache settings are whatever py_stress defined when it created its keyspace.

I didn't change anything in cassandra.yaml, but I did change cassandra.in.sh to give the JVM 2G of RAM rather than 1G. I then ran "python stress.py -o insert -n 1000000000" (that's one billion). When I came back a day later, Cassandra had run out of RAM and stress.py had crashed somewhere around 120,000,000 inserts.
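For reference, the heap change in cassandra.in.sh was just the JVM max-heap flag, something like:

    # cassandra.in.sh: raised the JVM max heap from the stock -Xmx1G
    JVM_OPTS="$JVM_OPTS -Xmx2G"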
This brings up a few questions:

- Is Cassandra's RAM use proportional to the number of values it stores? I know it uses bloom filters to avoid lookups of non-existent keys, but since bloom filters are designed to trade accuracy for space, Cassandra should sacrifice accuracy rather than crash, if bloom filters are what's using all the RAM (rough sizing math at the end of this post).
- When I start Cassandra again, it goes into what looks like an endless read/write loop, using between 45% and 90% of my CPU. It says it's compacting tables, but it's been doing that for hours, and it only has 70GB of data stored. How can Cassandra run on huge datasets when 70GB appears to take forever to compact?

I assume I'm doing something wrong, but I don't see many tunables to play with. Can anybody give me advice on how to keep Cassandra running under a high insert load?
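To put a number on the bloom filter question, here's the textbook sizing formula (my back-of-envelope math, not necessarily how Cassandra sizes its filters internally): an optimal filter needs m = -n * ln(p) / (ln 2)^2 bits to hold n keys at false-positive rate p.

    import math

    def bloom_filter_mb(n_keys, fp_rate):
        """Memory (MB) for an optimal bloom filter over n_keys."""
        bits = -n_keys * math.log(fp_rate) / math.log(2) ** 2
        return bits / 8 / 2 ** 20

    # The ~120 million rows I got through before the crash:
    print(bloom_filter_mb(120000000, 0.01))    # ~137 MB at a 1% error rate
    print(bloom_filter_mb(120000000, 0.0001))  # ~274 MB at 0.01%

If that math is right, even 120M keys should only cost a few hundred MB of filter, so loosening the false-positive rate looks like a cheap safety valve if the filters really are the problem.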