> It's impossible to start new connections, or impossible to send requests, or it just doesn't return anything when you've sent a request.

If it's totally frozen it sounds like GC. How long does it freeze for?
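One rough way to check for that is to scan the Cassandra system.log for long pauses reported by GCInspector. The sketch below is only an illustration: it assumes the 1.0-era log line format ("GC for ConcurrentMarkSweep: 2721 ms for 1 collections, ...") and a default log path, so adjust both for your install.

```python
#!/usr/bin/env python
# Sketch: flag long GC pauses logged by Cassandra's GCInspector.
# Assumes 1.0-era lines like "GC for ParNew: 416 ms for 1 collections, ..."
# and the default log location; both are assumptions, adjust as needed.
import re
import sys

LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cassandra/system.log"
THRESHOLD_MS = 1000  # report any pause at or above this many milliseconds

gc_line = re.compile(r"GC for (\w+): (\d+) ms")

with open(LOG_PATH) as log:
    for line in log:
        match = gc_line.search(line)
        if match and int(match.group(2)) >= THRESHOLD_MS:
            print(line.rstrip())
```

If the freezes line up with pauses reported there, the heap settings are the place to look first.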
> Despite that, we occasionally get OOM exceptions, and nodes crashing, maybe a few times per month.

Do you have an error stack?

> We can't find anything in the cassandra logs indicating that something's up

Is it logging dropped messages or high TP pending? Are the freezes associated with compaction or repair running?

> and occasionally we do bulk deletion of supercolumns in a row.

mmm, are you sending a batch mutation with lots-o-deletes? Each row mutation (insert or delete) in the batch becomes a thread pool task. If you send 1,000 rows in a batch you will temporarily prevent other requests from being served. (See the sketch after the quoted message below for one way to chunk those deletes.)

> The config options we are unsure about are things like commit log sizes, ….

I would try to find some indication of what's going on before tweaking. Have you checked iostat?

Hope that helps.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 11/09/2012, at 2:05 AM, Henrik Schröder <skro...@gmail.com> wrote:

> Hi all,
>
> We're running a small Cassandra cluster (v1.0.10) serving data to our web application, and as our traffic grows, we're starting to see some weird issues. The biggest of these is that sometimes, a single node becomes unresponsive. It's impossible to start new connections, or impossible to send requests, or it just doesn't return anything when you've sent a request. Our client library is set to retry on another server when this happens, and what we see then is that the request is usually served instantly. So it's not the case that some requests are very difficult, it's that sometimes a node is just "busy", and we have no idea why or what it's doing.
>
> We're using MRTG and Monit to monitor the servers, and in MRTG the average CPU usage is around 5%, on our quad-core Xeon servers with SSDs. But occasionally through Monit we can see that the 1-min load average goes above 4, and this usually corresponds to the above issues. Is this common? Does this happen to everyone else? And why the spikiness in load? We can't find anything in the cassandra logs indicating that something's up (such as a slow GC or compaction), and there's no corresponding traffic spike in the application either. Should we just add more nodes if any single one gets CPU spikes?
>
> Another explanation could also be that we've configured it wrong. We're running pretty much the default config. Each node has 16GB of RAM, 4GB of heap, no row cache and a sizeable key cache. Despite that, we occasionally get OOM exceptions, and nodes crashing, maybe a few times per month. Should we increase the heap size? Or move to 1.1 and enable off-heap caching?
>
> We have quite a lot of traffic to the cluster. A single keyspace with two column families, RF=3, compression is enabled, and we're running the default size-tiered compaction.
> Column family A has 60GB of actual data, each row has a single column, and that column holds binary data that varies in size up to 500kB. When we update a row, we write a new value to this single column, effectively replacing that entire row. We do ~1000 updates/s, totalling ~10MB/s in writes.
> Column family B also has 60GB of actual data, but each row has a variable (~100-10000) number of supercolumns, and each supercolumn has a fixed number of columns with a fixed amount of data, totalling ~1kB. The operations we are doing on this column family are adding supercolumns to rows at a rate of ~200/s, and occasionally doing bulk deletion of supercolumns in a row.
>
> The config options we are unsure about are things like commit log sizes, memtable flushing thresholds, commit log syncing, compaction throughput, etc. Are we at a point with our data size and write load where the defaults aren't good enough anymore? Should we stick with size-tiered compaction, even though our application is update-heavy?
>
>
> Many thanks,
> /Henrik
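On the bulk supercolumn deletes mentioned above: one way to keep any single batch_mutate small is to remove the supercolumns from the row in chunks rather than in one huge mutation. The sketch below uses pycassa (a 1.0-era Python client); the keyspace, column family, and row key names are placeholders, so treat it as an illustration of the chunking idea rather than a drop-in fix.

```python
# Sketch: delete many supercolumns from one row in small chunks instead of
# one huge mutation. Uses pycassa; keyspace, column family, and row key
# names below are placeholders.
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool("MyKeyspace", ["cassandra-node1:9160"])
supercf = ColumnFamily(pool, "SuperCF")  # a super column family

def delete_supercolumns(row_key, supercolumn_names, chunk_size=50):
    """Remove the named supercolumns from row_key, chunk_size at a time."""
    for start in range(0, len(supercolumn_names), chunk_size):
        chunk = supercolumn_names[start:start + chunk_size]
        # For a super column family, 'columns' names the supercolumns
        # to remove from this row.
        supercf.remove(row_key, columns=chunk)

# Example: remove 1,000 supercolumns as 20 small mutations.
delete_supercolumns("some-row", ["sc%d" % i for i in range(1000)])
```

The chunk size is arbitrary; the point is simply that each mutation sent to the cluster stays small enough not to monopolise the mutation thread pool.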