> It's impossible to start new connections, or impossible to send requests, or 
> it just doesn't return anything when you've sent a request.
If it's totally frozen, it sounds like GC. How long does it freeze for?
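One quick way to check, assuming the stock packaged install with logs in
/var/log/cassandra:

    # long pauses show up as GCInspector lines in the system log
    grep -i GCInspector /var/log/cassandra/system.log | tail

    # or watch the JVM heap live, sampling once a second
    jstat -gcutil <cassandra-pid> 1000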

> Despite that, we occasionally get OOM exceptions, and nodes crashing, maybe a 
> few times per month. 
Do you have an error stack trace?

> We can't find anything in the cassandra logs indicating that something's up
Is it logging dropped messages or high thread pool pending counts? Are the 
freezes associated with compaction or repair running?
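nodetool will show both, e.g. against a local node:

    # pending / blocked counts per thread pool, plus dropped
    # message counts at the bottom of the output
    nodetool -h localhost tpstats

    # whether a compaction is running, and how far along it is
    nodetool -h localhost compactionstats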

>  and occasionally we do bulk deletion of supercolumns in a row.
mmm, are you sending a batch mutation with lots-o-deletes? Each row mutation 
(insert or delete) in the batch becomes a thread pool task. If you send 1,000 
rows in a batch you will temporarily prevent other requests from being served.
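If so, the usual fix is to cap the batch size on the client. A rough Python 
sketch; send_batch() is a hypothetical stand-in for whatever batch call your 
client library actually exposes:

    # Sketch only: break one huge delete into many small batches so
    # the mutation thread pool can interleave other requests.
    # send_batch() is a placeholder, not a real client API.
    def chunked(items, size=50):
        for i in range(0, len(items), size):
            yield items[i:i + size]

    def delete_supercolumns(row_key, supercolumn_names, send_batch):
        for chunk in chunked(list(supercolumn_names)):
            send_batch(row_key, deletions=chunk)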

> The config options we are unsure about are things like commit log sizes, ….
I would try to find some indication of what's going on before tweaking. Have 
you checked iostat?
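For example:

    # extended per-device stats every 5 seconds; high %util or
    # await during a freeze points at the disks
    iostat -x 5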

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 11/09/2012, at 2:05 AM, Henrik Schröder <skro...@gmail.com> wrote:

> Hi all,
> 
> We're running a small Cassandra cluster (v1.0.10) serving data to our web 
> application, and as our traffic grows, we're starting to see some weird 
> issues. The biggest of these is that sometimes, a single node becomes 
> unresponsive. It's impossible to start new connections, or impossible to send 
> requests, or it just doesn't return anything when you've sent a request. Our 
> client library is set to retry on another server when this happens, and what 
> we see then is that the request is usually served instantly. So it's not the 
> case that some requests are very difficult, it's that sometimes a node is 
> just "busy", and we have no idea why or what it's doing.
> 
> We're using MRTG and Monit to monitor the servers, and in MRTG the average 
> CPU usage is around 5%, on our quad-core Xeon servers with SSDs. But 
> occasionally through Monit we can see that the 1-min load average goes above 
> 4, and this usually corresponds to the above issues. Is this common? Does 
> this happen to everyone else? And why the spikiness in load? We can't find 
> anything in the cassandra logs indicating that something's up (such as a slow 
> GC or compaction), and there's no corresponding traffic spike in the 
> application either. Should we just add more nodes if any single one gets CPU 
> spikes?
> 
> Another explanation could also be that we've configured it wrong. We're 
> running pretty much default config. Each node has 16G of RAM, 4GB of heap, no 
> row-cache and a sizeable key-cache. Despite that, we occasionally get OOM 
> exceptions, and nodes crashing, maybe a few times per month. Should we 
> increase heap size? Or move to 1.1 and enable off-heap caching?
> 
> We have quite a lot of traffic to the cluster. A single keyspace with two 
> column families, RF=3, compression is enabled, and we're running the default 
> size-tiered compaction.
> Column family A has 60GB of actual data, each row has a single column, and 
> that column holds binary data that varies in size up to 500kB. When we update 
> a row, we write a new value to this single column, effectively replacing that 
> entire row. We do ~1000 updates/s, totalling ~10MB/s in writes.
> Column family B also has 60GB of actual data, but each row has a variable 
> (~100-10000) number of supercolumns, and each supercolumn has a fixed number 
> of columns with a fixed amount of data, totalling ~1kB. The operations we do 
> on this column family are adding supercolumns to rows at a rate of ~200/s, 
> and occasionally we do bulk deletion of supercolumns in a row.
> 
> The config options we are unsure about are things like commit log sizes, 
> memtable flushing thresholds, commit log syncing, compaction throughput, etc. 
> Are we at a point with our data size and write load that the defaults aren't 
> good enough anymore? Should we stick with size-tiered compaction, even though 
> our application is update-heavy?
> 
> 
> Many thanks,
> /Henrik
