Thank you for the advice, I will try these settings. I am running defaults right now. The disk subsystem is one SATA disk for the commitlog and 4 SATA disks in RAID 0 for the data.

From your email you are implying this hardware cannot handle this level of sustained writes? That kind of breaks down the commodity server concept for me. I have never used anything but 15k SAS disks (the fastest disks money could buy until SSD) with a database. I have tried to throw out that mentality here, but are you saying nothing has really changed? Spindles, spindles, spindles, as fast as you can afford, is what I have always known... I guess that applies here? Do I need to spend $10k per node instead of $3.5k to get SUSTAINED 10k writes/sec per node?
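A quick back-of-envelope on the 10k writes/sec question (the bytes-per-write figure below is an assumption, not something given in this thread):

    10,000 writes/sec x ~200 bytes/write  =  roughly 2 MB/s of sequential appends to the commitlog disk

The commitlog disk is not the hard part; a single SATA disk handles sequential appends at that rate easily. The pressure lands on the data volume, because every flushed memtable is later re-read and re-written by compaction, repeatedly, as sstables get merged again and again, so the sustained mixed read/write load on the RAID 0 set is some multiple of the raw ingest rate, on top of reads and repair. That multiple is the "iops and bandwidth" Benjamin points to in his message below.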
On Sat, Aug 21, 2010 at 11:03 PM, Benjamin Black <b...@b3k.us> wrote:
> My guess is that you have (at least) 2 problems right now:
>
> You are writing 10k ops/sec to each node, but have default memtable
> flush settings. This is resulting in memtable flushing every 30
> seconds (default ops flush setting is 300k). You thus have a
> proliferation of tiny sstables and are seeing minor compactions
> triggered every couple of minutes.
>
> You have started a major compaction which is now competing with those
> near constant minor compactions for far too little I/O (3 SATA drives
> in RAID0, perhaps?). Normally, this would result in a massive
> ballooning of your heap use as all sorts of activities (like memtable
> flushes) backed up, as well.
>
> I suggest you increase the memtable flush ops to at least 10 (million)
> if you are going to sustain that many writes/sec, along with an
> increase in the flush MB to match, based on your typical bytes/write
> op. Long term, this level of write activity demands a lot faster
> storage (iops and bandwidth).
>
>
> b
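For reference, the two knobs Benjamin is describing live in storage-conf.xml in the 0.6 series (the element names below are from memory and worth double-checking; the values are only illustrative and assume roughly 100 bytes per write op, a figure not given in this thread):

    <!-- flush after ~10 million ops or ~1 GB of memtable data, whichever comes first -->
    <MemtableOperationsInMillions>10</MemtableOperationsInMillions>
    <MemtableThroughputInMB>1024</MemtableThroughputInMB>

The 0.3 default (the 300k ops Benjamin mentions) is what produces a flush every 30 seconds at 10k writes/sec.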
>
> On Sat, Aug 21, 2010 at 2:18 AM, Wayne <wav...@gmail.com> wrote:
> > I am already running with those options. I thought maybe that is why they
> > never get completed, as they keep getting pushed down in priority? I am
> > getting timeouts now and then, but for the most part the cluster keeps
> > running. Is it normal/ok for the repair and compaction to take so long? It
> > has been over 12 hours since they were submitted.
> >
> > On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> >>
> >> yes, the AES is the repair.
> >>
> >> if you are running linux, try adding the options to reduce compaction
> >> priority from http://wiki.apache.org/cassandra/PerformanceTuning
> >>
> >> On Sat, Aug 21, 2010 at 3:17 AM, Wayne <wav...@gmail.com> wrote:
> >> > I could tell from munin that the disk utilization was getting crazy
> >> > high, but the strange thing is that it seemed to "stall". The
> >> > utilization went way down and everything seemed to flatten out.
> >> > Requests piled up and the node was doing nothing. It did not "crash"
> >> > but was left in a useless state. I do not have access to the tpstats
> >> > when that occurred. Attached is the munin chart, and you can see the
> >> > flat line after Friday at noon.
> >> >
> >> > I have reduced the writers from 10 per node to 8 per node and they
> >> > seem to still be running, but I am afraid they are barely hanging on.
> >> > I ran nodetool repair after rebooting the failed node and I do not
> >> > think the repair ever completed. I also later ran compact on each
> >> > node; on some it finished but on some it did not. Below is the tpstats
> >> > currently for the node I had to restart. Is the AE-SERVICE-STAGE the
> >> > repair and compaction queued up? It seems several nodes are not
> >> > getting enough free cycles to keep up. They are not timing out (30 sec
> >> > timeout) for the most part, but they are also not able to compact. Is
> >> > this normal? Do I just give it time? I am migrating 2-3 TB of data
> >> > from MySQL so the load is constant and will be for days, and it seems
> >> > even with only 8 writer processes per node I am maxed out.
> >> >
> >> > Thanks for the advice. Any more pointers would be greatly appreciated.
> >> >
> >> > Pool Name                    Active   Pending      Completed
> >> > FILEUTILS-DELETE-POOL             0         0           1868
> >> > STREAM-STAGE                      1         1              2
> >> > RESPONSE-STAGE                    0         2      769158645
> >> > ROW-READ-STAGE                    0         0         140942
> >> > LB-OPERATIONS                     0         0              0
> >> > MESSAGE-DESERIALIZER-POOL         1         0     1470221842
> >> > GMFD                              0         0         169712
> >> > LB-TARGET                         0         0              0
> >> > CONSISTENCY-MANAGER               0         0              0
> >> > ROW-MUTATION-STAGE                0         1      865124937
> >> > MESSAGE-STREAMING-POOL            0         0              6
> >> > LOAD-BALANCER-STAGE               0         0              0
> >> > FLUSH-SORTER-POOL                 0         0              0
> >> > MEMTABLE-POST-FLUSHER             0         0           8088
> >> > FLUSH-WRITER-POOL                 0         0           8088
> >> > AE-SERVICE-STAGE                  1        34             54
> >> > HINTED-HANDOFF-POOL               0         0              7
> >> >
> >> >
> >> > On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra <b...@dehora.net> wrote:
> >> >>
> >> >> On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
> >> >>
> >> >> > WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> >> >> > MessageDeserializationTask.java (line 47) dropping message
> >> >> > (1,078,378ms past timeout)
> >> >> > WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
> >> >> > MessageDeserializationTask.java (line 47) dropping message
> >> >> > (1,078,378ms past timeout)
> >> >>
> >> >> MESSAGE-DESERIALIZER-POOL usually backs up when other stages are
> >> >> bogged down downstream (e.g. here's Ben Black describing the symptom
> >> >> when the underlying cause is running out of disk bandwidth, well
> >> >> worth a watch: http://riptano.blip.tv/file/4012133/).
> >> >>
> >> >> Can you send all of nodetool tpstats?
> >> >>
> >> >> Bill
> >> >>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of Riptano, the source for professional Cassandra support
> >> http://riptano.com
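The compaction priority options Jonathan refers to above (the ones Wayne reports already running) are, going from memory of that PerformanceTuning wiki page, JVM flags added to the Cassandra startup options along these lines; treat it as a sketch and check the wiki page for the exact recommendation:

    # in bin/cassandra.in.sh (0.6-era layout)
    JVM_OPTS="$JVM_OPTS -XX:+UseThreadPriorities \
                        -XX:ThreadPriorityPolicy=42 \
                        -Dcassandra.compaction.priority=1"

These lower the priority of the compaction thread relative to the rest of the server so that flushes and live reads/writes are scheduled first.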