Yes, some sort of data structure to coordinate this could reduce the problem as well. I made some comments on that at the end of CASSANDRA-2558.
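Roughly the kind of thing I have in mind (just a sketch with made-up names, nothing that exists in the code today): a shared pool that every compaction reserves its estimated output size from before starting, so N threads cannot each assume the full free space is theirs, and that a coordinator could consult when space gets tight:

import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of a global disk space reservation pool (hypothetical, not in Cassandra).
 * Compaction/flush threads reserve their estimated output size before starting,
 * so concurrent tasks cannot collectively over-commit the same free space.
 */
public class DiskSpaceReservations
{
    private final AtomicLong reserved = new AtomicLong(0);

    /** @return true if estimatedBytes could be reserved against the current free space */
    public boolean tryReserve(long estimatedBytes, long freeBytesOnDisk)
    {
        while (true)
        {
            long current = reserved.get();
            if (current + estimatedBytes > freeBytesOnDisk)
                return false; // not enough unreserved space left; caller should skip or split the compaction
            if (reserved.compareAndSet(current, current + estimatedBytes))
                return true;
        }
    }

    /** Must be called when the compaction finishes or is aborted. */
    public void release(long estimatedBytes)
    {
        reserved.addAndGet(-estimatedBytes);
    }
}

A compaction would call tryReserve() with its estimated output size before starting and release() in a finally block; the same numbers would give a coordinator something global to look at when deciding whether to abort one compaction rather than all of them.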
I believe a coordinator could be in place both to
- plan the start of compaction, and
- coordinate compaction thread shutdown and tmp file deletion before we
  completely run out of disk space.

Regards,
Terje

On Wed, May 4, 2011 at 10:09 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> Or we could "reserve" space when starting a compaction.
>
> On Wed, May 4, 2011 at 2:32 AM, Terje Marthinussen
> <tmarthinus...@gmail.com> wrote:
> > Partially, I guess this may be a side effect of multithreaded compactions?
> > Before running out of space completely, I do see a few of these:
> >  WARN [CompactionExecutor:448] 2011-05-02 01:08:10,480
> > CompactionManager.java (line 516) insufficient space to compact all
> > requested files
> > SSTableReader(path='/data/cassandra/JP_MALL_PH/Order-f-12858-Data.db'),
> > SSTableReader(path='/data/cassandra/JP_MALL_PH/Order-f-12851-Data.db'),
> > SSTableReader(path='/data/cassandra/JP_MALL_PH/Order-f-12864-Data.db')
> >  INFO [CompactionExecutor:448] 2011-05-02 01:08:10,481 StorageService.java
> > (line 2066) requesting GC to free disk space
> > In this case, there would be 24 threads that asked if there was empty
> > disk space. Most of them probably succeeded in that request, but in theory
> > they could have requested 24x the available space, since I do not think
> > there is any global pool of reserved disk space that tracks how much disk
> > will be needed by compactions that have already started.
> > Of course, regardless of how much checking is done in advance, we could
> > still run out of disk, so I guess there is also a need to check whether
> > disk space is about to run out while compaction runs, so things can be
> > halted/aborted. Unfortunately, that would need global coordination so we
> > do not stop all compaction threads at once...
> > After reducing to 6 compaction threads in 0.8 beta2, the data has
> > compacted just fine without any disk space issues. Another problem you may
> > hit as you accumulate a lot of sstables with updates (that is, duplicates)
> > to the same data is that massively concurrent compaction with nproc
> > threads can also concurrently duplicate all the duplicates on a large
> > scale.
> > Yes, this argues in favour of multithreaded compaction, as it should
> > normally help keep the number of sstables at a sane level and avoid such
> > problems, but it is unfortunately just a kludge for the real problem,
> > which is to segment the sstables somehow by keyspace so we can get the
> > disk requirements down and recover from scenarios where disk usage gets
> > above 50%.
> > Regards,
> > Terje
> >
> >
> > On Wed, May 4, 2011 at 3:33 PM, Terje Marthinussen
> > <tmarthinus...@gmail.com> wrote:
> >>
> >> Well, I just did not look at these logs very well at all last night.
> >> First out-of-disk message:
> >> ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> >> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> >> Thread[CompactionExecutor:387,1,main]
> >> java.io.IOException: No space left on device
> >> Then finally the last one:
> >> ERROR [FlushWriter:128] 2011-05-02 01:51:06,112
> >> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> >> Thread[FlushWriter:128,5,main]
> >> java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
> >> space to flush 554962 bytes
> >>         at
> >> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> >>         at
> >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>         at
> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>         at java.lang.Thread.run(Thread.java:662)
> >> Caused by: java.lang.RuntimeException: Insufficient disk space to flush
> >> 554962 bytes
> >>         at
> >> org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
> >>         at
> >> org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
> >>         at
> >> org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
> >>         at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
> >>         at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
> >>         at
> >> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> >>         ... 3 more
> >> INFO [CompactionExecutor:451] 2011-05-02 01:51:06,113 StorageService.java
> >> (line 2066) requesting GC to free disk space
> >> [lots of sstables deleted]
> >> After this it starts running again (although not fine, it seems).
> >> So the disk seems to have been full for 35 minutes due to un-deleted
> >> sstables.
> >> Terje
> >>
> >> On Wed, May 4, 2011 at 6:34 AM, Terje Marthinussen
> >> <tmarthinus...@gmail.com> wrote:
> >>>
> >>> Hm... peculiar.
> >>> Post flush is not involved in compactions, right?
> >>>
> >>> May 2nd:
> >>> 01:06 - Out of disk
> >>> 01:51 - Starts a mix of major and minor compactions on different column
> >>> families
> >>> It then starts a few extra minor compactions over the day, but given that
> >>> there are more than 1000 sstables and we are talking 3 minor compactions
> >>> started, that is not normal, I think.
> >>> May 3rd: 1 minor compaction started.
> >>> When I checked today, there was a bunch of tmp files on the disk with
> >>> last modify time from 01:something on May 2nd and 200GB of empty disk...
> >>> Definitely no compaction going on.
> >>> Guess I will add some debug logging and see if I get lucky and run out of
> >>> disk again.
> >>> Terje
> >>>
> >>> On Wed, May 4, 2011 at 5:06 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> >>>>
> >>>> Compaction does, but flush didn't until
> >>>> https://issues.apache.org/jira/browse/CASSANDRA-2404
> >>>>
> >>>> On Tue, May 3, 2011 at 2:26 PM, Terje Marthinussen
> >>>> <tmarthinus...@gmail.com> wrote:
> >>>> > Yes, I realize that.
> >>>> > I am a bit curious why it ran out of disk, or rather, why I have 200GB
> >>>> > of empty disk now, but unfortunately it seems like we may not have had
> >>>> > monitoring enabled on this node to tell me what happened in terms of
> >>>> > disk usage.
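For context on the CASSANDRA-2404 reference above: the pre-flush guard essentially amounts to comparing the memtable's estimated serialized size against the usable space on the data directories before creating the flush writer. A rough, simplified sketch, not the actual patch:

import java.io.File;
import java.io.IOException;

// Simplified sketch of a "check free space before flushing" guard.
// Not the actual CASSANDRA-2404 code; directory layout and size estimate are placeholders.
public class FlushSpaceCheck
{
    public static void ensureSpaceFor(long estimatedFlushBytes, File[] dataDirectories) throws IOException
    {
        long usable = 0;
        for (File dir : dataDirectories)
            usable += dir.getUsableSpace(); // free space available to this JVM on each data directory

        if (usable < estimatedFlushBytes)
            throw new IOException("Insufficient disk space to flush " + estimatedFlushBytes + " bytes");
    }
}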
> >>>> > I also thought that compaction was supposed to resume (try again
> >>>> > with less data) if it fails?
> >>>> > Terje
> >>>> >
> >>>> > On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis <jbel...@gmail.com>
> >>>> > wrote:
> >>>> >>
> >>>> >> post flusher is responsible for updating the commitlog header after
> >>>> >> a flush; each task waits for a specific flush to complete, then does
> >>>> >> its thing.
> >>>> >>
> >>>> >> so when you had a flush catastrophically fail, its corresponding
> >>>> >> post-flush task will be stuck.
> >>>> >>
> >>>> >> On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
> >>>> >> <tmarthinus...@gmail.com> wrote:
> >>>> >> > Just some very tiny amount of writes in the background here (some
> >>>> >> > hints spooled up on another node slowly coming in).
> >>>> >> > No new data.
> >>>> >> >
> >>>> >> > I thought there were no exceptions, but I did not look far enough
> >>>> >> > back in the log at first.
> >>>> >> > Going back a bit further now, however, I see that about 50 hours
> >>>> >> > ago:
> >>>> >> > ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> >>>> >> > AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> >>>> >> > Thread[CompactionExecutor:387,1,main]
> >>>> >> > java.io.IOException: No space left on device
> >>>> >> >         at java.io.RandomAccessFile.writeBytes(Native Method)
> >>>> >> >         at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
> >>>> >> >         at
> >>>> >> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>> >> >         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>> >> >         at
> >>>> >> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>> >> >         at
> >>>> >> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>> >> >         at java.lang.Thread.run(Thread.java:662)
> >>>> >> > [followed by a few more of those...]
> >>>> >> > and then a bunch of these:
> >>>> >> > ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
> >>>> >> > AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> >>>> >> > Thread[FlushWriter:123,5,main]
> >>>> >> > java.lang.RuntimeException: java.lang.RuntimeException: Insufficient
> >>>> >> > disk space to flush 40009184 bytes
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> >>>> >> >         at
> >>>> >> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>> >> >         at
> >>>> >> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>> >> >         at java.lang.Thread.run(Thread.java:662)
> >>>> >> > Caused by: java.lang.RuntimeException: Insufficient disk space to
> >>>> >> > flush 40009184 bytes
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
> >>>> >> >         at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
> >>>> >> >         at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
> >>>> >> >         at
> >>>> >> > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> >>>> >> >         ... 3 more
> >>>> >> > It seems like compactions stopped after this (a bunch of tmp tables
> >>>> >> > are still there from when those errors were generated), and I can
> >>>> >> > only suspect the post flusher may have stopped at the same time.
> >>>> >> > There is 890GB of disk for data; sstables are currently using 604GB
> >>>> >> > (139GB of which is old tmp tables from when it ran out of disk), and
> >>>> >> > "ring" tells me the load on the node is 313GB.
> >>>> >> > Terje
> >>>> >> >
> >>>> >> >
> >>>> >> > On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis <jbel...@gmail.com>
> >>>> >> > wrote:
> >>>> >> >>
> >>>> >> >> ... and are there any exceptions in the log?
> >>>> >> >>
> >>>> >> >> On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis <jbel...@gmail.com>
> >>>> >> >> wrote:
> >>>> >> >> > Does it resolve down to 0 eventually if you stop doing writes?
> >>>> >> >> >
> >>>> >> >> > On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
> >>>> >> >> > <tmarthinus...@gmail.com> wrote:
> >>>> >> >> >> Cassandra 0.8 beta trunk from about 1 week ago:
> >>>> >> >> >> Pool Name                    Active   Pending      Completed
> >>>> >> >> >> ReadStage                         0         0              5
> >>>> >> >> >> RequestResponseStage              0         0          87129
> >>>> >> >> >> MutationStage                     0         0         187298
> >>>> >> >> >> ReadRepairStage                   0         0              0
> >>>> >> >> >> ReplicateOnWriteStage             0         0              0
> >>>> >> >> >> GossipStage                       0         0        1353524
> >>>> >> >> >> AntiEntropyStage                  0         0              0
> >>>> >> >> >> MigrationStage                    0         0             10
> >>>> >> >> >> MemtablePostFlusher               1       190            108
> >>>> >> >> >> StreamStage                       0         0              0
> >>>> >> >> >> FlushWriter                       0         0            302
> >>>> >> >> >> FILEUTILS-DELETE-POOL             0         0             26
> >>>> >> >> >> MiscStage                         0         0              0
> >>>> >> >> >> FlushSorter                       0         0              0
> >>>> >> >> >> InternalResponseStage             0         0              0
> >>>> >> >> >> HintedHandoff                     1         4              7
> >>>> >> >> >>
> >>>> >> >> >> Anyone with nice theories about the pending value on the
> >>>> >> >> >> memtable post flusher?
> >>>> >> >> >> Regards,
> >>>> >> >> >> Terje
> >>>> >> >> >
> >>>> >> >> > --
> >>>> >> >> > Jonathan Ellis
> >>>> >> >> > Project Chair, Apache Cassandra
> >>>> >> >> > co-founder of DataStax, the source for professional Cassandra support
> >>>> >> >> > http://www.datastax.com
> >>>> >> >>
> >>>> >> >> --
> >>>> >> >> Jonathan Ellis
> >>>> >> >> Project Chair, Apache Cassandra
> >>>> >> >> co-founder of DataStax, the source for professional Cassandra support
> >>>> >> >> http://www.datastax.com
> >>>> >>
> >>>> >> --
> >>>> >> Jonathan Ellis
> >>>> >> Project Chair, Apache Cassandra
> >>>> >> co-founder of DataStax, the source for professional Cassandra support
> >>>> >> http://www.datastax.com
> >>>>
> >>>> --
> >>>> Jonathan Ellis
> >>>> Project Chair, Apache Cassandra
> >>>> co-founder of DataStax, the source for professional Cassandra support
> >>>> http://www.datastax.com
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
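To illustrate Jonathan's point above about why MemtablePostFlusher shows Active 1 / Pending 190 and stays there: the post-flush task only does its work after "its" flush signals completion, so a flush that dies with an exception never signals and every later post-flush task queues up behind it. A minimal, simplified model with made-up names, not the actual Cassandra code:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified model (hypothetical classes) of the flush / post-flush relationship:
// the post-flush task waits for a specific flush to complete before updating the
// commitlog header, so a flush that fails and never signals leaves the single-threaded
// post-flusher stuck, and later tasks pile up as "pending".
public class PostFlushModel
{
    static final ExecutorService flushWriter = Executors.newFixedThreadPool(4);    // ~FlushWriter pool
    static final ExecutorService postFlusher = Executors.newSingleThreadExecutor(); // ~MemtablePostFlusher

    static void flushMemtable(Runnable writeSSTable, Runnable updateCommitlogHeader)
    {
        CountDownLatch flushed = new CountDownLatch(1);

        flushWriter.submit(() -> {
            writeSSTable.run();   // may throw, e.g. "Insufficient disk space to flush"
            flushed.countDown();  // only reached on success
        });

        postFlusher.submit(() -> {
            try
            {
                flushed.await();              // stuck forever if the flush failed
                updateCommitlogHeader.run();
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        });
    }
}

With this shape, the post-flusher shows one active task blocked in await() and a growing pending count behind it, which matches the 1 active / 190 pending seen in the tpstats output above.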