Another thing to keep in mind: if you are hitting the issue I described, waiting 60 seconds will not guarantee a fix; it will only make the problem less likely to occur. If a memtable is only partially flushed at the 60-second mark, you will end up with the same corrupt sstable.
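For what it's worth, here is a rough, untested sketch of the alternative to a fixed sleep that comes up further down the thread: quiesce the file system around the EBS snapshot instead of guessing at a delay. It assumes the data directory sits on its own mount point, that `fsfreeze` (util-linux) and the AWS CLI are installed, and that `snapshot_with_freeze`, the mount point, the volume id and the tag are all placeholders of mine, not anything Cassandra or tablesnap provides. I have not verified that this removes the corruption Russ is seeing.

```python
import subprocess
from typing import Callable, List

Command = List[str]


def _run(cmd: Command) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)


def snapshot_with_freeze(
    mount_point: str,
    volume_id: str,
    tag: str,
    run: Callable[[Command], None] = _run,
) -> None:
    """Take a Cassandra snapshot, then an EBS snapshot of a frozen file system.

    All arguments are hypothetical placeholders supplied by the caller.
    """
    # nodetool returns once the snapshot (hard links) has been created.
    run(["nodetool", "snapshot", "-t", tag])

    # Freeze the file system so no writes are in flight while the
    # EBS snapshot is initiated. Writes to this mount block until unfreeze.
    run(["fsfreeze", "--freeze", mount_point])
    try:
        run([
            "aws", "ec2", "create-snapshot",
            "--volume-id", volume_id,
            "--description", tag,
        ])
    finally:
        # Always unfreeze, even if the EBS call fails, or the node stays blocked.
        run(["fsfreeze", "--unfreeze", mount_point])
```

Passing a different `run` callable (for example `print`) gives a dry run that shows the command order without touching the node. The `try`/`finally` is the important part: a file system left frozen will stall every write on that mount.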

On Fri, Mar 28, 2014 at 1:32 PM, Laing, Michael <michael.la...@nytimes.com> wrote:

> +1 for tablesnap
>
> On Fri, Mar 28, 2014 at 4:28 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
>> I will +1 the recommendation on using tablesnap over EBS. S3 is at least
>> predictable.
>>
>> Additionally, from a practical standpoint, you may want to back up your
>> sstables somewhere. If you use S3, it's easy to pull just the new tables
>> out via aws-cli tools (s3 sync), to your remote, non-aws server, and not
>> incur the overhead of routinely backing up the entire dataset. For a
>> non-trivial database, this matters quite a bit.
>>
>> On Fri, Mar 28, 2014 at 1:21 PM, Laing, Michael
>> <michael.la...@nytimes.com> wrote:
>>
>>> As I tried to say, EBS snapshots require much care or you get corruption
>>> such as you have encountered.
>>>
>>> Does Cassandra quiesce the file system after a snapshot using fsfreeze
>>> or xfs_freeze? Somehow I doubt it...
>>>
>>> On Fri, Mar 28, 2014 at 4:17 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>
>>>> I have a nagging memory of reading about issues with virtualization and
>>>> not actually having durable versions of your data even after an fsync
>>>> (within the VM). Googling around led me to this post:
>>>> http://petercai.com/virtualization-is-bad-for-database-integrity/
>>>>
>>>> It's possible you're hitting this issue, with the virtualization
>>>> layer, or with EBS itself. Just a shot in the dark though, other people
>>>> would likely know much more than I.
>>>>
>>>> On Fri, Mar 28, 2014 at 12:50 PM, Russ Lavoie <ussray...@yahoo.com> wrote:
>>>>
>>>>> Robert,
>>>>>
>>>>> That is what I thought as well. But apparently something is
>>>>> happening. The only way I can get away with doing this is adding a
>>>>> sleep 60 right after the nodetool snapshot is executed. I can
>>>>> reproduce this 100% of the time by not issuing a sleep after
>>>>> nodetool snapshot.
>>>>>
>>>>> This is the error.
>>>>>
>>>>> ERROR [SSTableBatchOpen:1] 2014-03-28 17:08:14,290 CassandraDaemon.java (line 191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
>>>>> org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
>>>>>     at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
>>>>>     at org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
>>>>>     at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
>>>>>     at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:407)
>>>>>     at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:198)
>>>>>     at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
>>>>>     at org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:262)
>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>> Caused by: java.io.EOFException
>>>>>     at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
>>>>>     at java.io.DataInputStream.readUTF(DataInputStream.java:589)
>>>>>     at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>>>>>     at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
>>>>>     ... 11 more
>>>>>
>>>>> On Friday, March 28, 2014 2:38 PM, Robert Coli <rc...@eventbrite.com> wrote:
>>>>> On Fri, Mar 28, 2014 at 12:21 PM, Russ Lavoie <ussray...@yahoo.com> wrote:
>>>>>
>>>>> Thank you for your quick response.
>>>>>
>>>>> Is there a way to tell when a snapshot is completely done?
>>>>>
>>>>> IIRC, the JMX call blocks until the snapshot completes. It should be
>>>>> done when nodetool returns.
>>>>>
>>>>> =Rob
>>>>
>>>> --
>>>> Jon Haddad
>>>> http://www.rustyrazorblade.com
>>>> skype: rustyrazorblade
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> skype: rustyrazorblade

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade