Another thing to keep in mind: if you are hitting the issue I described,
waiting 60 seconds will not reliably solve your problem; it will only make
it less likely to occur.  If a memtable is still only partially flushed at
the 60-second mark, you will end up with the same corrupt sstable.
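
To illustrate why a fixed sleep is only probabilistic, here is a rough
Python sketch that polls a condition instead: it waits until the files
under a directory stop changing for several consecutive checks.  The
function names (dir_fingerprint, wait_until_stable) and the default values
are made up for this example and are not part of Cassandra or nodetool.
It is still a heuristic: it only sees what the guest file system reports,
so it cannot prove the data has actually reached EBS.

```python
import os
import time


def dir_fingerprint(path):
    """Return a sorted list of (relative path, size, mtime_ns) for files under path."""
    entries = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                st = os.stat(full)
            except FileNotFoundError:
                # The file vanished between listing and stat (e.g. it was removed).
                continue
            entries.append((os.path.relpath(full, path), st.st_size, st.st_mtime_ns))
    return sorted(entries)


def wait_until_stable(path, interval=5.0, checks=3, timeout=300.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Return True once `checks` consecutive fingerprints match, False on timeout.

    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    previous = dir_fingerprint(path)
    matches = 0
    while clock() < deadline:
        sleep(interval)
        current = dir_fingerprint(path)
        if current == previous:
            matches += 1
            if matches >= checks:
                return True
        else:
            # Something changed: restart the count from the new state.
            matches = 0
            previous = current
    return False
```

You would call wait_until_stable() on the data directory after nodetool
snapshot returns and only start the EBS snapshot if it returns True.  A
False result means files were still changing when the timeout expired.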


On Fri, Mar 28, 2014 at 1:32 PM, Laing, Michael
<michael.la...@nytimes.com> wrote:

> +1 for tablesnap
>
>
> On Fri, Mar 28, 2014 at 4:28 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
>> I will +1 the recommendation on using tablesnap over EBS.  S3 is at least
>> predictable.
>>
>> Additionally, from a practical standpoint, you may want to back up your
>> sstables somewhere.  If you use S3, it's easy to pull just the new tables
>> out via aws-cli tools (s3 sync), to your remote, non-aws server, and not
>> incur the overhead of routinely backing up the entire dataset.  For a non
>> trivial database, this matters quite a bit.
>>
>>
>> On Fri, Mar 28, 2014 at 1:21 PM, Laing, Michael <
>> michael.la...@nytimes.com> wrote:
>>
>>> As I tried to say, EBS snapshots require much care or you get corruption
>>> such as you have encountered.
>>>
>>> Does Cassandra quiesce the file system after a snapshot using fsfreeze
>>> or xfs_freeze? Somehow I doubt it...
>>>
>>>
>>> On Fri, Mar 28, 2014 at 4:17 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>
>>>> I have a nagging memory of reading about issues with virtualization and
>>>> not actually having durable versions of your data even after an fsync
>>>> (within the VM).  Googling around led me to this post:
>>>> http://petercai.com/virtualization-is-bad-for-database-integrity/
>>>>
>>>> It's possible you're hitting this issue, either with the virtualization
>>>> layer or with EBS itself.  Just a shot in the dark, though; other people
>>>> would likely know much more than I do.
>>>>
>>>>
>>>>
>>>> On Fri, Mar 28, 2014 at 12:50 PM, Russ Lavoie <ussray...@yahoo.com> wrote:
>>>>
>>>>> Robert,
>>>>>
>>>>> That is what I thought as well.  But apparently something is
>>>>> happening.  The only way I can avoid the error is to add a "sleep 60"
>>>>> right after nodetool snapshot is executed.  Without the sleep, I can
>>>>> reproduce it 100% of the time.
>>>>>
>>>>> This is the error.
>>>>>
>>>>> ERROR [SSTableBatchOpen:1] 2014-03-28 17:08:14,290 CassandraDaemon.java (line 191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
>>>>> org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
>>>>>     at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
>>>>>     at org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
>>>>>     at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
>>>>>     at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:407)
>>>>>     at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:198)
>>>>>     at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
>>>>>     at org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:262)
>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>> Caused by: java.io.EOFException
>>>>>     at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
>>>>>     at java.io.DataInputStream.readUTF(DataInputStream.java:589)
>>>>>     at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>>>>>     at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
>>>>>     ... 11 more
>>>>>
>>>>>
>>>>> On Friday, March 28, 2014 2:38 PM, Robert Coli <rc...@eventbrite.com> wrote:
>>>>> On Fri, Mar 28, 2014 at 12:21 PM, Russ Lavoie <ussray...@yahoo.com> wrote:
>>>>>
>>>>> Thank you for your quick response.
>>>>>
>>>>> Is there a way to tell when a snapshot is completely done?
>>>>>
>>>>>
>>>>> IIRC, the JMX call blocks until the snapshot completes. It should be
>>>>> done when nodetool returns.
>>>>>
>>>>> =Rob
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jon Haddad
>>>> http://www.rustyrazorblade.com
>>>> skype: rustyrazorblade
>>>>
>>>
>>>
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> skype: rustyrazorblade
>>
>
>


-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
