Looks like someone has the same (1-4) questions:
https://issues.apache.org/jira/browse/CASSANDRA-6364
-M
"graham sanderson" wrote in message
news:7161e7e0-cf24-4b30-b9ca-2faafb0c4...@vast.com...
We are currently looking to deploy on the 2.0 line of Cassandra (we are
currently on 2.0.2), but are obviously watching for bugs - we are aware of a
couple of interesting known bugs to be fixed in 2.0.3 and one in 2.1, but
none have been observed in our production use cases or are likely to affect
our current proposed deployment.
I have a few general questions:
The first test we tried was to physically remove the SSD commit log drive
from one of the nodes whilst under HEAVY write load (maybe a few hundred
MB/s of data, replicated 3 times across a 6-node single local data center),
while also running read performance tests. We currently have both node
(CQL3) and Astyanax (Thrift) clients.
Frankly everything was pretty good (no read/write failures, nor indeed any
observed latency issues), except for the following - maybe people can
comment on any of these:
1) There were NO errors in the log on the node where we removed the commit
log SSD drive - this surprised us (of course our ops monitoring would detect
the downed disk too, but we also hope to be able to alert on ERROR-level
logging in system.log) - see the cassandra.yaml sketch after this list.
2) The node with no commit log disk just kept writing to memtables, but:
3) This was causing major CMS GC issues, which eventually caused the node to
appear down (per nodetool status) to all other nodes, and it in turn saw all
other nodes as down. That said, the dynamic snitch and latency detection in
the clients seemed to prevent this from being much of a problem, though it
seems potentially undesirable from a server-side standpoint.
4) nodetool gossipinfo didn't report anything abnormal for any nodes when
run from any node.
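As an aside on 1) and 3): the only related knob we could find in
cassandra.yaml on 2.0.2 is disk_failure_policy, and as far as we can tell it
only covers the data volumes, not the commit log volume - which would
explain the silence in system.log. A sketch of the settings as we understand
them; the commented-out commit_failure_policy is the one proposed in
CASSANDRA-6364 (linked above), so treat that as an assumption until it
ships:

  # cassandra.yaml (2.0.x)
  # how the node reacts to DATA disk failures: stop | best_effort | ignore
  disk_failure_policy: stop
  # proposed in CASSANDRA-6364 (not in 2.0.2): a separate policy for
  # commit log disk failures, e.g.
  # commit_failure_policy: stop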
Sadly, because of an Astyanax issue (we were using the Thrift code path that
does a (now unnecessary) describe cluster to check for schema disagreement
before schema changes), we weren't able to create a new CF with a node
marked down, and thus couldn't immediately add more data to see what would
have happened: OOM or failure (we have since fixed this to go through the
CQL3 code path, but have not yet re-run the tests because of other
application-level testing going on)... that said, maybe someone knows off
the top of their head if there is a config setting that would start failing
writes (due to memtable size) before GC became an issue, and we just have
this misconfigured - see the sketch below.
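In case it helps anyone answer that last question, these are the
memtable-related knobs we are aware of in 2.0.x; the values shown are
illustrative assumptions, not our production settings:

  # cassandra.yaml (2.0.x)
  # total permitted memtable space before flushes are forced;
  # defaults to 1/4 of the heap if left blank
  memtable_total_space_in_mb: 2048
  # writers that flush memtables to disk (typically one per data directory)
  memtable_flush_writers: 1
  # full memtables allowed to wait for a flush writer; beyond this,
  # writes stall (as we understand it)
  memtable_flush_queue_size: 4

Our (possibly naive) expectation was that with flushing stalled, the flush
queue would fill and writes would block well before GC fell over.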
Secondly, our test was perhaps unrealistic in that when we brought the node
back up, we did so with the partial commit log on the replaced disk intact
(but the in-memory data lost); on restart we got the following sorts of
errors:
  At level 1,
  SSTableReader(path='/data/2/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-12-Data.db')
  [DecoratedKey(3508309769529441563,
  2d37363730383735353837333637383432323934), DecoratedKey(9158434231083901894,
  343934353436393734343637393130393335)] overlaps
  SSTableReader(path='/data/5/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-6-Data.db')
  [DecoratedKey(7446234284568345539, 33393230303730373632303838373837373436),
  DecoratedKey(9158426253052616687, 2d313430303837343831393637343030313136)].
  This could be caused by a bug in Cassandra 1.1.0 .. 1.1.3 or due to the
  fact that you have dropped sstables from another node into the data
  directory. Sending back to L0. If you didn't drop in sstables, and have
  not yet run scrub, you should do so since you may also have rows
  out-of-order within an sstable
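Per the log's own suggestion we are planning to scrub the affected CF - the
keyspace and CF names below are taken from the paths in the error above, so
substitute your own:

  # online scrub of the affected column family (one node at a time)
  nodetool scrub searchapi_dsp_approved_feed_beta 20131113151746_20131113_140712_1384348032
  # or, with the node stopped, the offline equivalent:
  sstablescrub searchapi_dsp_approved_feed_beta 20131113151746_20131113_140712_1384348032

though it would still be good to understand why the overlap happened at all.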
5) I guess the question is: what is the best way to bring up a failed node?
a) delete all data first?
b) clear data, but restore from a previous sstable backup to minimise
subsequent data transfer?
c) other suggestions? (a rough sketch of what we had in mind for a) and b)
follows below)
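For concreteness, here is roughly what we had in mind for a) and b) - the
paths are ours, and this is a sketch of our understanding rather than a
tested procedure:

  # a) bring the node back empty and let the cluster refill it
  nodetool drain               # flush and stop accepting writes (if the node is up)
  # ...stop the cassandra process, then:
  rm -rf /data/*/cassandra/*   # our data directories (commit log dir too)
  # ...restart cassandra, then rebuild its data from the other replicas:
  nodetool repair

  # b) as a), but copy sstables back from a recent snapshot into the CF
  #    directories before restarting, so repair has less data to stream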
6) Our experience is that taking nodes down that have problems, then
deleting data (subsets, if we can see partial corruption) and re-adding them
is much safer (but our cluster is VERY fast). That said, can we re-sync data
before re-enabling gossip, or at least before serving read requests from
those nodes (see the sketch below)? Not a huge issue, but it would mitigate
consistency issues with partially recovered data in the case that multiple
quorum read members were recovering. Note we fall back from (LOCAL_)QUORUM
to LOCAL_ONE on UnavailableException, so we have less of a guarantee
compared with both writing and reading at LOCAL_QUORUM (and if our
LOCAL_QUORUM writes fail we will just retry when the cluster is fixed -
stale data is not ideal but OK for a while).
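On 6), the only mechanism we are aware of - and this is an assumption on our
part; we have not verified the behaviour on 2.0 for a node restarting with
existing data rather than bootstrapping - is to start the node without
joining the ring and only join once we are happy:

  # start without joining the ring (assumption: also applies to a
  # restarting node, not just a fresh bootstrap)
  cassandra -Dcassandra.join_ring=false
  # ...run checks/repair, then join:
  nodetool join

If someone knows whether repair can actually run in that state, or of a
cleaner way to keep a recovering node out of the read path, we would love to
hear it.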
That said, given that the commit log on disk pre-dated any uncommitted lost
memtable data, it seems that we shouldn't have seen exceptions: this is kind
of like 5)b) in that it should have gotten us closer to the correct state
before the rest of the data was repaired, rather than causing any weirdness
(unless it was a missed-fsync problem) - but maybe I'm being naive.
Sorry for the long post, any thoughts would be appreciated.
Thanks,
Graham.