I got some very good advice on manual compaction, so I thought I would throw
out another question on RAID/backup strategies for production clusters.

We are debating RAID 0 vs. RAID 10 for data storage on our nodes. Currently
all the storage we use is RAID 10, since drives always fail and RAID 10
basically makes a drive failure a non-event. With Cassandra and a
replication factor of 3, we are starting to think that RAID 0 may be good
enough. Also, since we are buying a lot more inexpensive servers, RAID 0
fits that price point much better.
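
For a rough sense of the price difference, here is a minimal sketch of the
raw disk each layout needs. It assumes RAID 10 costs 2x raw capacity
(every stripe is mirrored) and RAID 0 costs 1x; the function name and the
1 TB figure are purely illustrative, not taken from our actual cluster.

```python
def raw_tb_needed(unique_tb, replication_factor, raid_level):
    """Raw disk (TB) needed cluster-wide to hold `unique_tb` of unique data."""
    # Assumption: RAID 0 has no redundancy overhead, RAID 10 mirrors
    # everything and therefore needs twice the raw capacity.
    raid_overhead = {"raid0": 1, "raid10": 2}[raid_level]
    return unique_tb * replication_factor * raid_overhead


if __name__ == "__main__":
    for level in ("raid0", "raid10"):
        # 1 TB of unique data at replication factor 3
        print(level, raw_tb_needed(1, 3, level), "TB raw")
```

So at a replication factor of 3, RAID 10 means buying six raw copies of
every byte instead of three.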

The problem now becomes: how do we deal with the drives that WILL fail in a
RAID 0 node? We are trying to use snapshots etc. to back up the data, but it
is slow (hours) and slows down the entire node. We assume this will work if
we back up at least every 2 days, since hinted handoff and reads could then
help bring the restored data back into sync. If we cannot back up every 1-2
days, then we are stuck with nodetool repair, decommission, etc. and relying
on Cassandra's built-in capabilities, but here things become more out of our
control and we are "afraid" to trust it. Like many in recent posts, we have
been less than successful in testing this out in the 0.6.x branch.

Can anyone share what they decided here and how they managed to deal with
these issues? Coming from the relational world, RAID 10 has been an
"assumption" for years, and we are not sure whether this assumption should
be dropped or held on to. Our nodes in dev currently hold around 500 GB each,
so for us the question is: how can we restore a node with this amount of
data, and how long will it take? Drives can and will fail, so how can we make
recovery a non-event? What is our total recovery time window? We want it to
be a matter of hours after drive replacement (which itself will take
minutes).
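
To frame the "how long" question, here is a back-of-the-envelope sketch.
It assumes the restore is limited purely by a sustained copy rate into the
node; the rates below are hypothetical placeholders, not measurements from
our hardware, and any compaction or repair work afterwards is ignored.

```python
def restore_hours(data_gb, throughput_mb_per_s):
    """Hours to copy `data_gb` onto a node at a sustained rate in MB/s."""
    seconds = data_gb * 1000 / throughput_mb_per_s  # 1 GB = 1000 MB
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical sustained rates; substitute measured numbers.
    for rate in (25, 50, 100):
        print(f"500 GB at {rate} MB/s: {restore_hours(500, rate):.1f} h")
```

Under these assumptions 500 GB at a sustained 50 MB/s is just under 3
hours, so an "hours" window looks plausible only if the node can really
take data in at that kind of rate.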

Thanks.

Wayne
