Prior to CASSANDRA-6696 you'd have to treat one missing disk as a failed machine: wipe all the data and re-stream it, because the tombstone for a given value may be on one disk and the data on another, so losing a single disk can effectively resurrect deleted data.
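For reference, that "treat it as a failed machine" path is the replacement procedure Andy points to below. A minimal sketch, assuming a package-style install where JVM options go in /etc/cassandra/conf/cassandra-env.sh and using 10.0.0.12 as a placeholder for the dead node's IP:

    # On the replacement host (same Cassandra version, empty data directories):
    # add the replace flag so the first boot streams the dead node's ranges.
    echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"' \
      >> /etc/cassandra/conf/cassandra-env.sh
    # auto_bootstrap should stay true (the default) so bootstrapping is not skipped.
    systemctl start cassandra
    # replace_address_first_boot is only honored on the node's first boot,
    # so it does not have to be removed afterwards.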
So the answer has to be version dependent, too - which version were you using?

> On Jan 16, 2023, at 9:08 AM, Tolbert, Andy <x...@andrewtolbert.com> wrote:
>
> Hi Joe,
>
> Reading it back I realized I misunderstood that part of your email, so
> you must be using data_file_directories with 16 drives? That's a lot
> of drives! I imagine this may happen from time to time given that
> disks like to fail.
>
> That's a bit of an interesting scenario that I would have to think
> about. If you brought the node up without the bad drive, repairs are
> probably going to do a ton of repair overstreaming if you aren't using
> 4.0 (https://issues.apache.org/jira/browse/CASSANDRA-3200), which may
> put things into a really bad state (lots of streaming = lots of
> compactions = slower reads), and you may be seeing some inconsistency
> if repairs weren't regularly running beforehand.
>
> How much data was on the drive that failed? How much data do you
> usually have per node?
>
> Thanks,
> Andy
>
>> On Mon, Jan 16, 2023 at 10:59 AM Joe Obernberger
>> <joseph.obernber...@gmail.com> wrote:
>>
>> Thank you Andy.
>> Is there a way to just remove the drive from the cluster and replace it
>> later? Ordering replacement drives isn't a fast process...
>> What I've done so far is:
>> Stop node
>> Remove drive reference from /etc/cassandra/conf/cassandra.yaml
>> Restart node
>> Run repair
>>
>> Will that work? Right now, it's showing all nodes as up.
>>
>> -Joe
>>
>>> On 1/16/2023 11:55 AM, Tolbert, Andy wrote:
>>> Hi Joe,
>>>
>>> I'd recommend just doing a replacement, bringing up a new node with
>>> -Dcassandra.replace_address_first_boot=ip.you.are.replacing as
>>> described here:
>>> https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
>>>
>>> Before you do that, you will want to make sure a cycle of repairs has
>>> run on the replicas of the down node to ensure they are consistent
>>> with each other.
>>>
>>> Make sure you also have 'auto_bootstrap: true' in the yaml of the node
>>> you are replacing and that the initial_token matches the node you are
>>> replacing (if you are not using vnodes) so the node doesn't skip
>>> bootstrapping. This is the default, but felt worth mentioning.
>>>
>>> You can also remove the dead node, which should stream data to the
>>> replicas that will pick up new ranges, but you will want to run
>>> repairs ahead of time for that as well. To be honest it's not
>>> something I've done recently, so I'm not as confident on executing
>>> that procedure.
>>>
>>> Thanks,
>>> Andy
>>>
>>>
>>> On Mon, Jan 16, 2023 at 9:28 AM Joe Obernberger
>>> <joseph.obernber...@gmail.com> wrote:
>>>> Hi all - what is the correct procedure when handling a failed disk?
>>>> I have a node in a 15 node cluster. This node has 16 drives and
>>>> Cassandra data is split across them. One drive is failing. Can I just
>>>> remove it from the list and Cassandra will then replicate? If not - what?
>>>> Thank you!
>>>>
>>>> -Joe
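For completeness, a minimal sketch of the interim procedure Joe lists above (stop, drop the failed directory from the yaml, restart, repair). The mount point /data/disk07 and the use of systemd are placeholders, and this is a stopgap rather than a substitute for the full node replacement:

    # Stop Cassandra before editing the configuration.
    systemctl stop cassandra
    # In /etc/cassandra/conf/cassandra.yaml, delete the failed mount from
    # data_file_directories, e.g. remove the entry:
    #     - /data/disk07
    # Bring the node back up and repair so replicas converge again.
    systemctl start cassandra
    nodetool repair --full

As Andy notes, on versions before 4.0 that repair can overstream heavily (CASSANDRA-3200), so expect extra streaming and compaction load while it runs.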