Re: Node failed after drive failed

Bowen Song Mon, 13 Dec 2021 10:42:04 -0800

AFAIK there's nothing special about being the first disk in the list. It's fairly likely that you were just "lucky" that an important part of the system keyspace was on that disk. It's always a good idea to keep some spare hardware at hand, because you never know when it will be needed.


On 13/12/2021 18:01, Joe Obernberger wrote:

Thank you Bowen. I had the policy set to "best_effort", but as Jeff pointed out, since it was the first disk in the list that failed, maybe that is a special case?
I don't have a spare drive at the moment, so I'll just delete all the cassandra data on that node and have it rejoin as a new node.
-Joe

On 12/11/2021 3:44 PM, Bowen Song wrote:
Hi Joe,
In case of a single disk failure, you should not remove the data directory from the cassandra.yaml file. Instead, you should replace the failed disk with a new empty disk. See https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsRecoverUsingJBOD.html for the steps.
Since your node failed to start, I guess it's not too late to restore the settings in the cassandra.yaml file and then follow the above steps. However, replacing the entire node is always an option if everything else has failed, as long as you have RF>1 and the other nodes in the cluster are all healthy. If you need to do this, follow the steps here: https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsReplaceNode.html
As for your last question,
> When a drive fails with cassandra, is it common for the node to come down?
this actually depends on the disk_failure_policy in your cassandra.yaml file; reading the comments in it will help you understand the available choices.
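For reference, a sketch of the relevant cassandra.yaml entry follows. The list of choices is paraphrased from the comments in the default cassandra.yaml shipped with 3.x/4.0; treat it as an approximation and check the file for your own version, since wording and defaults can differ between releases.

```yaml
# disk_failure_policy: what Cassandra does when a data disk fails.
# Choices (paraphrased from the default file's comments):
#   die           - on filesystem or single-sstable errors, shut down gossip and
#                   client transports and kill the JVM, so the node can be replaced
#   stop_paranoid - shut down gossip and client transports even for single-sstable
#                   errors; kill the JVM for errors during startup
#   stop          - shut down gossip and client transports, leaving the node
#                   effectively dead but still inspectable via JMX; kill the JVM
#                   for errors during startup
#   best_effort   - stop using the failed disk and serve requests from the
#                   remaining sstables (obsolete data may be returned at CL.ONE)
#   ignore        - ignore fatal errors and let requests fail
disk_failure_policy: stop
```

The value shown (stop) is the usual default; Joe's node was running with best_effort.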
Cheers,
Bowen

On 06/12/2021 14:11, Joe Obernberger wrote:
Hi All, one node in an 11-node cluster experienced a drive failure on the first drive in the list. I removed that drive from the list so that it now reads:
data_file_directories:
    - /data/2/cassandra/data
    - /data/3/cassandra/data
    - /data/4/cassandra/data
    - /data/5/cassandra/data
    - /data/6/cassandra/data
    - /data/8/cassandra/data
    - /data/9/cassandra/data

But when I try to start the server, I get:
Exception (java.lang.RuntimeException) encountered during startup: A node with address /172.16.100.251:7000 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
java.lang.RuntimeException: A node with address /172.16.100.251:7000 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:659)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:934)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:784)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:420)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)
ERROR [main] 2021-12-05 15:49:48,446 CassandraDaemon.java:909 - Exception encountered during startup
java.lang.RuntimeException: A node with address /172.16.100.251:7000 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:659)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:934)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:784)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:420)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)
INFO  [StorageServiceShutdownHook] 2021-12-05 15:49:48,468 HintsService.java:220 - Paused hints dispatch
WARN  [StorageServiceShutdownHook] 2021-12-05 15:49:48,470 Gossiper.java:1993 - No local state, state is in silent shutdown, or node hasn't joined, not announcing shutdown
Do I need to remove and re-add the node? When a drive fails with cassandra, is it common for the node to come down?
Thank you!

-Joe Obernberger