Hi,

We see a lot of cases out there where a node was down for longer than
the GC grace period and, once that node comes back up, there are a lot
of zombie data issues ... you know the story.

We would like to implement some kind of check that detects this, so
that such a node would not start in the first place. That would avoid
the issues entirely and leave it to operators to figure out first what
to do with the node.

There are a couple of ideas we were exploring with various pros and
cons and I would like to know what you think about them.

1) Register a shutdown hook on "drain". This is already there (1).
The "drain" method does quite a lot of work and is called on shutdown,
so our idea is to write a timestamp to system.local, into a new column
like "lastly_drained" or something like that, and read it back on
startup.

The disadvantage of this approach, or of any approach based on shutdown
hooks, is that it only reacts to SIGTERM and SIGINT. If the node is
killed via SIGKILL, the JVM just stops and there is basically nothing
we can rely on to leave any traces behind.

If it is killed and that value is not overwritten, then on the next
startup the stored value might be older than 10 days, so the check
would falsely conclude that the node should not be started.
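
To make idea 1 concrete, here is a minimal sketch in plain Java. It is
not Cassandra code: a marker file stands in for the proposed
system.local column, and the class DrainMarker and all of its method
names are hypothetical. The 10-day threshold is the one used in this
mail.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical sketch: a marker file stands in for a column in system.local.
public final class DrainMarker {
    // Assumed threshold, taken from the 10 days discussed in this mail.
    static final Duration MAX_DOWNTIME = Duration.ofDays(10);

    // Persist the time of a clean shutdown.
    static void recordDrain(Path marker, Instant when) throws IOException {
        Files.write(marker, when.toString().getBytes(StandardCharsets.UTF_8));
    }

    // Read the last recorded shutdown time, if a marker exists.
    static Optional<Instant> readLastDrain(Path marker) throws IOException {
        if (!Files.exists(marker)) {
            return Optional.empty();
        }
        String text = new String(Files.readAllBytes(marker), StandardCharsets.UTF_8).trim();
        return Optional.of(Instant.parse(text));
    }

    // Startup decision: allow the start if there is no marker yet, or if the
    // recorded shutdown is at most MAX_DOWNTIME in the past.
    static boolean mayStart(Optional<Instant> lastDrain, Instant now) {
        return !lastDrain.isPresent()
            || Duration.between(lastDrain.get(), now).compareTo(MAX_DOWNTIME) <= 0;
    }

    // Runs on normal JVM exit, SIGTERM and SIGINT; never runs on SIGKILL.
    static void installHook(Path marker) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                recordDrain(marker, Instant.now());
            } catch (IOException e) {
                // Best effort only: the JVM is already shutting down.
            }
        }));
    }
}
```

The false positive described above shows up directly here: if the last
clean shutdown was 30 days ago and the node was then SIGKILLed
yesterday, the marker still holds the 30-day-old value and mayStart
returns false.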

2) Do this on startup: check how old all your sstables and commit logs
are, and abort the start if no file was modified within the last 10
days. There is a pretty big chance that your node did at least
something in 10 days. Nothing needs to be added to system tables or
similar, and it would be just another StartupCheck.

The disadvantage of this is that some dev clusters, for example, may
run for more than 10 days while doing absolutely nothing at all: nobody
interacts with them, nobody is repairing them, they are just sitting
there. So when nobody talks to these nodes, no files are modified,
right?
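
Idea 2 could look roughly like the sketch below, again in plain Java
and not tied to the real StartupCheck interface. The class FileAgeCheck
and its methods are hypothetical, and the caller is assumed to pass in
the data and commit log directories.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: decide on startup from file modification times only.
public final class FileAgeCheck {
    // Assumed threshold, taken from the 10 days discussed in this mail.
    static final Duration MAX_AGE = Duration.ofDays(10);

    // Find the most recent modification time of any regular file under dirs.
    static Optional<Instant> newestModification(List<Path> dirs) throws IOException {
        Instant newest = null;
        for (Path dir : dirs) {
            if (!Files.isDirectory(dir)) {
                continue;
            }
            List<Path> files;
            try (Stream<Path> walk = Files.walk(dir)) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            for (Path file : files) {
                Instant modified = Files.getLastModifiedTime(file).toInstant();
                if (newest == null || modified.isAfter(newest)) {
                    newest = modified;
                }
            }
        }
        return Optional.ofNullable(newest);
    }

    // Allow the start if there are no files at all (fresh node), or if the
    // newest file was modified at most MAX_AGE ago.
    static boolean mayStart(List<Path> dirs, Instant now) throws IOException {
        Optional<Instant> newest = newestModification(dirs);
        return !newest.isPresent()
            || Duration.between(newest.get(), now).compareTo(MAX_AGE) <= 0;
    }
}
```

Note that this check cannot tell a node that was down for 30 days from
an idle dev node that was up the whole time but wrote nothing: both
just look like "newest file is 30 days old".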

It seems like there is no silver bullet here. What is your opinion on
this?

Regards

(1) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
