Apologies, I missed Paulo's reply because of my email client's threading funnies...

On 4/11/21 7:50, Berenguer Blasi wrote:
> What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts.
>
> On 3/11/21 21:53, Stefan Miklosovic wrote:
>> Hi,
>>
>> We see a lot of cases out there when a node was down for longer than
>> the GC period and once that node is up there are a lot of zombie data
>> issues ... you know the story.
>>
>> We would like to implement some kind of a check which would detect
>> this so that the node would not start in the first place. No issues
>> would arise at all, and it would be up to operators to figure out
>> first what to do with it.
>>
>> There are a couple of ideas we were exploring with various pros and
>> cons and I would like to know what you think about them.
>>
>> 1) Register a shutdown hook on "drain". This is already there (1).
>> The "drain" method does quite a lot of stuff and is called on
>> shutdown, so our idea is to write a timestamp to system.local into a
>> new column like "lastly_drained" or something like that, and it would
>> be read on startup.
>>
>> The disadvantage of this approach, or of all approaches via shutdown
>> hooks, is that it will react only to SIGTERM and SIGINT. If that
>> node is killed via SIGKILL, the JVM just stops and there is basically
>> nothing we can guarantee would leave some traces behind.
>>
>> If it is killed and that value is not overwritten, on the next startup
>> the value might be older than 10 days, so the check would falsely
>> conclude that the node should not be started.
>>
>> 2) Doing this on startup, you would check how old all your sstables
>> and commit logs are. If no file was modified less than 10 days ago, you
>> would abort the start. There is a pretty big chance that your node did
>> at least something in 10 days. Nothing needs to be added to system
>> tables or similar, and it would be just another StartupCheck.
>>
>> The disadvantage of this is that some dev clusters, for example, may
>> run more than 10 days while just sitting there doing absolutely
>> nothing at all: nobody interacts with them and nobody is repairing
>> them. So when nobody talks to these nodes, no files are modified,
>> right?
>>
>> It seems like there is no silver bullet here. What is your opinion on
>> this?
>>
>> Regards
>>
>> (1)
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
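
To make the heartbeat idea a bit more concrete, here is a rough sketch in plain Java of how a periodic 'lastSeenAlive' timestamp could feed a startup check. Everything in it is an assumption for illustration: the class name LastSeenAliveCheck, the Verdict enum, and the choice of a small file instead of a system.local column are all hypothetical, and none of it is existing Cassandra API. The point is only that a heartbeat survives SIGKILL (the last written value is at most one interval stale) and does not depend on sstables being touched.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical sketch, not Cassandra code: persists a heartbeat timestamp
// and decides on startup whether the node was down longer than allowed.
final class LastSeenAliveCheck {
    public enum Verdict { OK, NO_HEARTBEAT, DOWN_TOO_LONG }

    private final Path file;            // assumed location of the heartbeat file
    private final Duration maxDowntime; // e.g. the GC period (10 days in the thread)

    public LastSeenAliveCheck(Path file, Duration maxDowntime) {
        this.file = file;
        this.maxDowntime = maxDowntime;
    }

    // Called periodically (e.g. hourly) while the node is running.
    // Writes to a temp file first, then moves it over the old value.
    public void recordHeartbeat(Instant now) {
        try {
            Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
            byte[] bytes = Long.toString(now.toEpochMilli()).getBytes(StandardCharsets.UTF_8);
            Files.write(tmp, bytes);
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the last persisted heartbeat, if any.
    public Optional<Instant> lastSeenAlive() {
        try {
            if (!Files.exists(file)) {
                return Optional.empty();
            }
            String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            return Optional.of(Instant.ofEpochMilli(Long.parseLong(raw)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Called once on startup, before the node joins the ring.
    public Verdict evaluate(Instant now) {
        Optional<Instant> last = lastSeenAlive();
        if (!last.isPresent()) {
            return Verdict.NO_HEARTBEAT; // fresh node, or heartbeat never written
        }
        Duration downtime = Duration.between(last.get(), now);
        return downtime.compareTo(maxDowntime) > 0 ? Verdict.DOWN_TOO_LONG : Verdict.OK;
    }
}
```

In this sketch a missing heartbeat is reported separately from a stale one, so a brand-new node (or one upgraded from a version without the heartbeat) would not be refused by mistake. How the real check should treat that case is an open question for the list.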