What about an hourly heartbeat that writes a 'lastSeenAlive' timestamp? My 2 cents.
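
A rough sketch of what I mean, in Python for brevity (Cassandra itself is Java). Everything here is an assumption made for illustration: the heartbeat file, the function names, and the 10-day threshold taken from the figure used below. It is not an existing Cassandra API, and the timestamp could equally live in a system table.

```python
from pathlib import Path
from typing import Optional

# Assumed values: 10 days as mentioned in the thread, and an hourly heartbeat.
GC_GRACE_SECONDS = 10 * 24 * 60 * 60
HEARTBEAT_INTERVAL_SECONDS = 60 * 60


def write_heartbeat(path: Path, now: float) -> None:
    """Persist the 'lastSeenAlive' timestamp (hypothetical helper).

    Writes to a temporary file first and then renames it, so a crash
    in the middle of a write does not leave a half-written timestamp.
    """
    tmp = path.with_suffix(".tmp")
    tmp.write_text(str(int(now)))
    tmp.replace(path)


def read_heartbeat(path: Path) -> Optional[int]:
    """Return the stored timestamp, or None if it is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def may_start(path: Path, now: float, max_down: int = GC_GRACE_SECONDS) -> bool:
    """Startup check: refuse to start if the node was last seen alive
    more than max_down seconds ago. With no record at all (for example
    on first boot) we cannot tell, so this sketch allows the start.
    """
    last_seen = read_heartbeat(path)
    if last_seen is None:
        return True
    return now - last_seen <= max_down
```

A background task would call write_heartbeat(path, time.time()) every HEARTBEAT_INTERVAL_SECONDS. Since the timestamp is written while the node is running and not only on shutdown, a SIGKILL should leave it at most about one interval stale.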

On 3/11/21 21:53, Stefan Miklosovic wrote:
> Hi,
>
> We see a lot of cases out there where a node was down for longer than
> the GC period and, once that node is back up, there are a lot of
> zombie data issues ... you know the story.
>
> We would like to implement a check that detects this, so that such a
> node would not start in the first place. No zombie data would appear,
> and it would be up to operators to decide what to do with the node
> first.
>
> There are a couple of ideas we were exploring with various pros and
> cons and I would like to know what you think about them.
>
> 1) Register a shutdown hook on "drain". This is already there (1).
> The "drain" method does quite a lot of work and is called on
> shutdown, so our idea is to write a timestamp to system.local, into a
> new column such as "lastly_drained", and read it on startup.
>
> The disadvantage of this approach, and of all approaches based on
> shutdown hooks, is that they react only to SIGTERM and SIGINT. If the
> node is killed via SIGKILL, the JVM just stops, and nothing is
> guaranteed to run that would leave a trace behind.
>
> If the node is killed and that value is not overwritten, then on the
> next startup the stored timestamp might be older than 10 days, so the
> check would falsely conclude that the node should not be started.
>
> 2) Do the check on startup: look at how old all your sstables and
> commit logs are, and abort the start if no file was modified less
> than 10 days ago. There is a pretty big chance that your node did at
> least something in 10 days. Nothing needs to be added to system
> tables or similar, and it would be just another StartupCheck.
>
> The disadvantage is that some dev clusters, for example, may run for
> more than 10 days while doing absolutely nothing: nobody interacts
> with them and nobody repairs them. So when nobody talks to these
> nodes, no files are modified, right?
>
> It seems there is no silver bullet here. What is your opinion on this?
>
> Regards
>
> (1) 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
