Re: The most reliable way to determine the last time node was up

Berenguer Blasi Wed, 03 Nov 2021 23:54:10 -0700

Apologies, I missed Paulo's reply because of my email client's threading funnies...

On 4/11/21 7:50, Berenguer Blasi wrote:
> What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts.
>
> On 3/11/21 21:53, Stefan Miklosovic wrote:
>> Hi,
>>
>> We see a lot of cases out there where a node was down for longer than
>> the GC grace period (gc_grace_seconds), and once that node is back up
>> there are a lot of zombie data issues ... you know the story.
>>
>> We would like to implement some kind of check that detects this and
>> refuses to start the node in the first place, so that no zombie data
>> can appear and it is up to operators to figure out what to do with
>> the node first.
>>
>> There are a couple of ideas we were exploring with various pros and
>> cons and I would like to know what you think about them.
>>
>> 1) Register a shutdown hook on "drain". This is already there (1).
>> The "drain" method does quite a lot of stuff and is called on
>> shutdown, so our idea is to write a timestamp to system.local into a
>> new column like "lastly_drained" or something like that, and to read
>> it on startup.
>>
>> The disadvantage of this approach, or of any approach via shutdown
>> hooks, is that it reacts only to SIGTERM and SIGINT. If the node is
>> killed via SIGKILL, the JVM just stops and there is basically nothing
>> we can guarantee would leave any trace behind.
>>
>> If the node is killed that way, the value is not overwritten, so on
>> the next startup it might be older than 10 days and the check would
>> falsely conclude that the node should not be started.
>>
>> 2) Doing this on startup: check how old all your sstables and commit
>> logs are, and abort the start if no file was modified within the last
>> 10 days. There is a pretty big chance that your node did at least
>> something in 10 days. Nothing needs to be added to system tables or
>> similar, and it would be just another StartupCheck.
>>
>> The disadvantage of this is that some dev clusters, for example, may
>> run more than 10 days and they are just sitting there doing absolutely
>> nothing at all, nobody interacts with them, nobody is repairing them,
>> they are just sitting there. So when nobody talks to these nodes, no
>> files are modified, right?
>>
>> It seems there is no silver bullet here. What is your opinion on
>> this?
>>
>> Regards
>>
>> (1) 
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org

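For option 2) in the quoted message, a minimal Java sketch of the file-age check might look like the following. It is not wired into Cassandra's StartupCheck machinery. The class name FileAgeCheck, its method names, and the decision to let a node with no files at all start (treating it as a fresh node) are assumptions made for illustration only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Cassandra.
final class FileAgeCheck {

    // Newest modification time of any regular file under the given directories,
    // or empty if there are no files at all.
    static Optional<Instant> newestModification(List<Path> dirs) throws IOException {
        Instant newest = null;
        for (Path dir : dirs) {
            if (!Files.isDirectory(dir)) {
                continue; // missing directory: nothing to inspect
            }
            List<Path> regularFiles;
            try (Stream<Path> walk = Files.walk(dir)) {
                regularFiles = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            for (Path file : regularFiles) {
                Instant modified = Files.getLastModifiedTime(file).toInstant();
                if (newest == null || modified.isAfter(newest)) {
                    newest = modified;
                }
            }
        }
        return Optional.ofNullable(newest);
    }

    // True only if there was some activity and the latest one is older than the threshold.
    static boolean isStale(Optional<Instant> lastActivity, Instant now, Duration threshold) {
        return lastActivity.isPresent()
            && Duration.between(lastActivity.get(), now).compareTo(threshold) > 0;
    }

    // Assumption: no files at all means a fresh node, so the start is allowed.
    static boolean shouldAbortStart(List<Path> dirs, Instant now, Duration threshold) throws IOException {
        return isStale(newestModification(dirs), now, threshold);
    }
}
```

A caller would pass the data and commit log directories together with Duration.ofDays(10), the figure used in the thread. As the quoted message points out, an idle dev cluster that has been running the whole time would still trip this check.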

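The hourly 'lastSeenAlive' heartbeat could be sketched roughly as below. The thread does not say where the timestamp would live, so this sketch assumes a small local file rather than a column in system.local. The class name LastSeenAlive and all of its methods are hypothetical, and a missing heartbeat file is assumed to mean a first start.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Cassandra.
final class LastSeenAlive {

    // Write to a temp file, then rename, so a crash mid-write cannot leave a
    // truncated heartbeat. On POSIX file systems the atomic rename replaces an
    // existing target; other platforms may behave differently.
    static void write(Path file, Instant now) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(now.toEpochMilli()).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
    }

    // Empty if no heartbeat was ever written (assumed to be a first start).
    static Optional<Instant> read(Path file) throws IOException {
        if (!Files.exists(file)) {
            return Optional.empty();
        }
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
        return Optional.of(Instant.ofEpochMilli(Long.parseLong(text)));
    }

    static boolean downTooLong(Optional<Instant> lastSeen, Instant now, Duration threshold) {
        return lastSeen.isPresent()
            && Duration.between(lastSeen.get(), now).compareTo(threshold) > 0;
    }

    // Starts a daemon thread that refreshes the heartbeat once per period.
    static ScheduledExecutorService start(Path file, Duration period) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "last-seen-alive");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                write(file, Instant.now());
            } catch (IOException e) {
                // Swallow and report so that later runs are not cancelled.
                System.err.println("heartbeat write failed: " + e);
            }
        }, 0, period.toMillis(), TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

On startup the node would call read and downTooLong before calling start, since start overwrites the old value immediately. Unlike a shutdown hook, this leaves a usable timestamp even after SIGKILL, and it keeps ticking on an idle node. The stored value can lag the real time of death by up to one period, which is small next to a 10-day threshold.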