Re: The most reliable way to determine the last time node was up

Elliott Sims Thu, 04 Nov 2021 12:57:31 -0700

To deal with this, I've just made a very small Bash script that looks at
commitlog age, then set the script as an "ExecStartPre=" in systemd:


# Abort startup if data exists but no commitlog was modified in the last 8 days.
if [[ -d '/opt/cassandra/data/data' && \
      $(/usr/bin/find /opt/cassandra/data/commitlog/ \
          -name 'CommitLog*.log' -mtime -8 | wc -l) -eq 0 ]]; then
  >&2 echo "ERROR: precheck failed, Cassandra data too old"
  exit 10
fi

The first conditional is there to reduce false positives on brand-new
machines with no data.
I suspect it'll false-positive if your writes are extremely rare (that is,
basically read-only), but at that point you may not need it at all.
(Adjust as needed for your grace period and paths.)
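
For anyone who wants the grace period and paths as parameters rather than
hard-coded, here is a rough sketch of the same check as a function. The
function name and argument order are my own invention for illustration, and
it assumes a find that supports -quit (GNU find does); treat it as a starting
point rather than a tested production script.

```shell
#!/usr/bin/env bash
# Sketch only: commitlog_is_fresh is a hypothetical helper name, not part of
# Cassandra or systemd.
#
# Usage: commitlog_is_fresh DATA_DIR COMMITLOG_DIR MAX_AGE_DAYS
# Returns 0 if startup may proceed, 10 if no commitlog segment is recent enough.
commitlog_is_fresh() {
  local data_dir=$1 commitlog_dir=$2 max_age_days=$3

  # No data directory yet: treat this as a brand-new node and allow startup.
  [[ -d "$data_dir" ]] || return 0

  # Look for at least one segment modified within the grace period.
  # -print -quit stops at the first match instead of listing every file.
  local recent
  recent=$(find "$commitlog_dir" -name 'CommitLog*.log' \
             -mtime "-${max_age_days}" -print -quit 2>/dev/null)

  if [[ -z "$recent" ]]; then
    >&2 echo "ERROR: precheck failed, no commitlog newer than ${max_age_days} days"
    return 10
  fi
  return 0
}
```

Ending the script with
`commitlog_is_fresh /opt/cassandra/data/data /opt/cassandra/data/commitlog 8 || exit $?`
should behave like the check above, and the script can be wired in via
"ExecStartPre=" in the same way.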

On Thu, Nov 4, 2021 at 12:54 AM Berenguer Blasi <[email protected]>
wrote:

> Apologies, I missed Paulo's reply on my email client threading funnies...
>
> On 4/11/21 7:50, Berenguer Blasi wrote:
> > What about an hourly heartbeat 'lastSeenAlive' timestamp? my 2cts.
> >
> > On 3/11/21 21:53, Stefan Miklosovic wrote:
> >> Hi,
> >>
> >> We see a lot of cases out there when a node was down for longer than
> >> the GC period and once that node is up there are a lot of zombie data
> >> issues ... you know the story.
> >>
> >> We would like to implement some kind of a check which would detect
> >> this so that node would not start in the first place so no issues
> >> would be there at all and it would be up to operators to figure out
> >> first what to do with it.
> >>
> >> There are a couple of ideas we were exploring with various pros and
> >> cons and I would like to know what you think about them.
> >>
> >> 1) Register a shutdown hook on "drain". This is already there (1).
> >> "drain" method is doing quite a lot of stuff and this is called on
> >> shutdown so our idea is to write a timestamp to system.local into a
> >> new column like "lastly_drained" or something like that and it would
> >> be read on startup.
> >>
> >> The disadvantage of this approach, or all approaches via shutdown
> >> hooks, is that it will react only to SIGTERM and SIGINT. If that
> >> node is killed via SIGKILL, the JVM just stops and there is basically
> >> no guarantee that anything would leave traces behind.
> >>
> >> If it is killed and that value is not overwritten, on the next startup
> >> it might happen that it would be older than 10 days so it will falsely
> >> evaluate it should not be started.
> >>
> >> 2) Doing this on startup, you would check how old all your sstables
> >> and commit logs are, if no file was modified less than 10 days ago you
> >> would abort start, there is pretty big chance that your node did at
> >> least something in 10 days, there does not need to be anything added
> >> to system tables or similar and it would be just another StartupCheck.
> >>
> >> The disadvantage of this is that some dev clusters, for example, may
> >> run more than 10 days and they are just sitting there doing absolutely
> >> nothing at all, nobody interacts with them, nobody is repairing them,
> >> they are just sitting there. So when nobody talks to these nodes, no
> >> files are modified, right?
> >>
> >> It seems like there is not a silver bullet here, what is your opinion
> >> on this?
> >>
> >> Regards
> >>
> >> (1) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >> .
