Re: The most reliable way to determine the last time node was up

Stefan Miklosovic Wed, 03 Nov 2021 15:29:03 -0700

Yes, this is a combination of the system.local and "marker file"
approaches: basically, updating that field periodically.
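
A minimal sketch of that periodic update, for illustration only. The
class name LastCheckpointRecorder and the Consumer<Instant> "sink" are
hypothetical; in a real patch the sink would be whatever writes the new
system.local column, which is not shown here because its API is not
settled in this thread.

```java
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: periodically records "this node is alive" via a pluggable sink. */
final class LastCheckpointRecorder implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "last-checkpoint");
            t.setDaemon(true); // must not keep the JVM alive on its own
            return t;
        });

    // Assumption: the sink persists the timestamp somewhere durable
    // (e.g. a system.local column or a marker file) and is thread-safe.
    private final Consumer<Instant> sink;

    LastCheckpointRecorder(Consumer<Instant> sink) {
        this.sink = sink;
    }

    /** Starts recording a checkpoint immediately and then once per period. */
    void start(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(this::checkpoint, 0, period, unit);
    }

    /** Records the current time; failures are swallowed so the schedule keeps running. */
    void checkpoint() {
        try {
            sink.accept(Instant.now());
        } catch (RuntimeException e) {
            // A periodic task that throws is never rescheduled, so only report here.
            System.err.println("checkpoint failed: " + e);
        }
    }

    /** Stops the periodic task and records one last checkpoint (the "on drain" write). */
    @Override
    public void close() {
        scheduler.shutdownNow();
        checkpoint();
    }
}
```

With a one-minute period, the stored value is at most about a minute
older than the real time of death even after SIGKILL, and close() covers
the clean shutdown case.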


However, when a mutation is done against the system table (in this
example), it goes to the commit log and is then propagated to an
sstable on disk, no? So in our hypothetical scenario, a node that is
not touched by anybody would still behave like it _does_ something.
I would expect that if nobody talks to a node and no operation is
running, it does not produce any "side effects".

I just do not want to generate any unnecessary noise. A node which
does not do anything should not change its data. I am not sure if it
is like that already, or if an inactive node still writes new
sstables after some time; I doubt it.
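
For reference, the file-age check from option 2 in the quoted message
below could be sketched roughly like this. StaleNodeCheck and its method
names are made up for illustration and are not existing Cassandra code;
the 10-day threshold is the figure used in this thread and would be a
parameter in practice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

/** Hypothetical sketch of a startup check based on file modification times. */
final class StaleNodeCheck {

    /** Newest modification time among regular files under dir; empty if there are none. */
    static Optional<Instant> newestModification(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .map(StaleNodeCheck::modifiedAt)
                        .max(Comparator.naturalOrder());
        }
    }

    private static Instant modifiedAt(Path file) {
        try {
            return Files.getLastModifiedTime(file).toInstant();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the last known activity is further in the past than maxDowntime. */
    static boolean isStale(Instant lastActivity, Instant now, Duration maxDowntime) {
        return Duration.between(lastActivity, now).compareTo(maxDowntime) > 0;
    }
}
```

A caller would run newestModification() over the data and commit log
directories and abort startup when isStale(newest, Instant.now(),
Duration.ofDays(10)) is true. This has exactly the weakness discussed
here: a node that was up but idle looks the same as a node that was down.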

On Wed, 3 Nov 2021 at 22:58, Paulo Motta <[email protected]> wrote:
>
> How about a last_checkpoint (or better name) system.local column that is
> updated periodically (i.e. every minute) + on drain? This would give a lower
> time bound on when the node was last live without requiring an external
> marker file.
>
> On Wed, 3 Nov 2021 at 18:03 Stefan Miklosovic <
> [email protected]> wrote:
>
> > The third option would be to have a thread running in the
> > background "touching" an (empty) marker file. It is the simplest
> > solution, but I do not like the idea of this marker file; it feels
> > dirty. But hey, as long as it is an opt-in feature for people who
> > know what they want, why not, right ...
> >
> > On Wed, 3 Nov 2021 at 21:53, Stefan Miklosovic
> > <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > > We see a lot of cases out there where a node was down for longer than
> > > > the GC grace period, and once that node is back up there are a lot of
> > > > zombie data issues ... you know the story.
> > >
> > > > We would like to implement some kind of check which would detect
> > > > this, so that the node would not start in the first place. Then there
> > > > would be no issues at all, and it would be up to operators to figure
> > > > out what to do with the node first.
> > >
> > > There are a couple of ideas we were exploring with various pros and
> > > cons and I would like to know what you think about them.
> > >
> > > > 1) Register a shutdown hook on "drain". This is already there (1).
> > > > The "drain" method does quite a lot of work and is called on
> > > > shutdown, so our idea is to write a timestamp to system.local into a
> > > > new column like "lastly_drained" or something like that, and it would
> > > > be read on startup.
> > >
> > > > The disadvantage of this approach, or of all approaches via shutdown
> > > > hooks, is that it will react only to SIGTERM and SIGINT. If the
> > > > node is killed via SIGKILL, the JVM just stops and there is basically
> > > > no guarantee that anything would leave traces behind.
> > >
> > > > If it is killed and that value is not overwritten, on the next startup
> > > > the stored value might be older than 10 days, so the check would
> > > > falsely conclude that the node should not be started.
> > >
> > > > 2) Doing this on startup: you would check how old all your sstables
> > > > and commit logs are, and if no file was modified in the last 10 days
> > > > you would abort the start. There is a pretty big chance that your node
> > > > did at least something in 10 days. Nothing needs to be added
> > > > to system tables or similar, and it would be just another StartupCheck.
> > >
> > > The disadvantage of this is that some dev clusters, for example, may
> > > run more than 10 days and they are just sitting there doing absolutely
> > > nothing at all, nobody interacts with them, nobody is repairing them,
> > > they are just sitting there. So when nobody talks to these nodes, no
> > > files are modified, right?
> > >
> > > > It seems like there is no silver bullet here; what is your opinion
> > > > on this?
> > >
> > > Regards
> > >
> > > (1)
> > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L786-L799
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
