If we have the documentation in place, we can then consider the cache to be the master copy of the metadata and rely on it to be always accurate and up to date. If someone deletes the snapshot files directly from the filesystem, they can't complain that Cassandra stopped working correctly - just as they couldn't if they had manually deleted some SSTable files (which they shouldn't).

On 09/08/2024 11:16, Štefan Miklošovič wrote:
We could indeed do that. Does your suggestion mean that there should not be a problem with caching it all once it is explicitly stated like that?

On Fri, Aug 9, 2024 at 12:01 PM Bowen Song via dev <dev@cassandra.apache.org> wrote:

    Has anyone considered simply updating the documentation to say this?

    "Removing the snapshot files directly from the filesystem may
    break things. Always use the `nodetool` command or JMX to remove
    snapshots."

    On 09/08/2024 09:18, Štefan Miklošovič wrote:
    If we consider caching it all to be too much, we could make
    caching an option an admin would need to opt in to. There
    might be a flag in cassandra.yaml: once enabled, the metadata
    would be held in memory; otherwise it would be loaded from
    disk as it is now, so people can decide whether caching is
    enough for them or whether they want the previous behavior
    (which would be the default). This puts additional complexity
    into SnapshotManager, but it should be doable in general.
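
    For illustration only (the option name below is hypothetical
    and does not exist), such a cassandra.yaml flag could look
    like:

        # Proposed, hypothetical option: keep snapshot metadata
        # cached in memory. When false, listing snapshots scans
        # the disk every time, as it does today.
        snapshot_metadata_cache_enabled: false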

    Let me know what you think. I would really like to have this
    resolved; 18111 brings a lot of code cleanup and simplifies
    things a lot.

    On Wed, Aug 7, 2024 at 11:30 PM Josh McKenzie
    <jmcken...@apache.org> wrote:

        If you have a lot of snapshots and, for example, a metric
        monitoring them and their sizes, then creating the metric
        can cause performance degradation if you don't cache it.
        We added the cache because we saw this happen to
        databases more than once.
        I mean, I believe you, I'm just surprised that querying
        metadata for files and doing basic computation leads to
        hundreds of ms of pause time even on systems with a lot
        of files. Aren't most / all of these values cached at the
        filesystem layer, so we're basically just choosing tomato
        / tomahto between caching systems, either one we maintain
        or one the OS maintains?

        Or is the number of files really just well outside what
        I'm thinking of here?

        Anyway, not trying to cause a ruckus and make needless noise,
        trying to learn. ;)


        On Wed, Aug 7, 2024, at 3:20 PM, Štefan Miklošovič wrote:


        On Wed, Aug 7, 2024 at 6:39 PM Yifan Cai
        <yc25c...@gmail.com> wrote:

            With WatchService, when events are missed (which is
            to be expected), you will still need to list the
            files. It seems to me that WatchService doesn't offer
            significant benefits in this case.


        Yeah, I think we will leave it out eventually.


            Regarding listing the directory with a refresh flag,
            my concern is the potential for abuse. End-users
            might/could always refresh before listing, which
            could undermine the purpose of caching. Perhaps
            Jeremiah can provide more insight on this.


        Well, by default, it would not be refreshed every single
        time. You would need to opt in to that. If there is a
        shop which has users with direct access to the disks of
        Cassandra nodes and they are removing data manually, I do
        not know what to say - what are nodetool clearsnapshot
        and the JMX methods good for then? I do not think we can
        prevent people from shooting themselves in the foot if
        they are absolutely willing to do that.

        If they want to refresh every time, that would be equal
        to the current behavior. It would be at most as "bad" as
        it is now.

            IMO, caching is best handled internally. I have a few
            UX-related questions:
            - Is it valid or acceptable to return stale data? If so,
            end-users have to do some form of validation before
            consuming each snapshot to account for potential deletions.


        Answer below.

            - Even if listsnapshots returns the most recent data,
            is it possible that some of the directories get
            deleted while end-users are accessing them? I think
            it is. That, then, forces end-users to do some
            validation first, similar to handling stale data.


        I think what you were trying to say is that when somebody
        lists snapshots at time T0 and somebody removes a
        snapshot manually at T1, then the list of snapshots is no
        longer accurate? Sure. That is a thing. This is how it
        currently works.

        Now, we want to cache them, so if you clear a snapshot
        which is not physically there because somebody removed it
        manually, that should be a no-op; it will just be removed
        from the internal tracker. So, if it is on disk and in
        the cache and you clear it, then all is fine. It is fine
        too if it is no longer on disk and you clear it; it is
        then just removed internally. Clearing would fail only if
        you try to remove a snapshot which is not cached,
        regardless of whether it is on disk or not. Deleting a
        non-existent snapshot ends up in a failure; nothing
        should change in that regard, as this is the current
        behavior too.
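
        A minimal sketch of that clearing logic (class and method
        names here are hypothetical, not the actual
        SnapshotManager code):

            import java.io.File;
            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            // Sketch only; names are hypothetical.
            class SnapshotTracker
            {
                private final Map<String, File> cache = new ConcurrentHashMap<>();

                void clearSnapshot(String tag)
                {
                    // A snapshot that is not in the cache cannot be
                    // cleared, whether or not it is on disk - the same
                    // failure as deleting a non-existent snapshot today.
                    File dir = cache.remove(tag);
                    if (dir == null)
                        throw new IllegalArgumentException("unknown snapshot: " + tag);

                    // If the directory was already removed manually,
                    // there is nothing to delete on disk; dropping the
                    // cache entry above is the whole no-op.
                    if (dir.exists())
                        deleteRecursively(dir);
                }

                private void deleteRecursively(File f)
                {
                    File[] children = f.listFiles();
                    if (children != null)
                        for (File child : children)
                            deleteRecursively(child);
                    f.delete();
                }
            }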

        I want to say that I did not write it completely
        correctly at the very beginning of this thread.
        Currently, we are caching only _expiring_ snapshots,
        because we need to know their time of removal so we can
        act on it later. _Normal_ snapshots are _not_ cached
        _yet_. I spent so much time with 18111 that I live in a
        reality where it is already in; I forgot this is not
        actually in place yet, though we are very close to that.

        OK thank you all for your insights, I will NOT use inotify.


            Just my 2 cents.

            - Yifan

            On Wed, Aug 7, 2024 at 6:03 AM Štefan Miklošovič
            <smikloso...@apache.org> wrote:

                Yes, for example as reported here

                https://issues.apache.org/jira/browse/CASSANDRA-13338

                People who are charting this in monitoring
                dashboards might also hit this.

                On Wed, Aug 7, 2024 at 2:59 PM J. D. Jordan
                <jeremiah.jor...@gmail.com> wrote:

                    If you have a lot of snapshots and, for
                    example, a metric monitoring them and their
                    sizes, then creating the metric can cause
                    performance degradation if you don't cache
                    it. We added the cache because we saw this
                    happen to databases more than once.

                    > On Aug 7, 2024, at 7:54 AM, Josh McKenzie
                    <jmcken...@apache.org> wrote:
                    >
                    >
                    >>
                    >> Snapshot metadata are currently stored in
                    memory / cached so we do not need to go to
                    disk every single time we want to list them;
                    the more snapshots we have, the worse it is.
                    > Are we enumerating our snapshots somewhere on
                    the hot path, or is this performance concern
                    misplaced?
                    >
                    >> On Wed, Aug 7, 2024, at 7:44 AM, Štefan
                    Miklošovič wrote:
                    >> Snapshot metadata are currently stored in
                    memory / cached so we do not need to go to
                    disk every single time we want to list them;
                    the more snapshots we have, the worse it is.
                    >>
                    >> When a snapshot is _manually_ removed from
                    disk - not via nodetool clearsnapshot, but by
                    an rm -rf on the respective snapshot directory
                    - such a snapshot will still be visible in
                    nodetool listsnapshots. Manual removal of a
                    snapshot might happen e.g. by accident, or by
                    some "impatient" operator who just goes to the
                    disk and removes it there instead of using
                    nodetool or the respective JMX method.
                    >>
                    >> To improve UX here, what I came up with is
                    that we might use Java's WatchService, with
                    which each snapshot dir would be registered.
                    WatchService is part of Java; on Linux it is
                    backed by the inotify subsystem the kernel
                    offers. The result is that once a snapshot dir
                    is registered to be watched and is later
                    removed, we are notified about that via
                    inotify / WatchService, so we can react to it
                    and remove the in-memory representation, and
                    it will not be visible in the output anymore.
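
                    >> For illustration, a minimal sketch of that
                    idea (the path and wiring are hypothetical):

                        import java.nio.file.*;
                        import static java.nio.file.StandardWatchEventKinds.ENTRY_DELETE;

                        public class SnapshotWatcher
                        {
                            public static void main(String[] args) throws Exception
                            {
                                WatchService watcher = FileSystems.getDefault().newWatchService();
                                // Watch the parent "snapshots" dir so we are told when
                                // a snapshot dir under it is removed.
                                // (hypothetical path, for illustration only)
                                Path snapshots = Paths.get("/var/lib/cassandra/data/ks/tbl/snapshots");
                                snapshots.register(watcher, ENTRY_DELETE);

                                WatchKey key;
                                while ((key = watcher.take()) != null)
                                {
                                    for (WatchEvent<?> event : key.pollEvents())
                                    {
                                        // Here we would drop the deleted snapshot's
                                        // in-memory representation.
                                        System.out.println("removed: " + event.context());
                                    }
                                    key.reset();
                                }
                            }
                        }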
                    >>
                    >> While this works, there are some questions
                    / concerns:
                    >>
                    >> 1) What do people think about inotify in
                    general? I tested this on 10k snapshots and it
                    seems to work just fine; nevertheless, there
                    is in general no strong guarantee that every
                    single event will come through, and there is a
                    family of kernel parameters around this where
                    more tuning can be done, etc. It is also
                    questionable how this will behave on systems
                    other than Linux (Mac etc.). While the JRE on
                    other platforms also implements this, I am not
                    completely sure those implementations are
                    quality-wise on par with the Linux one. There
                    is a history of not-so-great implementations
                    for other systems (events not coming through
                    on Macs etc.) and while I think we are safe on
                    Linux, I am not sure we want to go with this
                    elsewhere.
                    >>
                    >> 2) inotify brings more entropy into the
                    codebase; it is another thing we need to take
                    care of, etc. (however, it is all concentrated
                    in one class and pretty much "isolated" from
                    everything else)
                    >>
                    >> I made this feature optional and it is
                    turned off by default, so people need to
                    explicitly opt in to it and we are not forcing
                    it on anybody.
                    >>
                    >> If we do not want to go with inotify,
                    another option would be to have a background
                    thread which would periodically check whether
                    a manifest exists on disk; if it does not,
                    then the snapshot does not either. While this
                    works, what I do not like about it is that the
                    primary reason we moved this to memory was to
                    bypass IO as much as possible, yet here we
                    would introduce another check which would go
                    to disk, and it would run periodically, which
                    defeats the whole purpose. If an operator
                    lists snapshots once a week and there is a
                    background check running every 10 minutes (for
                    example), then the cumulative number of IO
                    operations might be bigger than us just doing
                    nothing at all. For this reason, if we do not
                    want to go with inotify, I would also not
                    implement any automatic background check.
                    Instead, there would be a
                    SnapshotManagerMbean#refresh() method which
                    would just forcibly reload all snapshots from
                    scratch. I think that manual deletion of
                    snapshots is not something a user would do
                    regularly; snapshots are meant to be removed
                    via nodetool or JMX. If manual removal ever
                    happens, refreshing these snapshots would be
                    required to make the cache synchronized again.
                    There might be an additional flag for nodetool
                    listsnapshots; when specified, a refresh would
                    be done first, otherwise not.
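
                    >> A sketch of that proposed method (only
                    refresh() is from this proposal; the interface
                    scaffolding around it is hypothetical):

                        public interface SnapshotManagerMbean
                        {
                            /**
                             * Drop the in-memory snapshot cache and rebuild it
                             * by rescanning the snapshot directories on disk.
                             * Meant for the rare case where snapshots were
                             * removed manually, outside nodetool / JMX.
                             */
                            void refresh();
                        }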
                    >>
                    >> How does this all sound to people?
                    >>
                    >> Regards
                    >
