That true / false for "force" argument seems to be purely derived from whether it is global repair (force set to false) or not (set to true).
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L185-L192 On Tue, May 13, 2025 at 5:25 PM David Capwell <dcapw...@apple.com> wrote: > That method can be backported even if it’s just for this single case. For > the vtable I needed something more friendly so made this for that. It’s > also used for dedup logic in repair retry works. > > I didn’t look closely, what sets force to be true or false? I assumed this > was the force flag in repair (as these are repair snapshots) but sounds > like that isn’t the case? > > Sent from my iPhone > > On May 12, 2025, at 11:58 PM, Štefan Miklošovič <smikloso...@apache.org> > wrote: > > > David, > > that "force" you copied from nodetool, that has nothing to do with "force" > in these snapshots but I think you already know that. > > org.apache.cassandra.repair.RepairJobDesc#determanisticId is only in 4.1+ > included, not in 4.0. That's quite a bummer but we could limit the patch > for 4.1+ only and leave 4.0 be. > > When we do that, then "force" will become redundant and we can perhaps > remove it? > > How I look at it is that the introduction of "force" flag in > TableRepairManager is the result of a developer trying to consolidate the > name of snapshots from parent id / session id to both parent id / parent id > but there was the realisation of that being problematic because else branch > might be invoked multiple times so it would start to clash. So they said > "ok and here we just override and take no matter what" but the > implementation of that was not finished as it clearly fails. > > So if we moved to determanisticId for both branches of if / else and > remove force, we should be good? RepairJobDesc#deterministicId is basically > unique? I see it is created from parent id, session id, keyspace, table and > ranges. I do not think there would ever be two cases of snapshots like that > with exact same values. > > WDYT? > > Regards > > On Mon, May 12, 2025 at 6:55 PM David Capwell <dcapw...@apple.com> wrote: > >> > "force" can be true / false. When true, then it will not check if a >> snapshot exists >> >> >> My mental node for “force” was only to deal with down nodes, and nothing >> to do with snapshots… so this feels like a weird behavior >> >> @Option(title = "force", name = {"-force", "--force"}, description = "Use >> -force to filter out down endpoints”) >> >> Our documentation even only calls out dealing with downed enodes… >> >> >> > a) we will remove existing snapshot and take a new one >> >> So this is racy and can make repairs brittle. The snapshot is at the >> RepairSession (global) or RepairJob level (a subset of ranges for a single >> table, different RepairJobs can work on the same table, just different >> ranges). With this code path you also have 1 more variable at play that >> makes things complex: isGlobal >> >> public boolean isGlobal() {return dataCenters.isEmpty() && >> hosts.isEmpty();} >> >> Global does the full range, where as non-global covers the partial range >> >> So when you do host or dc this snapshot is isolated to the RepairJob and >> not the session logically, but physically its isolated to the session; >> which doesn’t make any sense. >> >> > 2) if we do not want to fix "force", then it is most probably just >> redundant and we would remove it but it order to not fail, we would need to >> go back to naming snapshots not parent id / parent id but parent id / >> session id for global and non-global repair respectively (basically we >> would return to pre-14116 behavior). >> >> org.apache.cassandra.repair.RepairJobDesc#determanisticId >> >> This is scoped to a single RepairJob, so wouldn’t have the issues with >> concurrency. So if we were going to alter what the snapshot name is, I >> would strongly prefer this (its also present in the repair vtable, so not >> some hidden UUID) >> >> I personally feel the lowest risk is to switch to the job name and away >> from the session name we had… I do wonder about removing this “force” >> semantic as its not documented to users as far as I can tell, so is there >> any place that defines this behavior? >> >> > we might just fix the trunk and keep the rest broken, not ideal but >> >> With some patches its hard to tell if something is a bug or a feature… so >> altering the semantic is that making an improvement to allow better >> combinations of repair options, or is it fixing a bug (aka worthy of a back >> port)…. I know most users do parallel repair (default) so this snapshot >> logic doesn’t impact most users, but its also why its been so long before >> this issue got noticed… so coverage around snapshots currently isn’t the >> best. >> >> Sorry for this reply, its less of an answer and more a dump while I think >> through the problem you stated…. >> >> > On May 12, 2025, at 9:18 AM, Štefan Miklošovič <smikloso...@apache.org> >> wrote: >> > >> > Hello, >> > >> > I am looking for advice for (1). There is quite a quality archaeology >> to go through. I have asked around already (thanks for that!) but nobody >> seems to ultimately remember / tell what it is for. >> > >> > There is (2). "force" can be true / false. When true, then it will not >> check if a snapshot exists. If it exists and "force" is true, this has >> never worked - that is what CASSANDRA-20490 is for. It would try to hard >> link twice and it would error out. >> > >> > This "force" logic is called here (3). If ParentRepairSession (prs) is >> global, "force" is false, but if it is not (not global means that some dc >> was specified), then it is true. >> > >> > Notice that both branches of that if/else use >> desc.parentSessionId.toString() as snapshots' name. >> > >> > This "force" logic was introduced in (4) but it seems to me that it >> never worked. AFAICT this was just broken on delivery. >> > >> > One commit before (4), it looked like (5). The first branch of "if" >> used parentSessionId as a snapshot name and it checked if it exists or not >> before taking it, the else branch took sessionId (not parent session id) >> and took a snapshot of that name without any check if it exists or not. >> > >> > From javadoc on TableRepairManager, "force" means: (introduced in (4) >> CASSANDRA-14116) >> > >> > "If force is true, a snapshot should be taken even if one already >> exists with that name.". >> > >> > Questions: >> > >> > 1) do we want to fix the behaviour of "force"? (patch 20490) >> > if the answer is yes, what is the "semantics" about this. What does >> "force" mean exactly? It can mean >> > a) we will remove existing snapshot and take a new one >> > b) we will add more SSTables onto existing snapshot, we just hardlink >> more SSTables into same snapshot (that is what 20490 is doing) >> > >> > 2) if we do not want to fix "force", then it is most probably just >> redundant and we would remove it but it order to not fail, we would need to >> go back to naming snapshots not parent id / parent id but parent id / >> session id for global and non-global repair respectively (basically we >> would return to pre-14116 behavior). >> > >> > I think the introduction of "force" flag was there because there might >> be multiple calls to "else" branch in RepairMessageVerbHandler and calling >> it all over again with parentSessionId would be erroneous when already >> there, so "force" was there for not caring and just taking a snapshot >> again. If we go back to sessionId only, then we do not have this problem I >> guess. >> > >> > Is it important for people to keep the name of a snapshot in both cases >> (global / not global) equal to parentSessionId? Can we just go back? >> > >> > If we want to keep naming snapshots as done now, we might just do >> 20490, I was trying to fix older branches as well but it is very difficult, >> we might just fix the trunk and keep the rest broken, not ideal but ... I >> am just reinventing a wheel in 4.0+, it is just too cumbersome / complex to >> copy trunk's behaviour in 20490 to older stuff. >> > >> > Thanks and regards >> > >> > (1) https://issues.apache.org/jira/browse/CASSANDRA-20490 >> > (2) >> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/repair/CassandraTableRepairManager.java#L84 >> > (3) >> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L185-L192 >> > (4) https://issues.apache.org/jira/browse/CASSANDRA-14116 >> > (5) >> https://github.com/apache/cassandra/blob/478c1a9fdf027af0f373f9e26521803ae8775fdb/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L106-L121 >> >>