That true / false for "force" argument seems to be purely derived from
whether it is global repair (force set to false) or not (set to true).

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L185-L192

On Tue, May 13, 2025 at 5:25 PM David Capwell <dcapw...@apple.com> wrote:

> That method can be backported even if it’s just for this single case. For
> the vtable I needed something more friendly so made this for that. It’s
> also used for dedup logic in repair retry works.
>
> I didn’t look closely, what sets force to be true or false? I assumed this
> was the force flag in repair (as these are repair snapshots) but sounds
> like that isn’t the case?
>
> Sent from my iPhone
>
> On May 12, 2025, at 11:58 PM, Štefan Miklošovič <smikloso...@apache.org>
> wrote:
>
> 
> David,
>
> that "force" you copied from nodetool, that has nothing to do with "force"
> in these snapshots but I think you already know that.
>
> org.apache.cassandra.repair.RepairJobDesc#determanisticId is only in 4.1+
> included, not in 4.0. That's quite a bummer but we could limit the patch
> for 4.1+ only and leave 4.0 be.
>
> When we do that, then "force" will become redundant and we can perhaps
> remove it?
>
> How I look at it is that the introduction of "force" flag in
> TableRepairManager is the result of a developer trying to consolidate the
> name of snapshots from parent id / session id to both parent id / parent id
> but there was the realisation of that being problematic because else branch
> might be invoked multiple times so it would start to clash. So they said
> "ok and here we just override and take no matter what" but the
> implementation of that was not finished as it clearly fails.
>
> So if we moved to determanisticId for both branches of if / else and
> remove force, we should be good? RepairJobDesc#deterministicId is basically
> unique? I see it is created from parent id, session id, keyspace, table and
> ranges. I do not think there would ever be two cases of snapshots like that
> with exact same values.
>
> WDYT?
>
> Regards
>
> On Mon, May 12, 2025 at 6:55 PM David Capwell <dcapw...@apple.com> wrote:
>
>> > "force" can be true / false. When true, then it will not check if a
>> snapshot exists
>>
>>
>> My mental node for “force” was only to deal with down nodes, and nothing
>> to do with snapshots… so this feels like a weird behavior
>>
>> @Option(title = "force", name = {"-force", "--force"}, description = "Use
>> -force to filter out down endpoints”)
>>
>> Our documentation even only calls out dealing with downed enodes…
>>
>>
>> >   a) we will remove existing snapshot and take a new one
>>
>> So this is racy and can make repairs brittle.  The snapshot is at the
>> RepairSession (global) or RepairJob level (a subset of ranges for a single
>> table, different RepairJobs can work on the same table, just different
>> ranges). With this code path you also have 1 more variable at play that
>> makes things complex: isGlobal
>>
>> public boolean isGlobal() {return dataCenters.isEmpty() &&
>> hosts.isEmpty();}
>>
>> Global does the full range, where as non-global covers the partial range
>>
>> So when you do host or dc this snapshot is isolated to the RepairJob and
>> not the session logically, but physically its isolated to the session;
>> which doesn’t make any sense.
>>
>> > 2) if we do not want to fix "force", then it is most probably just
>> redundant and we would remove it but it order to not fail, we would need to
>> go back to naming snapshots not parent id / parent id but parent id /
>> session id for global and non-global repair respectively (basically we
>> would return to pre-14116 behavior).
>>
>> org.apache.cassandra.repair.RepairJobDesc#determanisticId
>>
>> This is scoped to a single RepairJob, so wouldn’t have the issues with
>> concurrency.  So if we were going to alter what the snapshot name is, I
>> would strongly prefer this (its also present in the repair vtable, so not
>> some hidden UUID)
>>
>> I personally feel the lowest risk is to switch to the job name and away
>> from the session name we had… I do wonder about removing this “force”
>> semantic as its not documented to users as far as I can tell, so is there
>> any place that defines this behavior?
>>
>> > we might just fix the trunk and keep the rest broken, not ideal but
>>
>> With some patches its hard to tell if something is a bug or a feature… so
>> altering the semantic is that making an improvement to allow better
>> combinations of repair options, or is it fixing a bug (aka worthy of a back
>> port)…. I know most users do parallel repair (default) so this snapshot
>> logic doesn’t impact most users, but its also why its been so long before
>> this issue got noticed… so coverage around snapshots currently isn’t the
>> best.
>>
>> Sorry for this reply, its less of an answer and more a dump while I think
>> through the problem you stated….
>>
>> > On May 12, 2025, at 9:18 AM, Štefan Miklošovič <smikloso...@apache.org>
>> wrote:
>> >
>> > Hello,
>> >
>> > I am looking for advice for (1). There is quite a quality archaeology
>> to go through. I have asked around already (thanks for that!) but nobody
>> seems to ultimately remember / tell what it is for.
>> >
>> > There is (2). "force" can be true / false. When true, then it will not
>> check if a snapshot exists. If it exists and "force" is true, this has
>> never worked - that is what CASSANDRA-20490 is for. It would try to hard
>> link twice and it would error out.
>> >
>> > This "force" logic is called here (3). If ParentRepairSession (prs)  is
>> global, "force" is false, but if it is not (not global means that some dc
>> was specified), then it is true.
>> >
>> > Notice that both branches of that if/else use
>> desc.parentSessionId.toString() as snapshots' name.
>> >
>> > This "force" logic was introduced in (4) but it seems to me that it
>> never worked. AFAICT this was just broken on delivery.
>> >
>> > One commit before (4), it looked like (5). The first branch of "if"
>> used parentSessionId as a snapshot name and it checked if it exists or not
>> before taking it, the else branch took sessionId (not parent session id)
>> and took a snapshot of that name without any check if it exists or not.
>> >
>> > From javadoc on TableRepairManager, "force" means: (introduced in (4)
>> CASSANDRA-14116)
>> >
>> > "If force is true, a snapshot should be taken even if one already
>> exists with that name.".
>> >
>> > Questions:
>> >
>> > 1) do we want to fix the behaviour of "force"? (patch 20490)
>> >   if the answer is yes, what is the "semantics" about this. What does
>> "force" mean exactly? It can mean
>> >   a) we will remove existing snapshot and take a new one
>> >   b) we will add more SSTables onto existing snapshot, we just hardlink
>> more SSTables into same snapshot (that is what 20490 is doing)
>> >
>> > 2) if we do not want to fix "force", then it is most probably just
>> redundant and we would remove it but it order to not fail, we would need to
>> go back to naming snapshots not parent id / parent id but parent id /
>> session id for global and non-global repair respectively (basically we
>> would return to pre-14116 behavior).
>> >
>> > I think the introduction of "force" flag was there because there might
>> be multiple calls to "else" branch in RepairMessageVerbHandler and calling
>> it all over again with parentSessionId would be erroneous when already
>> there, so "force" was there for not caring and just taking a snapshot
>> again. If we go back to sessionId only, then we do not have this problem I
>> guess.
>> >
>> > Is it important for people to keep the name of a snapshot in both cases
>> (global / not global) equal to parentSessionId? Can we just go back?
>> >
>> > If we want to keep naming snapshots as done now, we might just do
>> 20490, I was trying to fix older branches as well but it is very difficult,
>> we might just fix the trunk and keep the rest broken, not ideal but ... I
>> am just reinventing a wheel in 4.0+, it is just too cumbersome / complex to
>> copy trunk's behaviour in 20490 to older stuff.
>> >
>> > Thanks and regards
>> >
>> > (1) https://issues.apache.org/jira/browse/CASSANDRA-20490
>> > (2)
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/repair/CassandraTableRepairManager.java#L84
>> > (3)
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L185-L192
>> > (4) https://issues.apache.org/jira/browse/CASSANDRA-14116
>> > (5)
>> https://github.com/apache/cassandra/blob/478c1a9fdf027af0f373f9e26521803ae8775fdb/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L106-L121
>>
>>

Reply via email to