On 4/25/25 14:25, Fiona Ebner wrote:
On 25.04.25 at 10:36, Daniel Kral wrote:
On 4/24/25 12:12, Fiona Ebner wrote:
As suggested by @Lukas off-list, I'll also try to make the check
selective, e.g. the user has made an infeasible change to the config
manually by writing to the file and then wants to create another rule.
Here it should ignore the infeasible rules (as they'll be dropped
anyway) and only check if the added rule / changed rule is infeasible.

How will you select the rule to drop? Applying the rules one-by-one to
find a first violation?

AFAICS we could use the same helpers to check whether the rules are feasible and then only check whether the added / updated ruleid is among those causing trouble. I guess this would be a reasonable option that avoids duplicating code while still checking against the whole config. There's surely some optimization potential here, but then we'd have a larger problem with reloading the rule configuration for the manager anyway. For the latter I could check at what size of configuration this would become an actual bottleneck.

For either adding or updating a rule, we would just make the change to the configuration in-memory and run the helper. Depending on the result, we'd either store the config or error out to the API user.
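
Roughly, a sketch of how that could look (read_rules, check_feasibility and write_rules are just placeholder names here, not actual helpers):

sub assert_change_feasible {
    my ($ruleid, $newrule) = @_;

    my $rules = read_rules();
    $rules->{ids}->{$ruleid} = $newrule; # apply the change in-memory only

    # assume check_feasibility() returns a hash of conflicting rule ids
    my $conflicts = check_feasibility($rules);

    # only reject if the added/updated rule itself is part of a conflict;
    # pre-existing infeasible rules are ignored, as they'll be dropped anyway
    die "rule '$ruleid' is in conflict with existing rules\n"
        if $conflicts->{$ruleid};

    write_rules($rules); # feasible, so persist the change
}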


But as you said, it must not change the user's configuration in the end
as that would be very confusing to the user.

Okay, so dropping dynamically. I guess we could also disable such rules
explicitly/mark them as being in violation with other rules somehow:
Tri-state enabled/disabled/conflict status? Explicit field?

Something like that would make such rules easily visible and have the
configuration better reflect the actual status.

As discussed off-list now: we can try to re-enable conflicting rules
next time the rules are loaded.

Hm, there's three options now:

- Allowing conflicts over the create / update API and auto-resolving the conflicts as soon as we're able to (e.g. on the load / save where the rule becomes feasible again).

- Not allowing conflicts over the create / update API, but setting the state to 'conflict' if manual changes (or other circumstances) made the rules conflict with one another.

- Having something like the SDN config, where there's a working configuration and a "draft" configuration that needs to be applied. So conflicts are allowed in drafts, but not in working configurations.

The SDN option seems too much for me here, but I just noticed some similarity.

I guess one of the first two makes more sense. If there are no arguments against this, I'd choose the second option, as we can always allow intentional conflicts later if there's user demand or we see other reasons for it.
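
A rough sketch of how the second option could behave on load (again with the placeholder check_feasibility() from above, and a 'state' property that is just a working idea):

sub update_rule_states {
    my ($rules) = @_;

    my $conflicts = check_feasibility($rules);

    for my $ruleid (keys %{$rules->{ids}}) {
        my $rule = $rules->{ids}->{$ruleid};

        if ($conflicts->{$ruleid}) {
            # visible to the user, but ignored by the manager
            $rule->{state} = 'conflict';
        } elsif (($rule->{state} // '') eq 'conflict') {
            # the conflict got resolved in the meantime, re-enable
            $rule->{state} = 'enabled';
        }
    }
}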


The only thing that I'm unsure about is how we would migrate the
`nofailback` option, since this operates on the group-level. If we keep
the `<node>(:<priority>)` syntax and restrict that each service can only
be part of one location rule, it'd be easy to have the same flag. If we
go with multiple location rules per service, each having a score or
weight (for the priority), then we wouldn't be able to have this flag
anymore. I think we could keep the semantics if we move this flag to the
service config, but I'm thankful for any comments on this.
My gut feeling is that going for a more direct mapping, i.e. each
location rule represents one HA group, is better. The nofailback flag
can still apply to a given location rule I think? For a given service,
if a higher-priority node is online for any location rule the service is
part of, with nofailback=0, it will get migrated to that higher-priority
node. It does make sense to have a given service be part of only one
location rule then though, since node priorities can conflict between
rules.

Yeah, I think this is the reasonable option too.

I briefly discussed this with @Fabian off-list and we also agreed that
it would be good to keep the mapping between HA groups and location
rules as close to 1:1 as possible and keep the nofailback per location
rule, as the behavior of the HA group's nofailback could still be
preserved, at least as long as there's only a single location rule per
service.
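
To illustrate the 1:1 mapping (property names aren't final, and whether
`restricted` maps to `strict` like below is still up for discussion),
a HA group such as:

group: prefer_node1
     nodes: node1:2,node2:1
     nofailback: 1
     restricted: 1

would become:

location: prefer_node1
     services: vm:101
     nodes: node1:2,node2:1
     nofailback: 1
     strict: 1

where the services list is collected from the resources that currently
reference the group.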

---

On the other hand, I'll have to take a closer look at whether we can do
something about the blockers when creating multiple location rules where
e.g. one has nofailback enabled and the other does not. As you already
said, they could easily conflict between rules...

My previous idea was to make location rules as flexible as possible, so
that it would theoretically not matter if one writes:

location: rule1
     services: vm:101
     nodes: node1:2,node2:1
     strict: 1
or:

location: rule1
     services: vm:101
     nodes: node1
     strict: 1

location: rule2
     services: vm:101
     nodes: node2
     strict: 1

Which one is more important could be encoded in the order in which the
rules are defined (easy when editing the config file directly, and I'd
add an API endpoint to realize this over the API/WebGUI too), or, maybe
even simpler to maintain, with just another property.

We cannot use just the order, because a user might want to give two
nodes the same priority. I'd also like to avoid an implicit
order-priority mapping.

Right, good point!


But then, the nofailback would either have to be moved to some other
place...

Or it is still allowed in location rules, but either the more detailed
rule wins (e.g. one rule has node1 without a priority and the other does
have node1 with a priority)

Maybe we should prohibit multiple rules with the same service-node pair?
Otherwise, my intuition says that all rules should be considered and the
rule with the highest node priority should win.

Yes, I think that would make the most sense, just like disallowing users from putting the same two or more services in multiple negative colocation rules.
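
For the case where we do consider all rules, the selection could look something like this (get_location_rules() being a hypothetical helper that returns all location rules containing the service):

sub effective_node_priorities {
    my ($rules, $sid) = @_;

    my $priorities = {};

    for my $rule (get_location_rules($rules, $sid)) {
        for my $node (keys %{$rule->{nodes}}) {
            my $prio = $rule->{nodes}->{$node}->{priority} // 0;

            # consider every rule; the highest priority for a node wins
            $priorities->{$node} = $prio
                if !exists($priorities->{$node}) || $prio > $priorities->{$node};
        }
    }

    return $priorities;
}

If we prohibit multiple rules with the same service-node pair at creation time, the maximum is only ever taken over disjoint node sets, so the result stays unambiguous either way.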


or the first location rule with a specific
node wins and the other is ignored. But this is already confusing when
writing it out here...

I'd prefer users to write the former (and make this the dynamic
'canonical' form when selecting nodes), but as with colocation rules it
could make sense to separate them for specific reasons / use cases.

Fair point.

And another reason why it could still make sense to go that way is to
allow "negative" location rules at a later point, which makes sense in
larger environments, where it's easier to write opt-out rules than
opt-in rules, so I'd like to keep that path open for the future.
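
Just as an illustration of what such an opt-out rule could look like
later (the syntax is completely made up at this point):

location: keep-off-node3
     services: vm:101
     nodes: node3
     affinity: negative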

We also discussed this off-list: Daniel convinced me that it would be
cleaner if the nofailback property were associated with a given service
rather than a given location rule. And if we later support pools as
resources, the property should be associated with (certain or all)
services in that pool and defined in the resource config for the pool.

To avoid the double negation with nofailback=0, it could also be renamed
to a positive property, below called "auto-elevate" (just a working name).

A small concern of mine was that this makes it impossible to have a
service that only "auto-elevates" to a specific node with a priority,
but not to others. However, this is already not possible right now, and
honestly, that would be quite strange behavior; not supporting it is
unlikely to hurt real use cases.
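
For illustration, in the resource config that could look like this
("auto-elevate" again being only the working name from above):

vm: 101
     state: started
     auto-elevate: 1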

