Hi!

Before I get to the actual question, a bit of context: for NixOS
we've recently been restructuring the units around PostgreSQL[0]. In
our first iteration we've got:

    # postgresql.target
    [Install]
    WantedBy=multi-user.target

    [Unit]
    BindsTo=postgresql.service
    BindsTo=postgresql-setup.service


    # postgresql.service
    [Unit]
    BindsTo=postgresql.target
    After=network.target


    # postgresql-setup.service
    [Unit]
    Requires=postgresql.service
    After=postgresql.service

    [Service]
    Type=oneshot
    RemainAfterExit=yes

The goals of this approach are:

* Restarting either the target or the server unit should trigger a
  restart of both.

* `postgresql.service` being "active" means the server is in _at
  least_ read-only mode (Type=notify is load-bearing here), and
  `postgresql.target` being "active" means the server is in read-write
  mode (that's one of the things postgresql-setup is for).

Now, when adding `Restart=always` to `postgresql.service` I noticed
that sometimes both the target and the service get restarted and
sometimes they don't. While investigating further[1] I noticed that
`BindsTo=` behaves differently from e.g. PartOf/Requires because of
the `UNIT_ATOM_CANNOT_BE_ACTIVE_WITHOUT` property in the systemd code
which - to my understanding - schedules an immediate stop of a unit
when the unit it is bound to gets stopped. When e.g. killing the
server, this races with `Restart=always`. The result is that
PostgreSQL sometimes doesn't come back up after being killed, despite
the `Restart=always`.
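
For reference, a sketch of the server unit in the configuration that
shows the race, i.e. the first iteration above plus the restart policy
(ExecStart= and all other settings are left out; Type=notify is taken
from the goals above):

    # postgresql.service (sketch, not the complete unit)
    [Unit]
    BindsTo=postgresql.target
    After=network.target

    [Service]
    Type=notify
    # The setting whose restart races with the stop scheduled via BindsTo=
    Restart=always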

Interestingly, I have one machine where killing the server reliably
takes down both unit and target, and another where the auto-restart
reliably happens (both are on the exact same nixpkgs commit, i.e. have
the exact same software running). The most notable difference is that
the former is essentially idle, so the best explanation I have is that
it somehow depends on how "busy" the service manager is.

To my understanding, using a combination of PartOf/Wants (or Requires)
appears to work around this because the Restart gets handled _after_
the stop is propagated[1], so the potential race is avoided
altogether.
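
A minimal sketch of what such a variant could look like, assuming
Requires= on the target and PartOf= on the two services. This is only
an illustration of the idea; the exact directives used in [1] may
differ, and ExecStart= etc. are left out again:

    # postgresql.target (sketch)
    [Install]
    WantedBy=multi-user.target

    [Unit]
    # Requires= instead of BindsTo=: an explicit stop/restart of the
    # services is still propagated to the target.
    Requires=postgresql.service
    Requires=postgresql-setup.service


    # postgresql.service (sketch)
    [Unit]
    # PartOf= instead of BindsTo=: a stop/restart of the target is
    # propagated to the service.
    PartOf=postgresql.target
    After=network.target

    [Service]
    Type=notify
    Restart=always


    # postgresql-setup.service (sketch)
    [Unit]
    PartOf=postgresql.target
    Requires=postgresql.service
    After=postgresql.service

    [Service]
    Type=oneshot
    RemainAfterExit=yes

The point is that no unit involved uses `BindsTo=` anymore, so nothing
gets stopped merely because another unit left the active state.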

What I want to get at: to me, this behavior was a little surprising. I
ended up diving into the systemd code to understand the exact
differences, because the man page doesn't cover that aspect very well.

The `BindsTo` section in `systemd.unit(5)` states that it's

> [...], very similar in style to Requires=. However, this dependency type
> is stronger: in addition to the effect of Requires= it declares that if the
> unit bound to is stopped, this unit will be stopped too.


To me this reads as if the "stronger" aspect of `BindsTo=` is that the
stop will be propagated, just as is the case with `PartOf=`. The
property that a unit "strictly has to be in active state" is only
mentioned for the combination of BindsTo & After below.

Now, my plan is to actually contribute a fix for this, but upon
starting I realized that I need some pointers:

* This seems like a bit of a special case: you need bidirectional
  dependencies and at least one unit involved needs `BindsTo`. Does it
  make sense to add this as another paragraph to `systemd.unit(5)`? Is
  there a better-suited place for it? Or is there even a reason that
  speaks for the status quo, i.e. not documenting potential
  implementation details?

* Am I on the right track with my observations? As mentioned above,
  I noticed today that this is reliably reproducible on an idle VM, but
  not on my workstation.

  Is there some insight I'm missing?
  Assuming what I wrote above is actually correct, are there any more
  details around this that you'd like to see documented?

Cheers!

Ma27

[0] https://github.com/NixOS/nixpkgs/pull/403645
[1] https://github.com/NixOS/nixpkgs/pull/424625
