On 2025-11-11 00:10, Konstantin Belousov wrote:
On Mon, Nov 10, 2025 at 11:16:01AM -0800, James Gritton wrote:
On 2025-11-10 04:27, Andriy Gapon wrote:
> I played a little bit with OCI containers and podman.
> I had a hiccup with one specific container created for Docker / Linux.
> Its difference from other containers is that it uses multiple daemons
> and a supervisor process to take care of them.  That particular
> supervisor is another variation of "advanced init", it's called s6.
> Apparently, it is relatively popular for container use (not sure about
> host systems).  Probably other alternatives can be / are used for that
> purpose as well.
>
> I think that this is what a supervisor in a container needs:
> 1. its PID is 1;
> 2. orphaned processes get re-parented to it.
>
> I think that (1) is not a hard requirement, but it's an easy way to
> check if the process would be able to work as init.
> Also, some other processes might expect to find init at PID 1, but I am
> not sure about that.
>
> (2) is important for doing the supervising (at least, when
> procctl(PROC_REAP*) is not used).
>
> I think that on Linux they have separate PID namespace per container, so
> the first process to run naturally gets PID 1.
>
> I think that per-container PID namespace may be an overkill.
> Maybe there is a way to make PID 1 special without going that way.
>
> E.g., a jail could record the first process it runs.
> We can patch up getpid() to return 1 for that process.
> Also, we could patch up the process lookup to return the first process
> in the jail for PID 1.
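
To make the getpid() and lookup idea above concrete, here is a minimal user-space sketch in C. It is only a model of the proposed translation, not kernel code: `struct jail_model`, `jail_visible_pid()` and `jail_lookup_pid()` are hypothetical names invented for illustration and do not exist in FreeBSD.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-jail record; not an actual FreeBSD kernel structure. */
struct jail_model {
	pid_t init_pid;	/* host PID of the first process run in the jail, 0 if none */
};

/* What a patched getpid() would report to a jailed caller. */
static pid_t
jail_visible_pid(const struct jail_model *j, pid_t host_pid)
{
	assert(j != NULL);
	if (j->init_pid != 0 && host_pid == j->init_pid)
		return (1);
	return (host_pid);
}

/* Map a PID named by a jailed caller (e.g. in kill(2)) to a host PID. */
static pid_t
jail_lookup_pid(const struct jail_model *j, pid_t requested)
{
	assert(j != NULL);
	if (j->init_pid != 0 && requested == 1)
		return (j->init_pid);
	return (requested);
}
```

With `init_pid` set to 4242, the jail's first process sees itself as PID 1, and a jailed kill(1, ...) would resolve to host PID 4242; every other PID passes through unchanged. The sketch leaves open what getppid() should return for children of that process, and whether the host PID of the jail init remains visible inside the jail.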
>
> Re-parenting to the "jail init" sounds harder but should be possible as
> well (e.g., using PROC_REAP).
This is why PROC_REAP was initially implemented: to allow something to
manage zombies of all its descendants, for surrogate init processes.
Later it appeared that at least timeout(1) benefits from it as well.
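
As an illustration of the re-parenting behaviour discussed here, the following C sketch acquires reaper status and then checks which process adopts an orphaned grandchild. It uses procctl(PROC_REAP_ACQUIRE) on FreeBSD; the Linux branch with prctl(PR_SET_CHILD_SUBREAPER) is added only so the sketch can be tried elsewhere. The helper names `become_reaper()` and `orphan_adopter()` are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>
#if defined(__FreeBSD__)
#include <sys/procctl.h>
#elif defined(__linux__)
#include <sys/prctl.h>
#endif

/* Make the calling process the reaper of its descendants. */
static int
become_reaper(void)
{
#if defined(__FreeBSD__)
	return (procctl(P_PID, getpid(), PROC_REAP_ACQUIRE, NULL));
#elif defined(__linux__)
	return (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0));
#else
	errno = ENOSYS;
	return (-1);
#endif
}

/*
 * Fork a child that forks a grandchild and exits at once, orphaning it.
 * The grandchild reports its new parent through a pipe.  Returns the
 * PID that adopted the orphan, or -1 on failure.
 */
static pid_t
orphan_adopter(void)
{
	int fds[2];
	pid_t adopter = -1, child;

	if (pipe(fds) == -1)
		return (-1);
	if ((child = fork()) == -1)
		return (-1);
	if (child == 0) {
		pid_t middle = getpid();

		close(fds[0]);
		if (fork() == 0) {
			struct timespec ts = { 0, 1000000 };	/* 1 ms */

			/* Wait until the intermediate parent is gone. */
			while (getppid() == middle)
				nanosleep(&ts, NULL);
			adopter = getppid();
			if (write(fds[1], &adopter, sizeof(adopter)) !=
			    (ssize_t)sizeof(adopter))
				_exit(1);
			_exit(0);
		}
		_exit(0);	/* orphans the grandchild */
	}
	close(fds[1]);
	if (read(fds[0], &adopter, sizeof(adopter)) !=
	    (ssize_t)sizeof(adopter))
		adopter = -1;
	close(fds[0]);
	/* Collect the child and, if we are the reaper, the adopted orphan. */
	while (waitpid(-1, NULL, 0) > 0 || errno == EINTR)
		;
	return (adopter);
}
```

After a successful `become_reaper()`, `orphan_adopter()` should return the caller's own PID; without reaper status the orphan is handed to the nearest reaper ancestor instead, normally the real init.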

Good, that would make it that much easier to implement.  It wasn't
there when I did this in the early 2000s (I said a decade ago, but
time passes faster than I give it credit for).

A side note: machinery to reliably signal all specific descendants of
the reaper is way too complicated.

>
> Not sure what to do if the "jail init" dies... should all processes in
> the jail get killed and the jail should die as well (unless persistent)?
>
> This proposal sounds like a kludge but it could be a shortcut to support
> more Linux containers and to allow similar FreeBSD jails / containers
> with alternative init-s / supervisors.

Far from being a kludge, I think it's a feature we need, and one at the top
of my list.  Forcing it to look like PID 1 from the jailed perspective is
definitely doable (and something I'd done outside of the project a decade
ago).  In addition to those two requirements, I would add one that answers
your last question:

3. signals to init and reboot(2) work as they would on the host side.

A jailed reboot would kill all processes and restart rc, and possibly do
other kernel-side cleanups yet to be clearly defined.  A jailed halt would
remove the jail.  A jailed single-user mode could exist where instead of
init spawning a shell, it just sits around while the system has a chance to
jexec into it.

init handles various signals by rebooting/halting/etc, and it should be
able to do that as it does now, by calling reboot(2), directing the kernel
to do what it needs to with the jail.  If init goes away, it's probably
like a halt and removes the jail.

I completely disagree with this design, I insist that init(8) should
stay as full system init, and reboot(2) should be kept as the machine
reboot.

Why?

With the system call hooks, init(8) was nearly 100% identical.
There was a place or two that needed to be context-aware, which
is very easy to add.  It seems silly to re-implement init with
just a couple of changes.

reboot(2) wouldn't be the first system call to act differently for
jailed access.  Is doing a useful thing for jails worse than just
doing nothing?  I see in this the beauty of a container that moves
that much closer to feeling like a virtual machine, while retaining
its lightweight nature.  The ideal is that jails "just work," and
working at the syscall level is part of that.

For jail-contained inits, it should be a separate/dedicated implementation
of init.  It would be aware of its usage model, in particular, it should
proclaim itself the reaper, it should use reaper signalling facilities
for killing processes when shutting the container down (not ever tweaking
the reboot(2)).  It must not have the ugly protection against signal
delivery we have for real init.

I haven't looked into that protection, so I'm neutral on that for
now.  It makes sense to exempt virtual init from virtual killall for
example, but I wouldn't expect to just not deliver certain signals.
I don't recall how I dealt with that specific issue 20 years ago.

- Jamie
