Hi,
for a while now I have been seeing strange network issues in some parts
of a particular system.
A build with src from 2023-07-26 was still working ok. An update to
2023-08-07 broke some parts in a strange way. Trying again with src
from 2023-08-11 didn't fix things.
What I see is... strange and complex.
I have a jail host with about 23 jails. All the jails are sitting on a
bridge, and have IPv6 and IPv4 addresses. One jail is a DNS server for a
domain which contains all the DNS entries for all the jails on the
system (and more). Other jails have mysql (the FS socket for mysql is
nullfs-mounted into other jails, so they connect to mysql via the FS
socket instead of the network), a dovecot IMAP server, a postfix SMTP
server, an nginx-based reverse proxy and 2 different kinds of webmail
solutions (an old php74-based one on the way out in favour of a
php81-based one), a wiki and other things.
With the old working basesystem I can log in to the old webmail system
and read mails. With the newer non-working basesystem I can still log in,
but the auth credentials are not stored in the backend session and as
such no mail is listed at all, as this requires subsequent connections
from php to dovecot. This webmail system goes via the reverse proxy
to the webmail jail, which has another nginx configured to connect to the
php-fpm backend.
With the new webmail system I can log in, read mails, and I am even
writing this email from it. The first login to it fails. The second
succeeds. It is not behind the reverse proxy (as it is not fully ready
yet for access from the outside (DSL with NAT on the DSL-box to the
reverse proxy)), but uses a single nginx with a php-fpm backend (instead
of 2 nginx + php-fpm as in the old webmail).
The wiki behind the reverse proxy is sometimes working, and sometimes
not. Sometimes it is providing everything, sometimes parts of the site
are missing (e.g. pictures / icons). Sometimes there is simply a blank
page, sometimes it gives an error message from the wiki about an
unforeseen bug...
The error message in the nginx reverse proxy log for all the strange
failure cases is "accept4() failed (53: Software caused connection
abort)". Sometimes I get "upstream timed out". When it times out in the
reverse proxy instead of getting the accept4-errors, I see the same
accept4-error message in the nginx inside the wiki or webmail jail
instead.
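
To see how the two messages are distributed between the reverse proxy
and the backend jails, the logs could be tallied with something like the
following. This is only a sketch: the function name tally_errors and the
labels are made up for illustration, and the matching is a plain
substring search on the two messages quoted above, not a parser for the
full nginx log format. For reference, errno 53 on FreeBSD is
ECONNABORTED, which accept(2) documents as a connection that arrived but
was closed while waiting on the listen queue.

```python
from collections import Counter

# Substrings copied from the two error messages quoted above.
# The labels on the left are hypothetical names used only for this tally.
PATTERNS = {
    "accept4_aborted": "accept4() failed (53: Software caused connection abort)",
    "upstream_timeout": "upstream timed out",
}


def tally_errors(lines):
    """Count how many of the given log lines contain each known message."""
    counts = Counter()
    for line in lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
    return counts
```

Running tally_errors() over an open error log file from each jail (the
file object can be passed directly, as it iterates line by line) would
show whether the aborts move from the proxy to the backend whenever the
proxy reports a timeout instead.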
I tried to recompile all the components of the wiki and reverse proxy
and php81 based webmail, to no avail. The issue persists.
Does this ring a bell for anyone? Maybe some network, socket or VM
changes in this timeframe which smell like they could be related
and may be good candidates for a back-out test? Any ideas how to drill
down with debugging to get a simpler test-case than the complex setup of
if_bridge, epair, jails, wiki, php, nginx, ...?
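
One possible shape for such a reduced test-case, as a sketch only: a
plain TCP echo server to run in one jail and a client loop to run in
another, which takes nginx, php and the wiki out of the picture while
keeping if_bridge, epair and the jails. The names echo_server and probe
are hypothetical, and nothing here is known to reproduce the problem; it
only counts which socket errors show up, if any.

```python
import socket
import threading


def echo_server(listener, stop):
    """Accept connections on `listener` and echo one message back each time.

    Runs until the threading.Event `stop` is set. Errors from accept()
    are printed, since those are what the nginx logs complain about.
    """
    listener.settimeout(0.2)
    while not stop.is_set():
        try:
            conn, _addr = listener.accept()
        except socket.timeout:
            continue  # no client yet, check `stop` again
        except OSError as exc:
            print("accept failed:", exc)
            if stop.is_set():
                break
            continue
        with conn:
            conn.settimeout(5.0)
            try:
                conn.sendall(conn.recv(1024))
            except OSError as exc:
                print("echo failed:", exc)


def probe(host, port, attempts=100, timeout=2.0):
    """Open `attempts` connections in a row; return (ok_count, error_counts)."""
    ok = 0
    errors = {}
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.sendall(b"ping")
                if s.recv(1024) == b"ping":
                    ok += 1
                else:
                    errors["bad reply"] = errors.get("bad reply", 0) + 1
        except OSError as exc:
            key = f"{type(exc).__name__}: {exc}"
            errors[key] = errors.get(key, 0) + 1
    return ok, errors
```

The server side would bind a listening socket to the jail's address,
start echo_server() in a thread, and the other jail would call probe()
with that address and port. If this shows failures on the new
basesystem and none on the old one, the remaining setup is small enough
to bisect with.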
Bye,
Alexander.
--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org : PGP 0x8F31830F9F2772BF