Re: Resource temporarily
On 22.12.2021 at 21:01, Phil Stracchino wrote:
> On 12/22/21 12:55, Wietse Venema wrote:
>> In this case Postfix is (also) overloading the MySQL server.
>>
>> - Get a more powerful system (or VM) for the MySQL server.
>>
>> - Reduce the workload per MySQL server (spread the load across
>>   multiple servers).
>
> Perhaps first of all, make sure that mysqld is properly tuned. 90% of
> small MySQL/MariaDB deployment performance problems can be resolved
> simply by properly tuning it for the available resources.
>
> But if you're overloading a single MySQL instance, consider using a
> Galera cluster (either MySQL or MariaDB) behind ProxySQL or HAproxy.
> Read performance on a Galera cluster scales approximately linearly
> with the number of nodes, and nodes can be more-or-less transparently
> added and dropped on demand.
>
> (Also, this gives you transparent DB redundancy in the case that a
> node crashes or needs to be taken offline for maintenance.)

I already have a Galera cluster with 3 nodes and haproxy.
--
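For readers who want to try the HAProxy front end Phil describes, a minimal haproxy.cfg sketch for spreading Postfix's MySQL lookups across three Galera nodes could look like the following. The listener address, node names, and 10.x.x.11-13 addresses are placeholders rather than values from this thread, and a production setup would normally use a Galera-aware health check (for example a clustercheck-style HTTP check) instead of the plain TCP checks shown here.

    listen mysql-galera
        bind 127.0.0.1:3306
        mode tcp
        balance leastconn
        option tcpka
        server node1 10.x.x.11:3306 check
        server node2 10.x.x.12:3306 check
        server node3 10.x.x.13:3306 check

Postfix's mysql: table clients would then point at 127.0.0.1 instead of a single database host.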
Re: Resource temporarily
On 23.12.2021 at 01:53, raf wrote:
> On Wed, Dec 22, 2021 at 11:25:10AM +0100, natan wrote:
>
>> On 21.12.2021 at 18:15, Wietse Venema wrote:
>> 10.x.x.10 - is a Galera cluster with 3 nodes (and max_con set to 1500
>> for each node)
>>
>> when I get this error I check the number of connections
>>
>> smtpd : 125
>>
>> smtp inet n - - - 1 postscreen
>> smtpd pass - - - - - smtpd -o
>>   receive_override_options=no_address_mappings
>>
>> and total: amavis+lmtp-dovecot+smtpd -o
>>   receive_override_options=no_address_mappings : 335
>> from: ps -e|grep smtpd |wc -l
>>

but:
for local lmtp port 10025 - 5 connections
for incoming from amavis port 10027 - 132 connections
smtpd - 60 connections
(ps -e|grep smtpd - 196 connections)

>>> 1) You show two smtpd process counts. What we need are the
>>> internet-related smtpd process counts.
>>>
>>> 2) Network traffic is not constant. What we need are process counts
>>> at the time that postscreen logs the warnings.
>>>
> 2) Your kernel cannot support the default_process_limit of 1200.
> In that case a higher default_process_limit would not help. Instead,
> kernel configuration or more memory (or both) would help.

5486 ?        Ss     6:05 /usr/lib/postfix/sbin/master
cat /proc/5486/limits

>>> Those are PER-PROCESS resource limits. I just verified that postscreen
>>> does not run into the "Max open files" limit of 4096 as it tries
>>> to hand off a connection, because that would result in an EMFILE
>>> (Too many open files) kernel error code.
>>>
>>> Additionally there are SYSTEM-WIDE limits for how much the KERNEL
>>> can handle. These are worth looking at when you're trying to handle
>>> big traffic on a small (virtual) machine.
>>>
>>> Wietse
>> How do I check?
> Googling "linux system wide resource limits" shows a
> lot of things including
> https://www.tecmint.com/increase-set-open-file-limits-in-linux/
> which mentions sysctl, /etc/sysctl.conf, ulimit, and
> /etc/security/limits.conf.
>
> Then I realised that the problem is with process limits,
> not open file limits, but the same methods apply.
>
> On my VM, the hard and soft process limits are 3681:
>
> # ulimit -Hu
> 3681
> # ulimit -Su
> 3681
>
> Perhaps yours is less than that.
>
> To change it permanently, add something like the
> following to /etc/security/limits.conf (or to a file in
> /etc/security/limits.d/):
>
> * hard nproc 4096
> * soft nproc 4096
>
> Note that this is assuming Linux, and assuming that your
> server will be OK with increasing the process limit. That
> might not be the case if it's a tiny VM being asked to
> do too much. Good luck.
>
> cheers,
> raf
>
Raf, I have:

# ulimit -Hu
257577
# ulimit -Su
257577

7343 ?        Rs    24:22 /usr/lib/postfix/sbin/master

# cat /proc/7343/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             257577               257577               processes
Max open files            4096                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       257577               257577               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

These are the real limits for /usr/lib/postfix/sbin/master.
--
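For reference, the system-wide limits Wietse refers to live in the kernel rather than in /proc/<pid>/limits. A few commands that may help inspect them on a typical Linux box (the sysctl names assume a reasonably recent kernel) are:

    # upper bounds on PIDs and on the total number of tasks
    sysctl kernel.pid_max kernel.threads-max
    # system-wide open file limit and current usage
    sysctl fs.file-max fs.file-nr
    # free memory; fork failures under load are often really memory failures
    free -m

How much of those limits is actually usable still depends on available RAM and swap, which is why a small VM can run out of processes well before kernel.threads-max is reached.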
Re: Resource temporarily
On Thu, Dec 23, 2021 at 09:52:05AM +0100, natan wrote:

> [...]
>
> Raf, I have:
>
> # ulimit -Hu
> 257577
> # ulimit -Su
> 257577
>
> 7343 ?        Rs    24:22 /usr/lib/postfix/sbin/master
>
> # cat /proc/7343/limits
> Limit                     Soft Limit           Hard Limit           Units
> [...]
> Max processes             257577               257577               processes
> Max open files            4096                 4096                 files
> [...]
>
> These are the real limits for /usr/lib/postfix/sbin/master.

That looks like it should be plenty of processes,
as long as the server can really support that many.

You could test it with something like this:

#!/usr/bin/env perl
use warnings;
use strict;
my $max_nprocs = 8000;
my $i = 0;
while ($i < $max_nprocs)
{
    $i++;
    my $pid = fork();
    die "fork #$i failed: $!\n" unless defined $pid;
    sleep(10), exit(0) if $pid == 0;
}
print "$i forks succeeded\n";

For example, a VM here reports 7752 for ulimit -Su,
but the above script failed on the 3470th fork.

cheers,
raf
Re: Resource temporarily
On 23.12.2021 at 12:12, raf wrote:

> [...]
>
> That looks like it should be plenty of processes,
> as long as the server can really support that many.
>
> You could test it with something like this:
>
> #!/usr/bin/env perl
> use warnings;
> use strict;
> my $max_nprocs = 8000;
> my $i = 0;
> while ($i < $max_nprocs)
> {
>     $i++;
>     my $pid = fork();
>     die "fork #$i failed: $!\n" unless defined $pid;
>     sleep(10), exit(0) if $pid == 0;
> }
> print "$i forks succeeded\n";
>
> For example, a VM here reports 7752 for ulimit -Su,
> but the above script failed on the 3470th fork.
>
> cheers,
> raf

On the machine with postfix:

time ./1.py
12000 forks succeeded

real    0m1,365s
user    0m0,088s
sys     0m1,276s
--
Re: After network outage postfix found not running
Bob Proulx:
> Wietse Venema wrote:
> > Bob Proulx:
> > > Any ideas on why postfix would not be running after such an event on
> > > two of the systems but okay on the others?
> >
> > LOGS. Postfix logs a sh*load, including processes that fail to
> > start. If the systems were unable to record this in LOGS, then you
> > will never know.
>
> I guess we will never know then. Because I showed the relevant logs.
> I would have showed more but the large message was rejected due to
> size. But there wasn't anything more clueful than the logs I showed.

Postfix was only the messenger of bad news. It does not
spontaneously self-destruct.

	Wietse
Re: [PATCH 2/3] Fix parallel build dependencies
On Wed, 22 Dec 2021 at 22:21, Wietse Venema wrote:
>
> Christian Göttsche:
> > Plugin shared util objects require the global util object to be build.
>
> What was the make command?

/usr/bin/make -j2 LD_LIBRARY_PATH=$(pwd)/lib:${LD_LIBRARY_PATH}

see https://salsa.debian.org/cgzones/postfix-dev/-/jobs/2304623/raw
for a failed build log
Re: After network outage postfix found not running
Bob Proulx:
> Any ideas on why postfix would not be running after such an event on
> two of the systems but okay on the others?

Wietse Venema wrote:
> LOGS. Postfix logs a sh*load, including processes that fail to
> start. If the systems were unable to record this in LOGS, then you
> will never know.

On 22.12.21 21:41, Bob Proulx wrote:
> I guess we will never know then. Because I showed the relevant logs.
> I would have showed more but the large message was rejected due to
> size. But there wasn't anything more clueful than the logs I showed.
>
> It's not terribly important. It was just an oddity. Because Postfix
> is so very reliable that it was unusual to see on two systems it had
> stopped. But again it is very unusual to have the root file system
> blocking for so long.

It's still possible that:
- postfix was killed by e.g. the OOM killer, in which case it could not log that.
- the logs were lost because of systemd's log limits.

There are multiple lines of postfix/master; it also could be systemd
restarting postfix and giving up after some time.

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
BSE = Mad Cow Desease ... BSA = Mad Software Producents Desease
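If journald rate limiting is the suspect, one way to check (a sketch, assuming systemd-journald is in use; the date, interval, and burst values below are only examples) is to look for the journal's own "Suppressed ... messages" notices around the outage, and to relax the limits if they show up:

    journalctl -u systemd-journald --since "3 days ago" | grep -i suppressed

    # /etc/systemd/journald.conf
    [Journal]
    RateLimitIntervalSec=30s
    RateLimitBurst=10000

followed by restarting systemd-journald.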
Re: After network outage postfix found not running
Demi Marie Obenour:
> My intuition is that either some timeout somewhere got hit, or that
> some I/O failed (rather than being queued forever) and caused an error
> paging in some code. That would cause Postfix to die with SIGBUS.

If the file system was unavailable, then yes, failure to page in
some code would be fatal.

> Do you have Postfix set to automatically be restarted if it crashes?

I expect that the restart would fail for the same reason as you
describe above.

	Wietse
Re: [PATCH 2/3] Fix parallel build dependencies
Christian Göttsche:
> On Wed, 22 Dec 2021 at 22:21, Wietse Venema wrote:
> >
> > Christian Göttsche:
> > > Plugin shared util objects require the global util object to be build.
> >
> > What was the make command?
>
> /usr/bin/make -j2 LD_LIBRARY_PATH=$(pwd)/lib:${LD_LIBRARY_PATH}
>
> see https://salsa.debian.org/cgzones/postfix-dev/-/jobs/2304623/raw
> for a failed build log

The bug is that you're linking Postfix database plugins with
libpostfix-util or libpostfix-global. That is not supported.

You have:

    AUXLIBS_CDB="-lcdb -L../../lib -L. -lpostfix-util" \
    AUXLIBS_LDAP="-lldap -llber -L../../lib -L. -lpostfix-util -lpostfix-global" \
    AUXLIBS_LMDB="-llmdb -L../../lib -L. -lpostfix-util" \
    AUXLIBS_MYSQL="-lmysqlclient -L../../lib -L. -lpostfix-util -lpostfix-global" \
    AUXLIBS_PCRE="-lpcre -L../../lib -L. -lpostfix-util" \
    AUXLIBS_PGSQL="-lpq -L../../lib -L. -lpostfix-util -lpostfix-global" \
    AUXLIBS_SQLITE="-lsqlite3 -L../../lib -L. -lpostfix-util -lpostfix-global -lpthread" \

You should have:

    AUXLIBS_CDB="-lcdb"
    AUXLIBS_LDAP="-lldap -llber"
    AUXLIBS_LMDB="-llmdb"
    AUXLIBS_MYSQL="-lmysqlclient"
    AUXLIBS_PCRE="-lpcre"
    AUXLIBS_PGSQL="-lpq"
    AUXLIBS_SQLITE="-lsqlite3"

Also the following is unnecessary:

    make -j2 LD_LIBRARY_PATH=$(pwd)/lib:${LD_LIBRARY_PATH}

Instead, remove the LD_LIBRARY_PATH stuff and do this:

    make -j2

I'll add a check to makedefs to fail the build with an UNSUPPORTED
error if it sees that database plugins are linked with libpostfix-*.

I'll also fix the makedefs check to reject LD_LIBRARY_PATH settings.

	Wietse
Re: [PATCH 2/3] Fix parallel build dependencies
On Thu, 23 Dec 2021 at 20:49, Wietse Venema wrote:
>
> Christian Göttsche:
> > On Wed, 22 Dec 2021 at 22:21, Wietse Venema wrote:
> > >
> > > Christian Göttsche:
> > > > Plugin shared util objects require the global util object to be build.
> > >
> > > What was the make command?
> >
> > /usr/bin/make -j2 LD_LIBRARY_PATH=$(pwd)/lib:${LD_LIBRARY_PATH}
> >
> > see https://salsa.debian.org/cgzones/postfix-dev/-/jobs/2304623/raw
> > for a failed build log
>
> The bug is that you're linking Postfix database plugins with
> libpostfix-util or libpostfix-global. That is not supported.
>
> You have:
>
> [...]
>
> You should have:
>
>     AUXLIBS_CDB="-lcdb"
>     AUXLIBS_LDAP="-lldap -llber"
>     AUXLIBS_LMDB="-llmdb"
>     AUXLIBS_MYSQL="-lmysqlclient"
>     AUXLIBS_PCRE="-lpcre"
>     AUXLIBS_PGSQL="-lpq"
>     AUXLIBS_SQLITE="-lsqlite3"

Thanks, this works.

> Also the following is unnecessary:
>
>     make -j2 LD_LIBRARY_PATH=$(pwd)/lib:${LD_LIBRARY_PATH}
>
> Instead, remove the LD_LIBRARY_PATH stuff and do this:
>
>     make -j2

True, seems to be not necessary.

> I'll add a check to makedefs to fail the build with an UNSUPPORTED
> error if it sees that database plugins are linked with libpostfix-*.
>
> I'll also fix the makedefs check to reject LD_LIBRARY_PATH settings.
>
>     Wietse

Thanks, please disregard those two sent patches.
Re: message_size_limit documentation
Scott Kitterman:
> Currently, postconf.5 has this to say about message_size_limit:
>
> message_size_limit (default: 10240000)
>
>     The maximal size in bytes of a message, including envelope information.
>
>     Note: be careful when making changes. Excessively small values will result
>     in the loss of non-delivery notifications, when a bounce message size exceeds
>     the local or remote MTA's message size limit.
>
> It documents the default, but not the maximum.

The maximum is determined by (kernel) resource limits, file system
sizes, and...

> Apparently there is one (and
> who would care, one of Debian's users, apparently [1]). I'm not particularly
> confused about why there would be a maximum, but it might be reasonable to
> document what it is. Perhaps add something like "Maximum value is
> 2147483647." at the end of the note so that users don't have to find out the
> hard way:
>
> fatal: bad numerical configuration: message_size_limit = 2147483648

That is the LONG_MAX value for 32-bit machines. It's much bigger
for 64-bit systems.

I guess we could put that in the manpage. I have an old wishlist
item to migrate file sizes from long to off_t (which is 64 bits on
most systems).

But that is a lot of effort, and I was kind-of hoping that 32-bit
systems will go away.

	Wietse

> Scott K
>
> [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=960272
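As an aside, the limit Wietse describes is easy to check on a given machine without reading <limits.h>: getconf reports LONG_MAX, and postconf reports the current setting. The output below is what a typical 64-bit Linux host prints; a 32-bit host would print 2147483647 instead.

    $ getconf LONG_MAX
    9223372036854775807
    $ postconf message_size_limit
    message_size_limit = 10240000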
Re: [PATCH 2/3] Fix parallel build dependencies
Christian Göttsche:
> > I'll add a check to makedefs to fail the build with an UNSUPPORTED
> > error if it sees that database plugins are linked with libpostfix-*.
> >
> > I'll also fix the makedefs check to reject LD_LIBRARY_PATH settings.
>
> Thanks, please disregard those two sent patches.

No problem. I did take your \' fixes. This text was written when
my primary platform was nroff based. I have verified that the
backslash wasn't needed there, too. So we have taught each other
that less can be better.

	Wietse
Re: message_size_limit documentation
On Thursday, December 23, 2021 3:51:57 PM EST Wietse Venema wrote:
> Scott Kitterman:
> > Currently, postconf.5 has this to say about message_size_limit:
> >
> > [...]
> >
> > It documents the default, but not the maximum.
>
> The maximum is determined by (kernel) resource limits, file system sizes,
> and...
>
> > Apparently there is one (and
> > who would care, one of Debian's users, apparently [1]). I'm not
> > particularly confused about why there would be a maximum, but it might be
> > reasonable to document what it is. Perhaps add something like "Maximum
> > value is 2147483647." at the end of the note so that users don't have to
> > find out the hard way:
> >
> > fatal: bad numerical configuration: message_size_limit = 2147483648
>
> That is the LONG_MAX value for 32-bit machines. It's much bigger
> for 64-bit systems.
>
> I guess we could put that in the manpage. I have an old wishlist
> item to migrate file sizes from long to off_t (which is 64 bits on
> most systems).
>
> But that is a lot of effort, and I was kind-of hoping that 32-bit
> systems will go away.

Thanks. I don't think it's worth a lot of effort. I'd imagine it's a pretty
niche use case to send multi-gigabyte files via SMTP. People do do it though
(clearly, or there wouldn't be a bug).

I wrestled with a few options for a simple explanation, but didn't come up
with anything I particularly liked. I think it's correct that there's a hole
in the documentation, but I don't have a good recommendation on how to fill
it.

Scott K
Re: After network outage postfix found not running
Could a watchdog timer have killed master(8) if it were suspended
long enough?

> On 23 Dec 2021, at 1:57 pm, Wietse Venema wrote:
>
>> My intuition is that either some timeout somewhere got hit, or that
>> some I/O failed (rather than being queued forever) and caused an error
>> paging in some code. That would cause Postfix to die with SIGBUS.
>
> If the file system was unavailable, then yes, failure to page in
> some code would be fatal.

-- 
Viktor.
Re: message_size_limit documentation
Scott Kitterman:
> Thanks. I don't think it's worth a lot of effort. I'd imagine it's a pretty
> niche use case to send multi-gigabyte files via SMTP. People do do it though
> (clearly, or there wouldn't be a bug).
>
> I wrestled with a few options for a simple explanation, but didn't come up
> with anything I particularly liked. I think it's correct that there's a hole
> in the documentation, but I don't have a good recommendation on how to fill
> it.

In Postfix 3.7 I have updated the text for message_size_limit.

message_size_limit (default: 10240000)

    The maximal size in bytes of a message, including envelope
    information. The value cannot exceed LONG_MAX (typically, a
    32-bit or 64-bit signed integer).

Ditto for mailbox_size_limit.

	Wietse
Re: After network outage postfix found not running
Matus UHLAR - fantomas wrote:
> it's still possible that:
> - postfix was killed by e.g. OOM killer, in which case it could not log that.

I disable the OOM killer with vm.overcommit_memory = 2, so that
particular thing won't be it.

> - the logs were lost because of systemd's log limits

That is possible. The two failing systems were ones running systemd.
I am not a fan. I am looking at rsyslog logging.

> there are multiple lines of postfix/master.
>
> it also could be systemd restarting postfix and giving up after some time

I don't believe systemd will try to restart postfix.

Good ideas though. Thank you for brainstorming along with me.

Bob
Re: After network outage postfix found not running
Wietse Venema wrote:
> Postfix was only the messenger of bad news. It does not
> spontaneously self-destruct.

I have always found Postfix to be extremely reliable and robust.
Which was why this happening on two different systems was such an
oddity.

Bob
Re: After network outage postfix found not running
Viktor Dukhovni wrote:
> Could a watchdog timer have killed master(8) if it were suspended
> long enough?

Seems plausible. I could see something in the code timing out since
things would be blocked waiting for I/O for so long.

> Demi Marie Obenour:
> > My intuition is that either some timeout somewhere got hit, or that
> > some I/O failed (rather than being queued forever) and caused an error
> > paging in some code. That would cause Postfix to die with SIGBUS.
>
> If the file system was unavailable, then yes, failure to page in
> some code would be fatal.

This is a good brainstorm. I wasn't thinking about the swap side of
memory. It seems very plausible to me that a paged-out block might
have been needed. And that might have timed out and been reported as
an I/O failure. Which would have killed the process. Or possibly the
reverse: the system may have tried to page out a block and the
writing of that block may have timed out as well.

> > Do you have Postfix set to automatically be restarted if it crashes?

No. Postfix is very reliable and robust. It has never been needed.
And I think I will resist the urge to add automated restarting of
postfix now too. Because this was a very unusual situation. I know
we always fight the last war. I doubt this will be a repeating
problem. But it would add a layer of snag that another admin might
not be expecting.

Plus I have now learned that if the network is offline for any
significant time then all affected systems should be rebooted as a
precaution. And a reboot is always okay. Systems reboot just fine.

Instead I think I will add a watchdog of some sort that would
automatically detect this type of network attached storage outage
and then automatically reboot the system if it detects that it is
recovering from such a state. That's harder to do. But it solves the
problem for the entire system globally.

> I expect that the restart would fail for the same reason as you
> describe above.

I would expect that it would block waiting for I/O and simply wait to
start. It would stack up as another process that increases the load
average. And then eventually, when the disk request was serviced, it
would continue and start then.

Thank you everyone for brainstorming along with me. It's a good
learning experience. And I think I know I need a way to detect that
the network attached block storage has been offline too long and
that the system, when recovered from that, needs to be rebooted.

Bob
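A very rough sketch of the kind of storage watchdog Bob describes might look like the script below, run from cron or a systemd timer. It is only an illustration: the mount point, probe file name, deadline, and the decision to reboot outright are all assumptions that would need to be adapted and tested carefully before being trusted on a real system.

    #!/bin/sh
    # Probe the network-attached filesystem; if a small write cannot
    # complete within the deadline, assume the storage has stalled for
    # too long and reboot.  MOUNTPOINT and DEADLINE are placeholders.
    # ("sync -f" is the GNU coreutils per-filesystem sync.)
    MOUNTPOINT=/var/spool
    DEADLINE=120   # seconds
    if ! timeout "$DEADLINE" sh -c \
        "date > '$MOUNTPOINT/.storage-probe' && sync -f '$MOUNTPOINT/.storage-probe'"
    then
        logger -t storage-watchdog "write probe on $MOUNTPOINT stalled; rebooting"
        /sbin/reboot
    fi

A gentler variant could require several consecutive failures, or merely alert, before taking the machine down.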
Re: After network outage postfix found not running
On Thu, 23 Dec 2021 17:16:10 -0700 Bob Proulx wrote:
> Wietse Venema wrote:
> > Postfix was only the messenger of bad news. It does not
> > spontaneously self-destruct.
>
> I have always found Postfix to be extremely reliable and robust.
> Which was why this happening on two different systems was such an
> oddity.
>
> Bob

From my own observations on Debian: systemd's default config does not
wait for the network before starting postfix and will not retry. If it
is actually set up to wait, then systemd is ignoring that bit.

--
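For completeness, the kind of drop-in override that would make a systemd-managed Postfix wait for the network and retry after a failure looks roughly like the following, created with "systemctl edit postfix". Whether a given distribution's unit already orders itself this way, and whether automatic restarts are desirable at all (Bob argues above that they may not be), is a separate question.

    [Unit]
    Wants=network-online.target
    After=network-online.target

    [Service]
    Restart=on-failure
    RestartSec=10s

Note that network-online.target only has an effect if a *-wait-online service (NetworkManager-wait-online or systemd-networkd-wait-online) is enabled on the host.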
Re: Resource temporarily
On Thu, Dec 23, 2021 at 12:34:20PM +0100, natan wrote:
> On 23.12.2021 at 12:12, raf wrote:
> > That looks like it should be plenty of processes,
> > as long as the server can really support that many.
> >
> > You could test it with something like this:
> >
> > #!/usr/bin/env perl
> > use warnings;
> > use strict;
> > my $max_nprocs = 8000;
> > my $i = 0;
> > while ($i < $max_nprocs)
> > {
> >     $i++;
> >     my $pid = fork();
> >     die "fork #$i failed: $!\n" unless defined $pid;
> >     sleep(10), exit(0) if $pid == 0;
> > }
> > print "$i forks succeeded\n";
> >
> > For example, a VM here reports 7752 for ulimit -Su,
> > but the above script failed on the 3470th fork.
> >
> > cheers,
> > raf
>
> On the machine with postfix:
>
> time ./1.py
> 12000 forks succeeded
>
> real    0m1,365s
> user    0m0,088s
> sys     0m1,276s

That looks like it should be enough.
Sorry, I'm out of ideas.

cheers,
raf