Hi all,

As you may know, the lists were down.
The issue was trivial, but I didn’t notice, so I didn’t get them back up for a full week.

Timeline of events:

- 2023-10-22: Server reboots uncleanly (why?); after rebooting, mailman doesn’t come back up due to a stale lockfile
- ??: Janet Cobb notifies me on Discord (I don’t see this message; perhaps it was sent to the wrong user?)
- 2023-10-26: Janet Cobb notifies me by email (but I didn’t notice it; I see it now though)
- 2023-10-27: Late at night I notice due to ALT messages ending up in my inbox
- 2023-10-28: I procrastinate on dealing with the issue
- 2023-10-29: Janet Cobb notifies me on Mastodon, and I fix the issue

I did take some actions to prevent this from happening again:

- Changed systemd configuration to ask mailmanctl to automatically clean up stale locks.
- Added a CloudWatch alarm that specifically checks whether mailman qrunner processes are running. The issue actually triggered my existing alarm for any errors being logged in the Mailman log, but there are spurious error logs often enough that I’ve been too lazy to check up on it. The new alarm is less broad but also less prone to false positives.

However…

You all might want to consider the possibility of moving to groups.io. Don’t get me wrong, I’m happy to continue running the lists for another 10 years and beyond. But I have definitely been neglecting proper maintenance and monitoring, and that neglect will probably continue, leading to the possibility of more outages like this.

Up to you!

- omd