DIS: Another list outage postmortem

omd via agora-discussion Fri, 26 Jul 2024 09:43:37 -0700

This one was pretty silly.

On 2024-7-1, I performed some upgrades on the machine hosting the lists.  I 
updated Debian as well as my installation of the SMTP server Haraka.  After 
doing so, I went to some effort to manually verify that the lists were still 
working, and even fixed a preexisting minor issue that only affected me.  
However, this upgrade broke the aws CLI that I was using for CloudWatch alerts.  
I was in a hurry, so I left it that way for the time being.

On 2024-7-16, for the first time since the upgrade, Haraka tried to auto-update 
the Public Suffix List in its own source code directory, which failed because I 
intentionally restricted the user running Haraka to read-only access to its 
source code.  I’m not sure whether this auto-update behavior was somehow not 
happening before (it seems to have been added five years ago [1], and the 
previously-running version isn’t *that* old), or whether I dealt with this for 
the previous version and then forgot having done so.

In any case, the auto-update failure promptly terminated the whole Haraka 
process.  systemd did not restart it automatically (turns out that unlike 
launchd, systemd requires explicitly requesting auto-restart), so it stayed 
down.  I didn’t have a CloudWatch alert for Haraka being down, and even if I 
had, I wouldn’t have gotten it because my CloudWatch alerts were still down 
(I’d gotten lazy about fixing them).  And as far as I can tell, nobody 
notified me manually until Random Internet Cat sent me a private Mastodon 
message on 7-22, and I didn’t even see that message until later because I 
wasn’t checking Mastodon.  There were some ALT messages sent as a result of the 
downtime that reached my email inbox, but to make things worse, I wasn’t 
checking my inbox.  (I recommend messaging me on Discord if you want to get my 
attention.)

So I didn’t know the lists were down until I happened to check my email and 
notice the chatter, I think on 7-21.  On 7-22 I fixed the issue (well, more 
like hacked around it) by giving the user write access to that directory, and 
started Haraka back up.  Then I didn’t get around to writing up what happened 
until today.

Also today, I fixed the CloudWatch alerts, added an alert for Haraka or other 
processes being down, set Haraka to auto-restart, performed additional 
upgrades, and enabled TLS support for both incoming and outgoing mail.  
Hopefully none of that breaks anything.
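
For anyone curious what explicitly requesting auto-restart looks like in 
systemd, here is a minimal sketch of a drop-in override.  The unit name 
(haraka.service), the file path, and the specific values are assumptions for 
illustration, not necessarily what is configured on the list host.

```ini
# Hypothetical path: /etc/systemd/system/haraka.service.d/restart.conf
# (assumes the service unit is named haraka.service)
[Service]
# Restart the service when it exits uncleanly (non-zero exit code, signal, timeout).
Restart=on-failure
# Wait 5 seconds between restart attempts (illustrative value).
RestartSec=5
```

After adding a drop-in like this, `systemctl daemon-reload` makes systemd pick 
it up.  Restart=always is the broader alternative if the service should also 
come back after a clean exit.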

As I’ve said in the past, I’m happy to continue hosting the lists, but you may 
get better uptime from an alternate service.

In particular, I see that Janet Cobb is now trying to host eir own Mailman 
3-based list.  If that experiment goes well then perhaps the existing lists can 
be migrated over; it would be nice to have continuity of archives.  
Incidentally, I experimented with migrating to Mailman 3 all the way back in 
2017, but people were asking for some customizations to be made [2] and I 
didn’t have the energy to keep working on it, so I just abandoned it.  Maybe 
others will have more energy. :)

- omd

[1] https://github.com/haraka/haraka-tld/commit/c507a750c87dcc8bc771f864bbeafc8cb7d8b0f8
[2] https://www.mail-archive.com/agora-discussion@agoranomic.org/msg36383.html
