> In my experience there can be failures affecting a single host or a
> single cron job where no jobs run at all, or no emails from that host
> get delivered.  In the absence of monitoring, such failures are not
> noticed until they cause some other symptom.
>
> We want that symptom to be "we get mail from cron job A that cron job
> B has stopped working".  This pattern is one I have used elsewhere and
> I have indeed had some cron mails.
>
> Or to put it another way, each monitoring cron job comes with an
> inherent risk of ineffectiveness because "it didn't run" and "it
> failed but the report went into an oubliette" are indistinguishable
> from "it's working".  A design with only one *monitoring* cron job,
> and a bunch of separate function jobs that actually do something,
> reduces that risk and makes mitigations for it (manual testing) much
> easier.

Okay.

> Secondly, "copy this key out to the central place" is a much simpler
> thing to deploy and will almost never need to be updated.  If we put
> the more complex check-it's-all-ok logic everywhere then we have to
> keep it up to date everywhere.

This makes me think that there is in fact already a system like this:
the thing that copies our public key from tag2upload-manager-01 to
ftp-master.  Seems like you could talk to DSA about just reusing that to
copy the key around for our own purposes.

-- 
Sean Whitton

