Package: dgit-infrastructure
Version: 13.13
Severity: important
Recently, salsa has been intermittently returning HTTP 500 errors.
This has caused at least 3 job failures. Even if we make the Manager
try to email the tag pusher, acquiring the email address is itself a
salsa API query, so it can fail in the same way.
I think we need to do retries. This means:
* The Manager should be able to take some failed jobs and retry them
  later. There needs to be some logic to guess whether the failure
  is forge-wide or job-specific.

* Attempts to fetch using the git protocol need to distinguish "repo
  or tag doesn't exist" from "we weren't able to access the repo".
  Because we're always using https: URLs I think this can be done as
  follows:
    1. Try curl; if that gives 404, declare it irrecoverable.
    2. Run a suitable git-ls-remote; if the command succeeds,
       but wanted refs don't exist, declare it irrecoverable.
    3. Do actual git fetch.
  In steps 2 and 3, treat command failure as retriable.

* The o2m protocol needs to distinguish retriable from irrecoverable,
  somehow. We should probably have a point of no return, which
  starts when we call dgit push-source. If the job fails *before*
  the point of no return, the email should be sent only if this is
  the last attempt. So the protocol needs to communicate that too.

* The Manager needs to keep more records, since now there may be
  multiple attempts. They should be kept *somewhere*.
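
To make the fetch classification and the email rule concrete, here is a
rough sketch in Python. It is for illustration only, not a proposal for
the actual implementation. All the names (Outcome, classify_fetch,
refs_in_ls_remote, should_email_on_failure and the three command
helpers) are made up for this sketch. Which URL the curl probe should
hit is left to the caller, and "always email after the point of no
return" is my assumption, since nothing above says what happens then.

```python
import subprocess
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    RETRIABLE = "retriable"
    IRRECOVERABLE = "irrecoverable"


def refs_in_ls_remote(output):
    # `git ls-remote` prints one "<object id> TAB <ref name>" line per ref.
    refs = set()
    for line in output.splitlines():
        parts = line.split("\t", 1)
        if len(parts) == 2:
            refs.add(parts[1].strip())
    return refs


def classify_fetch(probe, ls_remote, fetch, wanted_refs):
    # probe():     returns the HTTP status of the curl request (step 1)
    # ls_remote(): returns (exit status, stdout) of git ls-remote (step 2)
    # fetch():     returns the exit status of git fetch (step 3)
    if probe() == 404:
        return Outcome.IRRECOVERABLE      # repo does not exist
    rc, out = ls_remote()
    if rc != 0:
        return Outcome.RETRIABLE          # could not talk to the forge
    if not set(wanted_refs) <= refs_in_ls_remote(out):
        return Outcome.IRRECOVERABLE      # repo is there, wanted ref is not
    if fetch() != 0:
        return Outcome.RETRIABLE
    return Outcome.OK


def should_email_on_failure(past_point_of_no_return, is_last_attempt):
    # Before the point of no return, stay quiet unless no retry will follow.
    # Emailing unconditionally afterwards is an assumption of this sketch.
    return past_point_of_no_return or is_last_attempt


# Possible command helpers, to be wrapped in lambdas for classify_fetch.
def curl_status(url):
    res = subprocess.run(
        ["curl", "-s", "-o", "/dev/null", "-w", "%{http_code}", url],
        capture_output=True, text=True)
    return int(res.stdout or 0)           # curl reports 000 if no response


def git_ls_remote(url, refs):
    res = subprocess.run(["git", "ls-remote", url, *refs],
                         capture_output=True, text=True)
    return res.returncode, res.stdout


def git_fetch(repo_dir, url, refs):
    return subprocess.run(["git", "-C", repo_dir, "fetch", url, *refs]).returncode
```

Note that a 500 from the curl probe is deliberately not classified in
step 1: only 404 is treated as definitive there, and a forge that is
misbehaving will then show up as a command failure in step 2 or 3.
Without --exit-code, git ls-remote exits zero even when no ref matches,
which is what lets step 2 tell "ref missing" apart from "forge
unreachable".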
I think we need to call this a blocker for end of beta. The current
UX when salsa is misbehaving is poor, and it needs to be good.
Ian.
--
Ian Jackson <[email protected]> These opinions are my own.
Pronouns: they/he. If I emailed you from @fyvzl.net or @evade.org.uk,
that is a private address which bypasses my fierce spamfilter.