Package: dgit-infrastructure
Version: 13.13
Severity: important

Recently, salsa has been intermittently returning HTTP 500 errors.  This
has caused at least 3 job failures.  Even if we make the Manager try to
email the tag pusher, acquiring the email address is itself a salsa API
query, so it can fail in the same way.

I think we need to do retries.  This means:

 * The Manager should be able to take some failed jobs and retry them
   later.  There needs to be some logic to guess whether the failure
   is forge-wide or job-specific.

 * Attempts to fetch using the git protocol need to distinguish "repo
   or tag doesn't exist" from "we weren't able to access the repo".
   Because we're always using https: URLs I think this can be done as
   follows:
     1. Try curl; if that gives 404, declare it irrecoverable.
     2. Run a suitable git-ls-remote; if the command succeeds,
        but wanted refs don't exist, declare it irrecoverable.
     3. Do actual git fetch.
   In steps 2 and 3, treat command failure as retriable.

 * The o2m protocol needs to distinguish retriable from irrecoverable,
   somehow.  We should probably have a point of no return, which
   starts when we call dgit push-source.  If the job fails *before*
   the point of no return, the email should be sent only if this is
   the last attempt.  So the protocol needs to communicate that too.

 * The Manager needs to keep more records, since now there may be
   multiple attempts.  They should be kept *somewhere*.
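
To make the fetch classification and the email rule concrete, here is a
rough sketch.  It is in Python purely for illustration and is not a
proposal for the actual implementation.  The names Outcome,
classify_fetch, should_email and run_command are all made up, and the
rule for failures *after* the point of no return (always email) is my
assumption; the text above only specifies the "before" case.

```python
import enum
import subprocess


class Outcome(enum.Enum):
    OK = "ok"
    RETRIABLE = "retriable"
    IRRECOVERABLE = "irrecoverable"


def run_command(argv):
    """Run a command, capturing its output as text (never raises on failure)."""
    return subprocess.run(argv, capture_output=True, text=True)


def classify_fetch(url, wanted_refs, repo_dir, run=run_command):
    """Hypothetical three-step check for an https: git URL.

    `run` is injectable so the logic can be exercised without a network.
    """
    # Step 1: plain HTTP probe.  Only a definite 404 is irrecoverable;
    # any other result (including curl itself failing) falls through.
    probe = run(["curl", "--silent", "--output", "/dev/null",
                 "--write-out", "%{http_code}", url])
    if probe.returncode == 0 and probe.stdout.strip() == "404":
        return Outcome.IRRECOVERABLE

    # Step 2: list the remote refs.  Command failure is retriable;
    # success with a wanted ref missing is irrecoverable.
    ls = run(["git", "ls-remote", url, *wanted_refs])
    if ls.returncode != 0:
        return Outcome.RETRIABLE
    found = {line.split("\t", 1)[1]
             for line in ls.stdout.splitlines() if "\t" in line}
    if not set(wanted_refs) <= found:
        return Outcome.IRRECOVERABLE

    # Step 3: the actual fetch.  Command failure is retriable.
    fetch = run(["git", "-C", repo_dir, "fetch", url, *wanted_refs])
    return Outcome.OK if fetch.returncode == 0 else Outcome.RETRIABLE


def should_email(before_point_of_no_return, is_last_attempt):
    """Before the point of no return, email only on the last attempt.

    ASSUMPTION: after the point of no return we always email.
    """
    if before_point_of_no_return:
        return is_last_attempt
    return True
```

The Manager would presumably treat an IRRECOVERABLE result as the last
attempt, so the email goes out straight away in that case.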

I think we need to call this a blocker for end of beta.  The current
UX when salsa is doing badly is poor, and we need the UX to be good.

Ian.

-- 
Ian Jackson <[email protected]>   These opinions are my own.  

Pronouns: they/he.  If I emailed you from @fyvzl.net or @evade.org.uk,
that is a private address which bypasses my fierce spamfilter.
