Johan Corveleyn wrote on Fri, 22 May 2020 10:02 +0200:
> On Fri, May 22, 2020 at 12:16 AM Daniel Shahaf <d...@daniel.shahaf.name> 
> wrote:
> >
> > Johan Corveleyn wrote on Thu, 21 May 2020 17:53 +0200:  
> > > Trying to summarize this thread a bit. I apologize in advance if I
> > > forgot something, or have misrepresented any of the points that were
> > > raised (feel free to correct / add).
> > >  
> >
> > I have additions, below.
> >  
> > > Denis summed up the following problems that might happen while
> > > 'verify' locks the repcache.db:
> > >  
> > > > 1. a post-commit error "database is locked"
> > > > 2. new representations will not be added in the rep-cache.db
> > > > 3. deduplication does not work for new data committed at this time
> > > > 4. commits work with delays.  
> > >
> > > We have also established that the new tool build-repcache is not
> > > suitable for post-factum fixing of 3). It does not reprocess already
> > > committed revisions.  
> >
> > Note, however, that (3) is simply an observation that Denis made about
> > the semantics of the incumbent code.  We do not know of a user who needs
> > rep-cache.db entries to be added so soon after the commit that running
> > «build-repcache» after «verify» finishes would be too late (or impractical).
> >  
> > > We are currently considering two approaches to address these issues:
> > >  
> >
> > Hold on.  Those issues are what happens *whenever* the post-commmit FS
> > processing INSERTs are starved.  They happen when a «verify» starves
> > INSERTs, but they are known to have at least one other case: when two
> > commits happen in quick succession, the INSERTs that happen during
> > post-commit FS processing of the first commit can starve the second
> > commit's INSERTs.  (Don't we have a jira issue for this variant?  I
> > thought we did, but my searches came up dry.)
> >
> > Which is to say, the presentation of the matter as "four issues to be
> > fixed" isn't accurate.  There are _eight_ issues to consider: (1) to (4)
> > when they are caused by a «verify» starving INSERTs, and (1) to (4) when
> > they are caused by an «INSERT» starving INSERTs.
> >
> > Thus, your analysis of the pros and cons is incomplete: it glosses over
> > the case of an «INSERT» starving other INSERTs.  
> 
> Good point. So those "4 problems" are all consequences of "rep-cache
> insertion starvation". One possible cause of this might be 'verify',
> but there are other known causes, such as 'locked by another commit'.
> I'm not sure how likely that last form of starvation is in practice,
> though (but if you've seen a jira issue of it, hmmm, maybe).
> 

It would happen on /repos/asf every time certain PMCs commit a new
release's javadocs.  (I'm not sure whether it still does; those PMCs
may have switched to git since I last looked.)

> > Furthermore, over the course of the thread several other ideas have been
> > floated, intended to address various subsets of the eight issues.  The
> > "Perform INSERTs asynchronously" idea, for example (raised by me, but
> > inspired by a remark of Brane's), can basically fix all eight issues by
> > itself.  There was also the idea to stop marshalling post-commit FS
> > processing errors to the client, which would fix both variants of (1)
> > but none of the other six.  I think there were other ideas as well, but
> > I've run out of time for this thread for tonight, sorry.  
> 
> Ack.
> So IIUC these are addressing the consequences of the starvation (which
> might help regardless of the cause), rather than taking away the
> locking-causes.
> 

Some and some.

- The OP is preventative for the «verify» cause and a no-op for other
  causes.

- Sharding rep-cache is preventative: it effectively makes the critical
  section shorter in duration, for all existing use-cases.
  
- Performing INSERTs asynchronously (and polylithically, i.e., in
  N O(1)-sized transactions) is preventative for the cases of INSERTs
  starving either other INSERTs or readers, and a no-op for the case of
  readers starving INSERTs.
  
  (Speaking of which, here's another thought: add an option flag to
  build-repcache that makes it wait indefinitely in case it's locked out
  of rep-cache.db by a concurrent process.  Denis, WDYT?)
  
- Not marshalling post-commit FS processing errors to the client is
  purely a failure mode handling thing.
  
- ⋮

> I think both 'angles' are quite useful. We might even want to combine
> ... mitigating the consequences, as well as avoiding some of the
> causes for the starvation (because even with consequence-mitigation
> things might not turn out ideal, or only be mitigated at the cost of
> X, so fixing some known causes can still be valuable).
> 

Yeah, for example, the "don't marshal" thing probably makes sense
regardless of anything else.  There's a reason GitHub doesn't page its
users when one of the disks in its RAID arrays needs replacement…

Cheers,

Daniel

Reply via email to