Johan Corveleyn wrote on Fri, 22 May 2020 10:02 +0200: > On Fri, May 22, 2020 at 12:16 AM Daniel Shahaf <d...@daniel.shahaf.name> > wrote: > > > > Johan Corveleyn wrote on Thu, 21 May 2020 17:53 +0200: > > > Trying to summarize this thread a bit. I apologize in advance if I > > > forgot something, or have misrepresented any of the points that were > > > raised (feel free to correct / add). > > > > > > > I have additions, below. > > > > > Denis summed up the following problems that might happen while > > > 'verify' locks the repcache.db: > > > > > > > 1. a post-commit error "database is locked" > > > > 2. new representations will not be added in the rep-cache.db > > > > 3. deduplication does not work for new data committed at this time > > > > 4. commits work with delays. > > > > > > We have also established that the new tool build-repcache is not > > > suitable for post-factum fixing of 3). It does not reprocess already > > > committed revisions. > > > > Note, however, that (3) is simply an observation that Denis made about > > the semantics of the incumbent code. We do not know of a user who needs > > rep-cache.db entries to be added so soon after the commit that running > > «build-repcache» after «verify» finishes would be too late (or impractical). > > > > > We are currently considering two approaches to address these issues: > > > > > > > Hold on. Those issues are what happens *whenever* the post-commmit FS > > processing INSERTs are starved. They happen when a «verify» starves > > INSERTs, but they are known to have at least one other case: when two > > commits happen in quick succession, the INSERTs that happen during > > post-commit FS processing of the first commit can starve the second > > commit's INSERTs. (Don't we have a jira issue for this variant? I > > thought we did, but my searches came up dry.) > > > > Which is to say, the presentation of the matter as "four issues to be > > fixed" isn't accurate. There are _eight_ issues to consider: (1) to (4) > > when they are caused by a «verify» starving INSERTs, and (1) to (4) when > > they are caused by an «INSERT» starving INSERTs. > > > > Thus, your analysis of the pros and cons is incomplete: it glosses over > > the case of an «INSERT» starving other INSERTs. > > Good point. So those "4 problems" are all consequences of "rep-cache > insertion starvation". One possible cause of this might be 'verify', > but there are other known causes, such as 'locked by another commit'. > I'm not sure how likely that last form of starvation is in practice, > though (but if you've seen a jira issue of it, hmmm, maybe). >
It would happen on /repos/asf every time certain PMCs commit a new release's javadocs. (I'm not sure whether it still does; those PMCs may have switched to git since I last looked.) > > Furthermore, over the course of the thread several other ideas have been > > floated, intended to address various subsets of the eight issues. The > > "Perform INSERTs asynchronously" idea, for example (raised by me, but > > inspired by a remark of Brane's), can basically fix all eight issues by > > itself. There was also the idea to stop marshalling post-commit FS > > processing errors to the client, which would fix both variants of (1) > > but none of the other six. I think there were other ideas as well, but > > I've run out of time for this thread for tonight, sorry. > > Ack. > So IIUC these are addressing the consequences of the starvation (which > might help regardless of the cause), rather than taking away the > locking-causes. > Some and some. - The OP is preventative for the «verify» cause and a no-op for other causes. - Sharding rep-cache is preventative: it effectively makes the critical section shorter in duration, for all existing use-cases. - Performing INSERTs asynchronously (and polylithically, i.e., in N O(1)-sized transactions) is preventative for the cases of INSERTs starving either other INSERTs or readers, and a no-op for the case of readers starving INSERTs. (Speaking of which, here's another thought: add an option flag to build-repcache that makes it wait indefinitely in case it's locked out of rep-cache.db by a concurrent process. Denis, WDYT?) - Not marshalling post-commit FS processing errors to the client is purely a failure mode handling thing. - ⋮ > I think both 'angles' are quite useful. We might even want to combine > ... mitigating the consequences, as well as avoiding some of the > causes for the starvation (because even with consequence-mitigation > things might not turn out ideal, or only be mitigated at the cost of > X, so fixing some known causes can still be valuable). > Yeah, for example, the "don't marshal" thing probably makes sense regardless of anything else. There's a reason GitHub doesn't page its users when one of the disks in its RAID arrays needs replacement… Cheers, Daniel