On Mon, Dec 14, 2020 at 11:28 AM Amul Sul <sula...@gmail.com> wrote:
>
> On Thu, Dec 10, 2020 at 6:04 AM Andres Freund <and...@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-12-09 16:13:06 -0500, Robert Haas wrote:
> > > That's not good. On a typical busy system, a system is going to be in
> > > the middle of a checkpoint most of the time, and the checkpoint will
> > > take a long time to finish - maybe minutes.
> >
> > Or hours, even. Due to the cost of FPWs it can make a lot of sense to
> > reduce the frequency of that cost...
> >
> >
> > > We want this feature to respond within milliseconds or a few seconds,
> > > not minutes. So we need something better here.
> >
> > Indeed.
> >
> >
> > > I'm inclined to think
> > > that we should try to CompleteWALProhibitChange() at the same places
> > > we AbsorbSyncRequests(). We know from experience that bad things
> > > happen if we fail to absorb sync requests in a timely fashion, so we
> > > probably have enough calls to AbsorbSyncRequests() to make sure that
> > > we always do that work in a timely fashion. So, if we do this work in
> > > the same place, then it will also be done in a timely fashion.
> >
> > Sounds sane, without having looked in detail.
> >
>
> Understood & agreed that we need to change the system state as soon as
> possible.
>
> I can see AbsorbSyncRequests() is called from four routines:
> CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and
> CheckpointerMain(). Of these, the first three execute with interrupts on
> hold. That is a problem when we emit a barrier and wait for it to be
> absorbed by all processes, including the emitting process itself: the wait
> never ends. I think that can be fixed by teaching
> WaitForProcSignalBarrier() not to wait on itself to absorb the barrier. Let
> that get absorbed at a later point in time, when interrupts are resumed.
> I assumed that we cannot do barrier processing right away, since there
> could be other barriers (maybe in the future), including ours, that should
> not be processed while interrupts are on hold.
>

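The "do not wait on self" idea can be illustrated with a small toy model. This
is only a sketch under stated assumptions: ToySlot, toy_absorb() and
toy_wait_for_barrier() are hypothetical names invented here, not the actual
ProcSignal code, and a bounded poll count stands in for the infinite wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_SLOTS 4				/* hypothetical number of processes */

/* Toy stand-in for per-process barrier state; not PostgreSQL's real struct. */
typedef struct ToySlot
{
	uint64_t	absorbed_generation;	/* last barrier generation absorbed */
	bool		interrupts_held;		/* true while barriers cannot be absorbed */
} ToySlot;

static ToySlot slots[NUM_SLOTS];

/* Absorb the barrier in one slot, unless that slot has interrupts held off. */
static void
toy_absorb(int slot, uint64_t generation)
{
	assert(slot >= 0 && slot < NUM_SLOTS);
	if (!slots[slot].interrupts_held &&
		slots[slot].absorbed_generation < generation)
		slots[slot].absorbed_generation = generation;
}

/*
 * Poll until every slot has absorbed 'generation'.  With skip_self set, the
 * caller's own slot is not waited on.  Returns false when max_polls run out,
 * which models a wait that would otherwise never finish.
 */
static bool
toy_wait_for_barrier(int self, uint64_t generation, bool skip_self,
					 int max_polls)
{
	for (int poll = 0; poll < max_polls; poll++)
	{
		bool		all_done = true;

		for (int i = 0; i < NUM_SLOTS; i++)
		{
			toy_absorb(i, generation);	/* simulate each process's progress */
			if (skip_self && i == self)
				continue;		/* our own slot catches up later */
			if (slots[i].absorbed_generation < generation)
				all_done = false;
		}
		if (all_done)
			return true;
	}
	return false;
}
```

In this model a caller that has interrupts held never sees its own slot catch
up, so the wait only completes when skip_self is set; its own slot absorbs the
barrier later, once interrupts are resumed and toy_absorb() runs again.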
CreateCheckPoint() acquires the CheckpointLock LWLock at the start and releases
it at the end, which keeps interrupts on hold for the whole checkpoint. It is
kind of surprising that we hold this lock, and therefore hold off interrupts,
for such a long time. We need CheckpointLock only to ensure that one checkpoint
happens at a time. Can't we do something simple to ensure that instead of
taking the lock? Holding off interrupts for so long doesn't seem like a good
idea.

Thoughts/Suggestions?

Regards,
Amul