On Thu, Sep 22, 2022 at 05:41:30PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (pet...@redhat.com) wrote:
> > > > When starting ram saving procedure (especially at the completion phase),
> > > > always set last_seen_block to non-NULL to make sure we can always
> > > > correctly detect the case where "we've migrated all the dirty pages".
> > > >
> > > > Then we'll guarantee both last_seen_block and pss.block will be valid
> > > > always before the loop starts.
> > > >
> > > > See the comment in the code for some details.
> > > >
> > > > Signed-off-by: Peter Xu <pet...@redhat.com>
> > >
> > > Yeh I guess it can currently only happen during restart?
> >
> > There're only two places to clear last_seen_block:
> >
> > ram_state_reset[2683]                     rs->last_seen_block = NULL;
> > ram_postcopy_send_discard_bitmap[2876]    rs->last_seen_block = NULL;
> >
> > Where for the reset case:
> >
> > ram_state_init[2994]             ram_state_reset(*rsp);
> > ram_state_resume_prepare[3110]   ram_state_reset(rs);
> > ram_save_iterate[3271]           ram_state_reset(rs);
> >
> > So I think it can at least happen in two places, either (1) postcopy just
> > started (assume when postcopy starts accidentally when all dirty pages were
> > migrated?), or (2) postcopy recover from failure.
>
> Oh, (1) is a more general problem then; yeh.
>
> > In my case I triggered this deadloop when I was debugging the other bug
> > fixed by the next patch where it was postcopy recovery (on tls), but only
> > once.. So currently I'm still not 100% sure whether this is the same
> > problem, but logically it could trigger.
> >
> > I also remember I used to hit very rare deadloops before too, maybe they're
> > the same thing because I did test recovery a lot.
>
> Note; 'deadlock' not 'deadloop'.

(Oops I somehow forgot there's still this series pending..)

Here it's not about a lock, or maybe I should add a space ("dead loop")?

--
Peter Xu