* Peter Xu (pet...@redhat.com) wrote: > On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote: > > * Peter Xu (pet...@redhat.com) wrote: > > > When starting ram saving procedure (especially at the completion phase), > > > always set last_seen_block to non-NULL to make sure we can always > > > correctly > > > detect the case where "we've migrated all the dirty pages". > > > > > > Then we'll guarantee both last_seen_block and pss.block will be valid > > > always before the loop starts. > > > > > > See the comment in the code for some details. > > > > > > Signed-off-by: Peter Xu <pet...@redhat.com> > > > > Yeh I guess it can currently only happen during restart? > > There're only two places to clear last_seen_block: > > ram_state_reset[2683] rs->last_seen_block = NULL; > ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL; > > Where for the reset case: > > ram_state_init[2994] ram_state_reset(*rsp); > ram_state_resume_prepare[3110] ram_state_reset(rs); > ram_save_iterate[3271] ram_state_reset(rs); > > So I think it can at least happen in two places, either (1) postcopy just > started (assume when postcopy starts accidentally when all dirty pages were > migrated?), or (2) postcopy recover from failure.
Oh, (1) is a more general problem then; yeh. > In my case I triggered this deadloop when I was debugging the other bug > fixed by the next patch where it was postcopy recovery (on tls), but only > once.. So currently I'm still not 100% sure whether this is the same > problem, but logically it could trigger. > > I also remember I used to hit very rare deadloops before too, maybe they're > the same thing because I did test recovery a lot. Note; 'deadlock' not 'deadloop'. Dave > > > > Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com> > > Thanks! > > -- > Peter Xu > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK