On Thu, Sep 22, 2022 at 05:41:30PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (pet...@redhat.com) wrote:
> > > > When starting ram saving procedure (especially at the completion phase),
> > > > always set last_seen_block to non-NULL to make sure we can always
> > > > correctly detect the case where "we've migrated all the dirty pages".
> > > > 
> > > > Then we'll guarantee both last_seen_block and pss.block will be valid
> > > > always before the loop starts.
> > > > 
> > > > See the comment in the code for some details.
> > > > 
> > > > Signed-off-by: Peter Xu <pet...@redhat.com>
> > > 
> > > Yeh I guess it can currently only happen during restart?
> > 
> > There're only two places to clear last_seen_block:
> > 
> > ram_state_reset[2683]          rs->last_seen_block = NULL;
> > ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL;
> > 
> > Where for the reset case:
> > 
> > ram_state_init[2994]           ram_state_reset(*rsp);
> > ram_state_resume_prepare[3110] ram_state_reset(rs);
> > ram_save_iterate[3271]         ram_state_reset(rs);
> > 
> > So I think it can happen in at least two places: either (1) postcopy has
> > just started (assuming postcopy happens to start when all dirty pages have
> > already been migrated?), or (2) postcopy is recovering from a failure.
> 
> Oh, (1) is a more general problem then; yeh.
> 
> > In my case I triggered this deadloop when I was debugging the other bug
> > fixed by the next patch where it was postcopy recovery (on tls), but only
> > once..  So currently I'm still not 100% sure whether this is the same
> > problem, but logically it could trigger.
> > 
> > I also remember I used to hit very rare deadloops before too, maybe they're
> > the same thing because I did test recovery a lot.
> 
> Note; 'deadlock' not 'deadloop'.

(Oops I somehow forgot there's still this series pending..)

Here it's not about a lock, or maybe I should add a space ("dead loop")?
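
To show why I'd call it a loop rather than a lock, here is a simplified
standalone model of the scan. It is not the actual migration code: Block,
scan_steps() and the max_steps cap are made up for this sketch, and the
only assumption carried over is that the scan stops once it has wrapped
around and come back to last_seen.

```c
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 3

/* Hypothetical stand-in for a RAM block: just an id and a dirty flag. */
typedef struct Block {
    int id;
    bool dirty;
} Block;

/*
 * Walk the blocks looking for a dirty one, starting from last_seen (or
 * from the first block when last_seen is NULL).
 *
 * Returns the number of steps taken before either finding a dirty block
 * or concluding that everything is clean, or -1 if max_steps ran out.
 * The cap exists only so that this sketch terminates.
 */
static int scan_steps(Block *blocks, Block *last_seen, int max_steps)
{
    Block *cur = last_seen ? last_seen : &blocks[0];
    bool complete_round = false;

    for (int step = 0; step < max_steps; step++) {
        /* "All clean" is only detected by coming back to last_seen. */
        if (complete_round && cur == last_seen) {
            return step;
        }
        if (cur->dirty) {
            return step;
        }
        /* Advance, wrapping around at the end of the list. */
        if (cur == &blocks[NBLOCKS - 1]) {
            cur = &blocks[0];
            complete_round = true;
        } else {
            cur++;
        }
    }
    return -1; /* never terminated on its own: the "dead loop" */
}
```

With all blocks clean and last_seen == NULL, cur can never equal last_seen,
so the scan keeps wrapping until the cap is hit. No lock is involved, the
thread just spins. With last_seen pointing at a real block, the same scan
stops after one full round.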

-- 
Peter Xu

