On Tue, Feb 7, 2017 at 11:57 AM, Razvan Cojocaru <rcojoc...@bitdefender.com>
wrote:

> On 02/07/2017 08:39 PM, Andrew Cooper wrote:
> > On 07/02/17 18:31, Razvan Cojocaru wrote:
> >> On 02/07/2017 08:15 PM, Tamas K Lengyel wrote:
> >>>
> >>> On Tue, Feb 7, 2017 at 9:53 AM, Razvan Cojocaru
> >>> <rcojoc...@bitdefender.com> wrote:
> >>>
> >>>     Hello,
> >>>
> >>>     Setting, e.g., 16 VCPUs for an HVM guest ends up blocking the guest
> >>>     completely when subscribing to vm_events, apparently because of this
> >>>     code in xen/common/vm_event.c:
> >>>
> >>>     315     /* Give this vCPU a black eye if necessary, on the way out.
> >>>     316      * See the comments above wake_blocked() for more
> information
> >>>     317      * on how this mechanism works to avoid waiting. */
> >>>     318     avail_req = vm_event_ring_available(ved);
> >>>     319     if( current->domain == d && avail_req < d->max_vcpus )
> >>>     320         vm_event_mark_and_pause(current, ved);
> >>>
> >>>     It would appear that even if the guest only has 2 online VCPUs, the
> >>>     "avail_req < d->max_vcpus" condition will pause current, and we
> >>>     eventually end up with all the VCPUs paused.
> >>>
> >>>     An ugly hack ("avail_req < 2") has allowed booting a guest with many
> >>>     VCPUs (max_vcpus set high, with the guest only bringing 2 VCPUs
> >>>     online), however that's just to prove that this was the culprit - a
> >>>     real solution needs a more in-depth understanding of the issue. That's
> >>>     basically very old code (pre-2012 at least) that got moved around into
> >>>     the current shape of Xen today - please CC anyone relevant to the
> >>>     discussion that you're aware of.
> >>>
> >>>     Thoughts?
> >>>
> >>>
> >>> I think this is a side-effect of the growth of the vm_event structure
> >>> and the fact that we have a single-page ring. The check effectively sets
> >>> a threshold of having enough space for each vCPU to place at least one
> >>> more event on the ring, and if that's not the case it gets paused. OTOH
> >>> I think this would only have an effect on asynchronous events, as for
> >>> all other events the vCPU is already paused. Is that the case you have?
> >> No, on the contrary, all my events are synchronous (the VCPU is paused
> >> waiting for the vm_event reply).
> >>
> >> I've debugged this a bit, and the problem seems to be that
> >> vm_event_wake_blocked() breaks here:
> >>
> >> 150     /* We remember which vcpu last woke up to avoid scanning always linearly
> >> 151      * from zero and starving higher-numbered vcpus under high load */
> >> 152     if ( d->vcpu )
> >> 153     {
> >> 154         int i, j, k;
> >> 155
> >> 156         for (i = ved->last_vcpu_wake_up + 1, j = 0; j < d->max_vcpus; i++, j++)
> >> 157         {
> >> 158             k = i % d->max_vcpus;
> >> 159             v = d->vcpu[k];
> >> 160             if ( !v )
> >> 161                 continue;
> >> 162
> >> 163             if ( !(ved->blocked) || online >= avail_req )
> >> 164                break;
> >> 165
> >> 166             if ( test_and_clear_bit(ved->pause_flag, &v->pause_flags) )
> >> 167             {
> >> 168                 vcpu_unpause(v);
> >> 169                 online++;
> >> 170                 ved->blocked--;
> >> 171                 ved->last_vcpu_wake_up = k;
> >> 172             }
> >> 173         }
> >> 174     }
> >>
> >> at "if ( !(ved->blocked) || online >= avail_req )". At this point,
> >> nothing ever gets unblocked. It's hard to believe that this is desired
> >> behaviour, as I don't know what could possibly happen for that condition
> >> to become false once all the online VCPUs are stuck (especially when the
> >> guest has just started booting).
>

Ah, I see what happens. During boot, vCPU 0 generates an event and gets
marked blocked because the number of vCPUs is so high. The other vCPUs are
still unblocked since they are idle, but this test here will still be true
(online >= avail_req), and thus we can never unblock vCPU 0. The boot
process then hangs because vCPU 0 never resumes. I would argue that this
test should be changed to check that there is at least one free slot on the
ring, and only break once that is no longer the case (i.e. instead of
incrementing online we should be decrementing avail_req).

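For illustration, here's a rough, untested sketch of what I mean, against
the loop quoted above (same variable names; this is just a sketch of the
idea, not a proposed patch):

    for ( i = ved->last_vcpu_wake_up + 1, j = 0; j < d->max_vcpus; i++, j++ )
    {
        k = i % d->max_vcpus;
        v = d->vcpu[k];
        if ( !v )
            continue;

        /* Stop only when nothing is blocked or the ring has no free slot
         * left, rather than comparing the number of woken vCPUs against
         * avail_req. */
        if ( !(ved->blocked) || avail_req == 0 )
            break;

        if ( test_and_clear_bit(ved->pause_flag, &v->pause_flags) )
        {
            vcpu_unpause(v);
            avail_req--;    /* one free slot now accounted to this vCPU */
            ved->blocked--;
            ved->last_vcpu_wake_up = k;
        }
    }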

> >
> > I wouldn't bet that this logic has ever been tested.  If you recall, the
> > addition of register state into the vm_event ring made each entry far
> > larger, which in turn makes it more likely to hit this condition.
> >
> > However, simply fixing the logic to re-online the cpus isn't a good
> > solution either, as having $N vcpus paused at any one time because of
> > ring contention is not conducive to good system performance.
> >
> > Realistically, the ring size needs to be max_cpus * sizeof(largest
> > vm_event) at an absolute minimum, and I guess this is now beyond 1 page?
>
> Yes, of course the reason this triggers earlier now is the growth of the
> request's size. Yes, using e.g. 20 VCPUs in the guest's setup will
> exceed the number of request slots that fit in a single page.
>
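
As a back-of-the-envelope for the "beyond 1 page" question - with assumed,
not measured, numbers (both the ring header size and
sizeof(vm_event_request_t) below are guesses on my part) - a standard
one-page ring rounds its slot count down to a power of two after
subtracting the header, so roughly:

    #include <stdio.h>

    int main(void)
    {
        unsigned int page = 4096;   /* one page */
        unsigned int header = 64;   /* assumed sring header size */
        unsigned int req = 256;     /* assumed sizeof(vm_event_request_t) */
        unsigned int raw = (page - header) / req;
        unsigned int slots = 1;

        /* Round down to a power of two, as the ring macros do. */
        while ( slots * 2 <= raw )
            slots *= 2;

        printf("~%u usable slots per one-page ring\n", slots);
        return 0;
    }

which prints ~8 with those guesses - well below a max_vcpus of 16 or 20, so
the "avail_req < d->max_vcpus" check trips on the very first event.
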
> And yes, ideally we should have multi-page ring buffers - however that
> is a long-term project that requires design changes in other parts of
> Xen as well (Andrew, CC'd here, was recently talking about one).
>
> However, even with a one-page ring buffer, surely it's not good to end
> up in this situation, especially for guests such as mine, which never
> actually bring more than 2 VCPUs online. But even if they were to use
> more, blocking the guest on vm_event init is completely pointless - we
> might as well return some kind of error if max_vcpus > available slots.
>
> I don't follow the system performance argument. Surely completely
> blocking the guest is worse.
>

I also don't see the point in marking a vCPU blocked if it is already
paused. I think this behavior of blocking vCPUs only makes sense for
asynchronous events. Razvan, could you test what happens if
vm_event_mark_and_pause is only called if the vCPU is not already paused?
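
Something along these lines, as a rough and untested sketch only - in
particular, whether vm_event_pause_count is the right thing to look at for
"already paused for a (synchronous) vm_event" is an assumption on my part:

    /* In the code quoted at the start of this thread (the "black eye"
     * check): skip marking/pausing when the requesting vCPU is already
     * paused for a synchronous event. */
    avail_req = vm_event_ring_available(ved);
    if ( current->domain == d && avail_req < d->max_vcpus &&
         !atomic_read(&current->vm_event_pause_count) )
        vm_event_mark_and_pause(current, ved);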

Tamas
