On Tue, Feb 7, 2017 at 3:09 PM, Mihai Donțu <mihai.do...@gmail.com> wrote:

> On Tue, 7 Feb 2017 22:41:57 +0200 Razvan Cojocaru wrote:
> > On 02/07/2017 10:20 PM, Tamas K Lengyel wrote:
> > > On Tue, Feb 7, 2017 at 11:57 AM, Razvan Cojocaru wrote:
> > >     On 02/07/2017 08:39 PM, Andrew Cooper wrote:
> > >     > On 07/02/17 18:31, Razvan Cojocaru wrote:
> > >     >> On 02/07/2017 08:15 PM, Tamas K Lengyel wrote:
> > >     >>> On Tue, Feb 7, 2017 at 9:53 AM, Razvan Cojocaru wrote:
> > >     >>>     Hello,
> > >     >>>
> > >     >>>     Setting, e.g. 16 VCPUs for a HVM guest, ends up blocking the
> > >     >>>     guest completely when subscribing to vm_events, apparently
> > >     >>>     because of this code in xen/common/vm_event.c:
> > >     >>>
> > >     >>>     315     /* Give this vCPU a black eye if necessary, on the way out.
> > >     >>>     316      * See the comments above wake_blocked() for more information
> > >     >>>     317      * on how this mechanism works to avoid waiting. */
> > >     >>>     318     avail_req = vm_event_ring_available(ved);
> > >     >>>     319     if( current->domain == d && avail_req < d->max_vcpus )
> > >     >>>     320         vm_event_mark_and_pause(current, ved);
> > >     >>>
> > >     >>>     It would appear that even if the guest only has 2 online
> > >     >>>     VCPUs, the "avail_req < d->max_vcpus" condition will pause
> > >     >>>     current, and we eventually end up with all the VCPUs paused.
> > >     >>>
> > >     >>>     An ugly hack ("avail_req < 2") has allowed booting a guest
> > >     >>>     with many VCPUs (max_vcpus, the guest only brings 2 VCPUs
> > >     >>>     online), however that's just to prove that that was the
> > >     >>>     culprit - a real solution to this needs more in-depth
> > >     >>>     understanding of the issue and potential solution. That's
> > >     >>>     basically very old code (pre-2012 at least) that got moved
> > >     >>>     around into the current shape of Xen today - please CC anyone
> > >     >>>     relevant to the discussion that you're aware of.
> > >     >>>
> > >     >>>     Thoughts?
> > >     >>>
> > >     >>>
> > >     >>> I think this is a side-effect of the growth of the vm_event
> > >     >>> structure and the fact that we have a single page ring. The check
> > >     >>> effectively sets a threshold of having enough space for each vCPU
> > >     >>> to place at least one more event on the ring, and if that's not
> > >     >>> the case it gets paused. OTOH I think this would only have an
> > >     >>> effect on asynchronous events, for all other events the vCPU is
> > >     >>> already paused. Is that the case you have?
> > >     >> No, on the contrary, all my events are synchronous (the VCPU is
> > >     >> paused waiting for the vm_event reply).
> > >     >>
> > >     >> I've debugged this a bit, and the problem seems to be that
> > >     >> vm_event_wake_blocked() breaks here:
> > >     >>
> > >     >> 150     /* We remember which vcpu last woke up to avoid scanning always linearly
> > >     >> 151      * from zero and starving higher-numbered vcpus under high load */
> > >     >> 152     if ( d->vcpu )
> > >     >> 153     {
> > >     >> 154         int i, j, k;
> > >     >> 155
> > >     >> 156         for (i = ved->last_vcpu_wake_up + 1, j = 0; j < d->max_vcpus; i++, j++)
> > >     >> 157         {
> > >     >> 158             k = i % d->max_vcpus;
> > >     >> 159             v = d->vcpu[k];
> > >     >> 160             if ( !v )
> > >     >> 161                 continue;
> > >     >> 162
> > >     >> 163             if ( !(ved->blocked) || online >= avail_req )
> > >     >> 164                break;
> > >     >> 165
> > >     >> 166             if ( test_and_clear_bit(ved->pause_flag, &v->pause_flags) )
> > >     >> 167             {
> > >     >> 168                 vcpu_unpause(v);
> > >     >> 169                 online++;
> > >     >> 170                 ved->blocked--;
> > >     >> 171                 ved->last_vcpu_wake_up = k;
> > >     >> 172             }
> > >     >> 173         }
> > >     >> 174     }
> > >     >>
> > >     >> at "if ( !(ved->blocked) || online >= avail_req )". At this point,
> > >     >> nothing ever gets unblocked. It's hard to believe that this is
> > >     >> desired behaviour, as I don't know what could possibly happen for
> > >     >> that condition to become false once all the online VCPUs are stuck
> > >     >> (especially when the guest has just started booting).
> > >
> > >
> > > Ah I see what happens. During boot vCPU 0 generates an event and gets
> > > marked blocked because the number of vCPUs is so high. The other vCPUs
> > > are still unblocked since they are idle, but this test here will still
> > > be true (online >= avail_req) and thus we can never unblock vCPU0. And
> > > then the boot process is hanging because vCPU0 never resumes. I would
> > > argue that this test should be changed to check that there is at least 1
> > > spot on the ring and only break if that is not the case anymore (ie.
> > > instead of incrementing online we should be decrementing avail_req).
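
(To make the quoted suggestion concrete, here is a rough and untested sketch of
how the exit test in vm_event_wake_blocked() could look, consuming avail_req
directly instead of counting woken vCPUs in "online":)

    /* Stop waking once nobody is blocked or no free ring slot remains. */
    if ( !(ved->blocked) || avail_req == 0 )
        break;

    if ( test_and_clear_bit(ved->pause_flag, &v->pause_flags) )
    {
        vcpu_unpause(v);
        avail_req--;          /* account for the slot this vCPU may now use */
        ved->blocked--;
        ved->last_vcpu_wake_up = k;
    }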
> >
> > That is exactly what happens. And it can't really be fixed just by
> > increasing the ring buffer (although that definitely helps a lot and
> > would be a smart move): no matter how large it is, we can always ask the
> > domain to use more VCPUs than there are slots in the buffer.
> >
> > >     >
> > >     > I wouldn't bet that this logic has ever been tested.  If you
> > >     > recall, the addition of register state into the vm_event ring made
> > >     > each entry far larger, which in turn makes it more likely to hit
> > >     > this condition.
> > >     >
> > >     > However, simply fixing the logic to re-online the cpus isn't a good
> > >     > solution either, as having $N vcpus paused at any one time because
> > >     > of ring contention is not conducive to good system performance.
> > >     >
> > >     > Realistically, the ring size needs to be max_cpus * sizeof(largest
> > >     > vm_event) at an absolute minimum, and I guess this is now beyond 1
> > >     > page?
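
(Back-of-the-envelope, with purely illustrative numbers: if each ring entry
were, say, 256 bytes after the register state was added, a single 4096-byte
ring page would fit at most 16 entries - fewer in practice, since the shared
ring header takes space and, if I remember right, the standard ring macros
round the entry count down to a power of two - so a guest configured with
max_vcpus >= 16 would trip "avail_req < d->max_vcpus" on every event it puts
on the ring.)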
> > >
> > >     Yes, of course the reason this triggers earlier now is the growth of
> > >     the request's size. Yes, using e.g. 20 VCPUs in the guest's setup
> > >     will exceed a page's number of slots.
> > >
> > >     And yes, ideally we should have multi-page ring buffers - however
> > >     that is a long-term project that requires design changes in other
> > >     parts of Xen as well (Andrew, CCd here, was recently talking about
> > >     one).
> > >
> > >     However, even with a one-page ring buffer, surely it's not good to
> > >     end up in this situation, especially for guests such as mine, which
> > >     never actually bring more than 2 VCPUs online. But even if they were
> > >     to use more, blocking the guest on vm_event init is completely
> > >     pointless - we might as well return some kind of error if max_vcpus >
> > >     available slots.
> > >
> > >     I don't follow the system performance argument. Surely completely
> > >     blocking the guest is worse.
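
(A sanity check along those lines could conceivably be done at
vm_event_enable() time, once the ring has been initialised - untested sketch,
and RING_SIZE()/front_ring are just my guess at the right pieces to use:)

    /* Refuse to enable monitoring if the one-page ring cannot hold at
     * least one request per potential vCPU of the domain. */
    if ( d->max_vcpus > RING_SIZE(&ved->front_ring) )
        return -E2BIG;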
> > >
> > >
> > > I also don't see the point in marking a vCPU blocked if it is already
> > > paused. I think this behavior of blocking vCPUs only makes sense for
> > > asynchronous events. Razvan, could you test what happens if
> > > vm_event_mark_and_pause is only called if the vCPU is unpaused?
> >
> > It works for me with this change (using Xen 4.7 sources here):
> >
> > @@ -318,7 +329,11 @@ void vm_event_put_request(struct domain *d,
> >       * on how this mechanism works to avoid waiting. */
> >      avail_req = vm_event_ring_available(ved);
> >      if( current->domain == d && avail_req < d->max_vcpus )
> > -        vm_event_mark_and_pause(current, ved);
> > +    {
> > +        if ( !atomic_read( &current->vm_event_pause_count ) )
> > +            vm_event_mark_and_pause(current, ved);
> > +    }
>
> If I'm reading the code correctly, when max_vcpus is greater than the
> number of slots available in the ring, a race appears that can lead to
> a ring corruption (in debug mode ASSERT(free_req > 0) will trigger).
>
> For example, when a single slot is available, two vCPUs can race to
> vm_event_put_request() after both being given a green light in
> __vm_event_claim_slot(), whose return depends only on
> vm_event_ring_available() returning non-zero (which it can do, for both
> vCPUs at the same time).
>
> As it turns out, the bug being talked about prevented this from showing
> up.
>
> PS: https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=3643a961195f76ba849a213628c1979240e6fbdd


I was also looking at this but that's not the case. The function
__vm_event_claim_slot will eventually trickle down to vm_event_grab_slot,
which does take a lock. While holding the lock it calls
vm_event_ring_available, which returns how many free spots are left on the
ring (the number of items the ring can hold - ved->target_producers -
ved->foreign_producers). So if it returns 1, that last slot gets grabbed,
either ved->target_producers or ved->foreign_producers is incremented, and
only then do we unlock. The next caller to vm_event_grab_slot will not find a
free spot. When that happens, the caller of __vm_event_claim_slot either gets
put on the waitqueue (if sleeping is allowed) or gets back -EBUSY (if losing
the event is permissible).
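
Roughly, the path in question looks like this (paraphrased from memory from
xen/common/vm_event.c, so take the details with a grain of salt):

    static int vm_event_grab_slot(struct vm_event_domain *ved, int foreign)
    {
        unsigned int avail_req;

        if ( !ved->ring_page )
            return -ENOSYS;

        /* The lock covers both the availability check and the producer
         * increment, so two vCPUs cannot both claim the last free slot. */
        vm_event_ring_lock(ved);

        avail_req = vm_event_ring_available(ved);
        if ( avail_req == 0 )
        {
            vm_event_ring_unlock(ved);
            return -EBUSY;
        }

        if ( !foreign )
            ved->target_producers++;
        else
            ved->foreign_producers++;

        vm_event_ring_unlock(ved);

        return 0;
    }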

Tamas