Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

Nadathur, Sundar Sat, 31 Mar 2018 19:39:17 -0700

Hi Eric and all,

Thank you very much for considering my concerns and coming back with an improved solution. Glad that no blood was shed in the process.

I took this proposal and worked out its details, as I understand them, in this etherpad:

     https://etherpad.openstack.org/p/Cyborg-Nova-Multifunction

The intention of this detailed scheme is to include GPUs, FPGAs and all devices, but the focus may be more on FPGAs.

This scheme at first keeps the restriction that a multi-function device cannot be reprogrammed but, in the last section, explores which part of the sky will fall down if we do allow that. Maybe we'll get through this with tears but no blood!


Have a good rest of the weekend.

Regards,
Sundar

On 3/29/2018 9:43 AM, Eric Fried wrote:

We discussed this on IRC [1], hangout, and etherpad [2].  Here is the
summary, which we mostly seem to agree on:

There are two different classes of device we're talking about
modeling/managing.  (We don't know the real nomenclature, so forgive
errors in that regard.)

==> Fully dynamic: You can program one region with one function, and
then still program a different region with a different function, etc.

==> Single program: Once you program the card with a function, *all* its
virtual slots are *only* capable of that function until the card is
reprogrammed.  And while any slot is in use, you can't reprogram.  This
is Sundar's FPGA use case.  It is also Sylvain's VGPU use case.

The "fully dynamic" case is straightforward (in the sense of being what
placement was architected to handle).
* Model the PF/region as a resource provider.
* The RP has inventory of some generic resource class (e.g. "VGPU",
"SRIOV_NET_VF", "FPGA_FUNCTION").  Allocations consume that inventory,
plain and simple.
* As a region gets programmed dynamically, it's acceptable for the thing
doing the programming to set a trait indicating that that function is in
play.  (Sundar, this is the thing I originally said would get
resistance; but we've agreed it's okay.  No blood was shed :)
* Requests *may* use preferred traits to help them land on a card that
already has their function flashed on it. (Prerequisite: preferred
traits, which can be implemented in placement.  Candidates with the most
preferred traits get sorted highest.)
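
The preferred-traits ordering in the last bullet can be sketched as follows. This is a minimal illustration only: the `rank_candidates` helper, the candidate layout, and the trait names are hypothetical, and it is not placement's API (preferred traits are described above as something that still has to be implemented).

```python
def rank_candidates(candidates, preferred):
    """Return candidates ordered by how many preferred traits they carry."""
    preferred = set(preferred)
    # sorted() is stable, so candidates with equal scores keep their
    # original relative order.
    return sorted(
        candidates,
        key=lambda c: len(preferred & set(c["traits"])),
        reverse=True,
    )


# Hypothetical regions; only region_2 already has FUNCTION_X flashed.
candidates = [
    {"name": "region_1", "traits": {"CUSTOM_FUNCTION_Y"}},
    {"name": "region_2", "traits": {"CUSTOM_FUNCTION_X"}},
    {"name": "region_3", "traits": set()},
]
ranked = rank_candidates(candidates, ["CUSTOM_FUNCTION_X"])
print([c["name"] for c in ranked])
```

Because the traits are only preferred, region_1 and region_3 stay in the result; they simply sort after the region that already has the function loaded.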

The "single program" case needs to be handled more like what Alex
describes below.  TL;DR: We do *not* support dynamic programming,
traiting, or inventorying at instance boot time - it all has to be done
"up front".
* The PFs can be initially modeled as "empty" resource providers.  Or
maybe not at all.  Either way, *they can not be deployed* in this state.
* An operator or admin (via a CLI, config file, agent like blazar or
cyborg, etc.) preprograms the PF to have the specific desired
function/configuration.
   * This may be cyborg/blazar pre-programming devices to maintain an
available set of each function
   * This may be in response to a user requesting some function, which
causes a new image to be laid down on a device so it will be available
for scheduling
   * This may be a human doing it at cloud-build time
* This results in the resource provider being (created and) set up with
the inventory and traits appropriate to that function.
* Now deploys can happen, using required traits representing the desired
function.
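
The "single program" lifecycle above can be sketched as a small in-memory model. Everything here is an assumption made for illustration: the `Provider` class, the helper functions, and the resource class and trait names are hypothetical, not Cyborg or placement code.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Hypothetical stand-in for a resource provider representing one PF."""
    name: str
    inventory: dict = field(default_factory=dict)  # resource class -> total
    used: dict = field(default_factory=dict)       # resource class -> in use
    traits: set = field(default_factory=set)


def preprogram(rp, resource_class, slots, function_trait):
    """Done up front by an operator or agent, never at instance boot time."""
    if any(rp.used.values()):
        raise RuntimeError("cannot reprogram while any slot is in use")
    rp.inventory = {resource_class: slots}
    rp.used = {}
    rp.traits = {function_trait}


def can_deploy(rp, resource_class, amount, required):
    """A deploy needs free inventory and all required traits."""
    free = rp.inventory.get(resource_class, 0) - rp.used.get(resource_class, 0)
    return free >= amount and set(required) <= rp.traits


def allocate(rp, resource_class, amount):
    rp.used[resource_class] = rp.used.get(resource_class, 0) + amount


pf = Provider("pf0")  # starts "empty": no inventory, no traits
before = can_deploy(pf, "CUSTOM_FPGA_FUNCTION", 1, {"CUSTOM_FUNCTION_ENCRYPT"})
preprogram(pf, "CUSTOM_FPGA_FUNCTION", 4, "CUSTOM_FUNCTION_ENCRYPT")
after = can_deploy(pf, "CUSTOM_FPGA_FUNCTION", 1, {"CUSTOM_FUNCTION_ENCRYPT"})
```

The empty provider cannot satisfy any request, which matches the "they can not be deployed in this state" rule; once a slot is allocated, `preprogram` refuses to run again.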

-efried

[1]
http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-03-29.log.html#t2018-03-29T12:52:56
[2] https://etherpad.openstack.org/p/placement-dynamic-traiting

On 03/29/2018 07:38 AM, Alex Xu wrote:

Agreed. Whether we tweak inventory or traits, neither approach works.

As with VGPU, we can support a pre-programmed mode for multi-function
regions, where each region supports only one type of function.

There are two reasons why Cyborg has a filter:
* It records the usage of functions in a region.
* It records which function is programmed.

For #1, each region provides multiple functions, and each function can be
assigned to a VM. So we should create a ResourceProvider for the region, with
the function as its resource class. That is similar to an SR-IOV device: the
region (the PF) provides functions (the VFs).

For #2, we should use a trait to distinguish the function type.

Then we no longer keep any inventory info in Cyborg, we no longer need a
filter in Cyborg either, and there is no race condition anymore.
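
With that model, a request for one function of a given type reduces to a single placement query. A minimal sketch of building it, assuming the `resources=...&required=...` query form quoted elsewhere in this thread; the helper name and the resource class and trait names are hypothetical:

```python
from urllib.parse import urlencode


def allocation_candidates_query(resources, required):
    """Build the query for one function of a given type.

    resources: dict of resource class -> amount (the function count)
    required:  iterable of trait names (the function type)
    """
    params = {
        "resources": ",".join(
            f"{rc}:{amount}" for rc, amount in sorted(resources.items())
        ),
        "required": ",".join(sorted(required)),
    }
    # Keep ':' and ',' unescaped so the query stays readable.
    return "/allocation_candidates?" + urlencode(params, safe=":,")


query = allocation_candidates_query(
    {"CUSTOM_FPGA_FUNCTION": 1},   # hypothetical resource class
    ["CUSTOM_FUNCTION_ENCRYPT"],   # hypothetical function-type trait
)
print(query)
```

Both the usage count and the function type then live in placement, so there is nothing left for a Cyborg-side table to track.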

2018-03-29 2:48 GMT+08:00 Eric Fried <openst...@fried.cc>:

     Sundar-

             We're running across this issue in several places right now.  One
     thing that's definitely not going to get traction is
     automatically/implicitly tweaking inventory in one resource class when
     an allocation is made on a different resource class (whether in the same
     or different RPs).

             Slightly less of a nonstarter, but still likely to get significant
     push-back, is the idea of tweaking traits on the fly.  For example, your
     vGPU case might be modeled as:

     PGPU_RP: {
       inventory: {
           CUSTOM_VGPU_TYPE_A: 2,
           CUSTOM_VGPU_TYPE_B: 4,
       }
       traits: [
           CUSTOM_VGPU_TYPE_A_CAPABLE,
           CUSTOM_VGPU_TYPE_B_CAPABLE,
       ]
     }

             The request would come in for
     resources=CUSTOM_VGPU_TYPE_A:1&required=CUSTOM_VGPU_TYPE_A_CAPABLE, resulting
     in an allocation of CUSTOM_VGPU_TYPE_A:1.  Now while you're processing
     that, you would *remove* CUSTOM_VGPU_TYPE_B_CAPABLE from the PGPU_RP.
     So it doesn't matter that there's still inventory of
     CUSTOM_VGPU_TYPE_B:4, because a request including
     required=CUSTOM_VGPU_TYPE_B_CAPABLE won't be satisfied by this RP.
     There's of course a window between when the initial allocation is made
     and when you tweak the trait list.  In that case you'll just have to
     fail the loser.  This would be like any other failure in e.g. the spawn
     process; it would bubble up, the allocation would be removed; retries
     might happen or whatever.
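
The window described above can be walked through step by step. This is a simplified in-memory sketch, not placement code: the `satisfies` and `allocate` helpers are hypothetical, and inventory is treated as a plain free-unit counter.

```python
pgpu_rp = {
    "free": {"CUSTOM_VGPU_TYPE_A": 2, "CUSTOM_VGPU_TYPE_B": 4},
    "traits": {"CUSTOM_VGPU_TYPE_A_CAPABLE", "CUSTOM_VGPU_TYPE_B_CAPABLE"},
}


def satisfies(rp, resource_class, amount, required_trait):
    """True if the RP has enough free units and carries the required trait."""
    return (rp["free"].get(resource_class, 0) >= amount
            and required_trait in rp["traits"])


def allocate(rp, resource_class, amount):
    rp["free"][resource_class] -= amount


# 1. The type A request matches and is allocated.
assert satisfies(pgpu_rp, "CUSTOM_VGPU_TYPE_A", 1, "CUSTOM_VGPU_TYPE_A_CAPABLE")
allocate(pgpu_rp, "CUSTOM_VGPU_TYPE_A", 1)

# 2. The window: the trait list has not been tweaked yet, so a type B
#    request arriving now still matches. It is the loser that has to be
#    failed later.
in_window = satisfies(pgpu_rp, "CUSTOM_VGPU_TYPE_B", 1, "CUSTOM_VGPU_TYPE_B_CAPABLE")

# 3. The trait is removed while the type A request is being processed.
pgpu_rp["traits"].discard("CUSTOM_VGPU_TYPE_B_CAPABLE")

# 4. Type B inventory is untouched, but the RP no longer matches.
after_tweak = satisfies(pgpu_rp, "CUSTOM_VGPU_TYPE_B", 1, "CUSTOM_VGPU_TYPE_B_CAPABLE")
```

Step 2 evaluates to true and step 4 to false, even though four units of type B remain in inventory throughout.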

             Like I said, you're likely to get a lot of resistance to this idea
     as well.  (Though TBH, I'm not sure how we can stop you beyond -1'ing your
     patches; there's nothing about placement that disallows it.)

             The simple-but-inefficient solution is simply that we'd still be
     able to make allocations for vGPU type B, but you would have to fail right
     away when it came down to cyborg to attach the resource.  Which is code
     you pretty much have to write anyway.  It's an improvement if cyborg
     gets to be involved in the post-get-allocation-candidates
     weighing/filtering step, because you can do that check at that point to
     help filter out the candidates that would fail.  Of course there's still
     a race condition there, but it's no different than for any other
     resource.

     efried

     On 03/28/2018 12:27 PM, Nadathur, Sundar wrote:
     > Hi Eric and all,
     >     I should have clarified that this race condition happens only for
     > the case of devices with multiple functions. There is a prior thread
     > <http://lists.openstack.org/pipermail/openstack-dev/2018-March/127882.html>
     > about it. I was trying to get a solution within Cyborg, but that faces
     > this race condition as well.
     >
     > IIUC, this situation is somewhat similar to the issue with vGPU types
     > <http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-03-27.log.html#t2018-03-27T13:41:00>
     > (thanks to Alex Xu for pointing this out). In the latter case, we could
     > start with an inventory of (vgpu-type-a: 2; vgpu-type-b: 4).  But, after
     > consuming a unit of vGPU-type-a, ideally the inventory should change
     > to: (vgpu-type-a: 1; vgpu-type-b: 0). With multi-function accelerators,
     > we start with an RP inventory of (region-type-A: 1, function-X: 4). But,
     > after consuming a unit of that function, ideally the inventory should
     > change to: (region-type-A: 0, function-X: 3).
     >
     > I understand that this approach is controversial :) Also, one difference
     > from the vGPU case is that the number and count of vGPU types is static,
     > whereas with FPGAs, one could reprogram it to result in more or fewer
     > functions. That said, we could hopefully keep this analogy in mind for
     > future discussions.
     >
     > We probably will not support multi-function accelerators in Rocky. This
     > discussion is for the longer term.
     >
     > Regards,
     > Sundar
     >
     > On 3/23/2018 12:44 PM, Eric Fried wrote:
     >> Sundar-
     >>
     >>      First thought is to simplify by NOT keeping inventory information in
     >> the cyborg db at all.  The provider record in the placement service
     >> already knows the device (the provider ID, which you can look up in the
     >> cyborg db) the host (the root_provider_uuid of the provider representing
     >> the device) and the inventory, and (I hope) you'll be augmenting it with
     >> traits indicating what functions it's capable of.  That way, you'll
     >> always get allocation candidates with devices that *can* load the
     >> desired function; now you just have to engage your weigher to prioritize
     >> the ones that already have it loaded so you can prefer those.
     >>
     >>      Am I missing something?
     >>
     >>              efried
     >>
     >> On 03/22/2018 11:27 PM, Nadathur, Sundar wrote:
     >>> Hi all,
     >>>     There seems to be a possibility of a race condition in the
     >>> Cyborg/Nova flow. Apologies for missing this earlier. (You can refer to
     >>> the proposed Cyborg/Nova spec
     >>> <https://review.openstack.org/#/c/554717/1/doc/specs/rocky/cyborg-nova-sched.rst>
     >>> for details.)
     >>>
     >>> Consider the scenario where the flavor specifies a resource class for a
     >>> device type, and also specifies a function (e.g. encrypt) in the extra
     >>> specs. The Nova scheduler would only track the device type as a
     >>> resource, and Cyborg needs to track the availability of functions.
     >>> Further, to keep it simple, say all the functions exist all the time (no
     >>> reprogramming involved).
     >>>
     >>> To recap, here is the scheduler flow for this case:
     >>>
     >>>   * A request spec with a flavor comes to Nova conductor/scheduler. The
     >>>     flavor has a device type as a resource class, and a function in the
     >>>     extra specs.
     >>>   * Placement API returns the list of RPs (compute nodes) which contain
     >>>     the requested device types (but not necessarily the function).
     >>>   * Cyborg will provide a custom filter which queries Cyborg DB. This
     >>>     needs to check which hosts contain the needed function, and filter
     >>>     out the rest.
     >>>   * The scheduler selects one node from the filtered list, and the
     >>>     request goes to the compute node.
     >>>
     >>> For the filter to work, the Cyborg DB needs to maintain a table with
     >>> triples of (host, function type, #free units). The filter checks if a
     >>> given host has one or more free units of the requested function type.
     >>> But, to keep the #free units up to date, Cyborg on the selected compute
     >>> node needs to notify the Cyborg API to decrement the #free units when an
     >>> instance is spawned, and to increment them when resources are released.
     >>>
     >>> Therein lies the catch: this loop from the compute node to controller is
     >>> susceptible to race conditions. For example, if two simultaneous
     >>> requests each ask for function A, and there is only one unit of that
     >>> available, the Cyborg filter will approve both, both may land on the
     >>> same host, and one will fail. This is because Cyborg on the controller
     >>> does not decrement resource usage due to one request before processing
     >>> the next request.
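
The race above is a check-then-act problem, and it can be reproduced deterministically with a small sketch. The table layout follows the (host, function type, #free units) triples described earlier; the function names and the host and function values are hypothetical, not Cyborg code.

```python
# Cyborg DB table on the controller: (host, function type) -> #free units
free_units = {("host1", "function_A"): 1}


def cyborg_filter(hosts, function):
    """Scheduler-side check: keep hosts with at least one free unit."""
    return [h for h in hosts if free_units.get((h, function), 0) >= 1]


def spawn(host, function):
    """Compute-side act: the count is only decremented here, after filtering."""
    key = (host, function)
    if free_units[key] < 1:
        return False          # the unit is already gone, so the spawn fails
    free_units[key] -= 1
    return True


# Two simultaneous requests are both filtered before either one spawns,
# so both see the same stale count of 1.
approved_1 = cyborg_filter(["host1"], "function_A")
approved_2 = cyborg_filter(["host1"], "function_A")

results = [spawn(approved_1[0], "function_A"),
           spawn(approved_2[0], "function_A")]
```

Both requests pass the filter, but only the first spawn finds a free unit; the second fails on the compute node rather than at scheduling time.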
     >>>
     >>> This is similar to this previous Nova scheduling issue
     >>> <https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/placement-claims.html>.
     >>> That was solved by having the scheduler claim a resource in Placement
     >>> for the selected node. I don't see an analog for Cyborg, since it would
     >>> not know which node is selected.
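
The idea behind a scheduler-side claim can be sketched as a check and a decrement performed as one atomic step. This is only an in-process stand-in with a lock: the `ClaimTable` class is hypothetical, and the real Placement service is a REST API rather than a shared object.

```python
import threading


class ClaimTable:
    """Hypothetical table of free units where check and decrement are atomic."""

    def __init__(self, free):
        self._free = dict(free)
        self._lock = threading.Lock()

    def claim(self, host, resource_class, amount=1):
        key = (host, resource_class)
        with self._lock:
            # Check and decrement happen under one lock, so no other
            # request can observe the stale count in between.
            if self._free.get(key, 0) < amount:
                return False
            self._free[key] -= amount
            return True


table = ClaimTable({("host1", "function_A"): 1})
results = []


def request():
    results.append(table.claim("host1", "function_A"))


threads = [threading.Thread(target=request) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However the eight requests interleave, exactly one claim succeeds, and the losers are rejected at scheduling time instead of failing later on the compute node.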
     >>>
     >>> Thanks in advance for suggestions and solutions.
     >>>
     >>> Regards,
     >>> Sundar
     >>>
     >>> __________________________________________________________________________
     >>> OpenStack Development Mailing List (not for usage questions)
     >>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
     >>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
