Robert, thanks for your long reply. Personally I'd prefer option 2/3, as it keeps Nova the only entity for PCI management.
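Just as a reference point for option 3, a rough and untested sketch of what the udev-based rename might look like on a compute node is below; the rules file path, the naming scheme and the helper name are only assumptions on my side, not anything we have agreed on.

import subprocess

# Rough sketch of option 3 (interface rename via udev): pin the VF's interface
# name with a rule keyed on its MAC address, then reload udev so no reboot is
# needed. The rules file path and naming scheme are assumptions for
# illustration only.
RULE = 'SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="%(mac)s", NAME="%(name)s"\n'

def pin_vf_name(mac, name, rules_file='/etc/udev/rules.d/70-sriov-rename.rules'):
    # append the rename rule for this VF
    with open(rules_file, 'a') as f:
        f.write(RULE % {'mac': mac, 'name': name})
    # pick up the new rule without rebooting
    subprocess.check_call(['udevadm', 'control', '--reload-rules'])
    subprocess.check_call(['udevadm', 'trigger', '--subsystem-match=net'])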
Glad you are ok with Ian's proposal and that we have a solution to resolve the libvirt network scenario in that framework.

Thanks
--jyh

> -----Original Message-----
> From: Robert Li (baoli) [mailto:ba...@cisco.com]
> Sent: Friday, January 17, 2014 7:08 AM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
>
> Yunhong,
>
> Thank you for bringing up the question of live migration support. In addition
> to the two solutions you mentioned, Irena has a different one. Let me put all
> of them here again:
>
> 1. Network xml/group based solution.
>    In this solution, each host that supports a provider net/physical net can
>    define a SRIOV group (it's hard to avoid the term, as you can see from the
>    suggestion you made based on the PCI flavor proposal). For each SRIOV group
>    supported on a compute node, a network XML will be created the first time
>    the nova compute service runs on that node.
>    * nova will conduct scheduling, but not PCI device allocation
>    * it's a simple and clean solution, documented in libvirt as the way to
>      support live migration with SRIOV. In addition, a network xml maps nicely
>      onto a provider net.
>
> 2. Network xml per PCI device based solution.
>    This is the solution you brought up in this email, and Ian mentioned it to
>    me as well. In this solution, a network xml is created when a VM is created,
>    and it needs to be removed once the VM is removed. This hasn't been tried
>    out as far as I know.
>
> 3. Interface xml/interface rename based solution.
>    Irena brought this up. In this solution, the ethernet interface name
>    corresponding to the PCI device attached to the VM needs to be renamed. One
>    way to do so without requiring a system reboot is to change the udev rules
>    file for interface renaming, followed by a udev reload.
>
> Now, with the first solution, Nova doesn't seem to have control over or
> visibility of the PCI device allocated for the VM before the VM is launched.
> This needs to be confirmed against the libvirt support, to see whether such a
> capability can be provided. This may be a potential drawback if a neutron
> plugin requires detailed PCI device information for operation. Irena may
> provide more insight into this. Ideally, neutron shouldn't need this
> information because the device configuration can be done by libvirt invoking
> the PCI device driver.
>
> The other two solutions are similar. For example, you can view the second
> solution as one way to rename an interface, or to camouflage an interface
> under a network name. They both require additional work before the VM is
> created and after the VM is removed.
>
> I also agree with you that we should take a look at XenAPI on this.
>
> With regard to your suggestion on how to implement the first solution with
> some predefined group attribute, I think it definitely can be done. As I
> pointed out earlier, the PCI flavor proposal is actually a generalized version
> of the PCI group. In other words, in the PCI group proposal we have one
> predefined attribute called PCI group, and everything else works on top of
> that. In the PCI flavor proposal, attributes are arbitrary. So certainly we
> can define a particular attribute for networking, which let's temporarily call
> sriov_group. But I can see that with this idea of predefined attributes, more
> of them will be required by different types of devices in the future.
> I'm sure it will keep us busy, although I'm not sure it's in a good way.
>
> I was expecting that you or someone else could provide a practical deployment
> scenario that would justify the flexibilities and the complexities. Although
> I'd prefer to keep it simple and generalize it later once a particular
> requirement is clearly identified, I'm fine going with it if that's what most
> of the folks want to do.
>
> --Robert
>
>
> On 1/16/14 8:36 PM, "yunhong jiang" <yunhong.ji...@linux.intel.com> wrote:
>
> > On Thu, 2014-01-16 at 01:28 +0100, Ian Wells wrote:
> > > To clarify a couple of Robert's points, since we had a conversation
> > > earlier:
> > >
> > > On 15 January 2014 23:47, Robert Li (baoli) <ba...@cisco.com> wrote:
> > > > --- do we agree that BDF address (or device id, whatever you call it),
> > > > and node id shouldn't be used as attributes in defining a PCI flavor?
> > >
> > > Note that the current spec doesn't actually exclude it as an option. It's
> > > just an unwise thing to do. In theory, you could elect to define your
> > > flavors using the BDF attribute, but determining 'the card in this slot
> > > is equivalent to all the other cards in the same slot in other machines'
> > > is probably not the best idea... We could lock it out as an option, or we
> > > could just assume that administrators wouldn't be daft enough to try.
> > >
> > > > * the compute node needs to know the PCI flavor.
> > > >   [...]
> > > >   - to support live migration, we need to use it to create network xml
> > >
> > > I didn't understand this at first and it took me a while to get what
> > > Robert meant here.
> > >
> > > This is based on Robert's current code for macvtap based live migration.
> > > The issue is that if you wish to migrate a VM and it's tied to a physical
> > > interface, you can't guarantee that the same physical interface is going
> > > to be used on the target machine, but at the same time you can't change
> > > the libvirt.xml as it comes over with the migrating machine. The answer
> > > is to define a network and refer out to it from libvirt.xml. In Robert's
> > > current code he's using the group name of the PCI devices to create a
> > > network containing the list of equivalent devices (those in the group)
> > > that can be macvtapped. Thus when the host migrates it will find another,
> > > equivalent, interface. This falls over in the use case under
> > > consideration where a device can be mapped using more than one flavor, so
> > > we have to discard the use case or rethink the implementation.
> > >
> > > There's a more complex solution - I think - where we create a temporary
> > > network for each macvtap interface a machine's going to use, with a name
> > > based on the instance UUID and port number, and containing the device to
> > > map. Before starting the migration we would create a replacement network
> > > containing only the new device on the target host; migration would find
> > > the network from the name in the libvirt.xml, and the content of that
> > > network would behave identically. We'd be creating libvirt networks on
> > > the fly and a lot more of them, and we'd need decent cleanup code too
> > > ('when freeing a PCI device, delete any network it's a member of'), so it
> > > all becomes a lot more hairy.
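Ian's "temporary network per macvtap interface" idea above could look roughly like the sketch below. It is untested; the libvirt-python calls exist, but the inst-<uuid>-<port> naming scheme and the helper names are made up for illustration.

import libvirt

# One throwaway <network> per (instance, port), holding only the netdev of the
# VF actually allocated on this host. The domain XML refers to the network by
# name, so the target host can pre-create a network with the same name but its
# own equivalent VF before migration starts.
NETWORK_XML = """<network>
  <name>%(name)s</name>
  <forward mode="bridge">
    <interface dev="%(dev)s"/>
  </forward>
</network>"""

def define_port_network(conn, instance_uuid, port_id, netdev):
    name = 'inst-%s-%s' % (instance_uuid, port_id)
    net = conn.networkDefineXML(NETWORK_XML % {'name': name, 'dev': netdev})
    net.create()
    return name

def cleanup_port_network(conn, instance_uuid, port_id):
    # "when freeing a PCI device, delete any network it's a member of"
    net = conn.networkLookupByName('inst-%s-%s' % (instance_uuid, port_id))
    net.destroy()
    net.undefine()

# usage: conn = libvirt.open('qemu:///system'); define_port_network(conn, ...)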
> > Ian/Robert, below is my understanding of the method Robert wants to use; am
> > I right?
> >
> > a) Define a libvirt network as in the "Using a macvtap 'direct' connection"
> > section of http://libvirt.org/formatnetwork.html . For example, like the
> > following one:
> >
> >   <network>
> >     <name>group_name1</name>
> >     <forward mode="bridge">
> >       <interface dev="eth20"/>
> >       <interface dev="eth21"/>
> >       <interface dev="eth22"/>
> >       <interface dev="eth23"/>
> >       <interface dev="eth24"/>
> >     </forward>
> >   </network>
> >
> > b) When assigning SRIOV NIC devices to an instance, as in the "Assignment
> > from a pool of SRIOV VFs in a libvirt <network> definition" section of
> > http://wiki.libvirt.org/page/Networking#PCI_Passthrough_of_host_network_devices ,
> > use the libvirt network definition group_name1. For example, like the
> > following one:
> >
> >   <interface type='network'>
> >     <source network='group_name1'/>
> >   </interface>
> >
> > If my understanding is correct, then I still have something unclear:
> >
> > a) How will libvirt create the libvirt network (i.e. libvirt network
> > group_name1)? Will it be created when the compute node boots up, or will it
> > be created before instance creation? I suppose per Robert's design it's
> > created when the compute node is up, am I right?
> >
> > b) If all the interfaces are used up by instances, what will happen?
> > Considering that five interfaces are allocated to the group_name1 libvirt
> > network, and a user tries to migrate six instances with the 'group_name1'
> > network, what will happen?
> >
> > And below are my comments:
> >
> > a) Yes, this is in fact different from the current Nova PCI support
> > philosophy. Currently we assume Nova owns the devices and manages the
> > device assignment to each instance, while in this situation the libvirt
> > network is in fact another layer of PCI device management (although a very
> > thin one)!
> >
> > b) This also reminds me that possibly other VMMs like XenAPI have special
> > requirements, and we need input/confirmation from them also.
> >
> > As for how to resolve the issue, I think there are several solutions:
> >
> > a) Create one libvirt network for each SRIOV NIC assigned to each instance,
> > i.e. the libvirt network always has only one interface included; it may be
> > created statically or dynamically. This solution in fact removes the
> > allocation functionality of the libvirt network and leaves only the
> > configuration functionality.
> >
> > b) Change Nova PCI to support a special type of PCI device attribute (like
> > the PCI group). For these attributes, the PCI device scheduler will match a
> > PCI device only if the attribute is specified explicitly in the PCI flavor.
> >
> >    Below is an example, considering two PCI SRIOV devices:
> >      Dev1: BDF=00:0.1, vendor_id=1, device_id=1, group=grp1
> >      Dev2: BDF=00:1.1, vendor_id=1, device_id=2
> >    i.e. no group attribute is specified for Dev2.
> >
> >    And we mark the 'group' attribute as a special attribute.
> >
> >    Consider the following flavors:
> >      Flavor1: name=flv1, vendor_id=1
> >      Flavor2: name=flv2, vendor_id=1, group=grp1
> >      Flavor3: name=flv3, group=grp1
> >
> >    Dev1 will never be assigned to flv1, since it carries the special 'group'
> >    attribute and flv1 does not name it.
> >
> >    This solution tries to separate the devices managed exclusively by Nova
> >    from the devices managed by Nova and libvirt together.
> >
> > Any idea?
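To make option (b) a bit more concrete, here is a rough sketch of the matching rule I have in mind; the field names and the dict-based representation are only for illustration and are not the actual Nova data model.

# Attributes listed here must be named explicitly by a flavor before a device
# carrying them can be matched (option (b) above). Illustrative only.
SPECIAL_ATTRS = ('group',)

def flavor_matches(flavor_spec, device):
    # normal matching: every attribute the flavor asks for must be on the device
    for key, value in flavor_spec.items():
        if key == 'name':
            continue
        if device.get(key) != value:
            return False
    # special rule: a device carrying a special attribute is only handed out
    # when the flavor names that attribute explicitly
    for attr in SPECIAL_ATTRS:
        if attr in device and attr not in flavor_spec:
            return False
    return True

dev1 = {'vendor_id': '1', 'device_id': '1', 'group': 'grp1'}
dev2 = {'vendor_id': '1', 'device_id': '2'}
flv1 = {'name': 'flv1', 'vendor_id': '1'}
flv2 = {'name': 'flv2', 'vendor_id': '1', 'group': 'grp1'}
flv3 = {'name': 'flv3', 'group': 'grp1'}

assert not flavor_matches(flv1, dev1)  # grp1 devices are never handed to flv1
assert flavor_matches(flv1, dev2)
assert flavor_matches(flv2, dev1) and flavor_matches(flv3, dev1)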
> > Thanks
> > --jyh

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev