Ian, thanks for your reply. Please see my responses, prefixed with '[yjiang5]'.

--jyh

From: Ian Wells [mailto:ijw.ubu...@cack.org.uk]
Sent: Friday, January 10, 2014 4:08 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

On 10 January 2014 07:40, Jiang, Yunhong <yunhong.ji...@intel.com> wrote:
Robert, sorry that I'm not a fan of your *group* term. To me, *group* mixes 
two things: it's an extra property provided by configuration, and it's also a 
rather inflexible mechanism for selecting devices (you can only select devices 
based on the 'group name' property).

It is exactly that.  It's 0 new config items, 0 new APIs, just an extra tag on 
the whitelists that are already there (although the proposal suggests changing 
the name of them to be more descriptive of what they now do).  And you talk 
about flexibility as if this changes frequently, but in fact the grouping / 
aliasing of devices almost never changes after installation, which is, not 
coincidentally, when the config on the compute nodes gets set up.
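
For illustration only (the tag name and exact entry format aren't settled 
anywhere in this thread), such a whitelist entry might look roughly like:

    # Hypothetical sketch of the groups proposal: an existing whitelist
    # entry gains one extra tag naming its group; nothing else changes.
    whitelist_entry = {
        "address": "0000:06:00.*",      # devices matched on this host
        "vendor_id": "8086",
        "product_id": "10ed",
        "group": "physnet1-vfs",        # the extra tag; name is illustrative
    }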

1)       A dynamic group is much better. For example, a user may want to 
select a GPU device based on vendor_id, or on vendor_id+device_id. In other 
words, the user wants to create a group based on vendor_id, or on 
vendor_id+device_id, and select devices from that group.  John's proposal to 
provide an API for creating the PCI flavor (or alias) is very good. I prefer 
'flavor' because it's more in the OpenStack style.
I disagree with this.  I agree that what you're saying offers more 
flexibility after initial installation, but I have various issues with it.
[yjiang5] I think you're mostly talking about the whitelist, not the PCI 
flavor. A PCI flavor is about the PCI request, like "I want a device with 
vendor_id=cisco, device_id=15454E" or "vendor_id=intel, device_class=nic" 
(because the image has the driver for all Intel NIC cards :) ). The whitelist, 
by contrast, decides which devices are assignable on a host.
"

This is directly related to the hardware configuration on each compute node.  
For (some) other things of this nature, like provider networks, the compute 
node is the only thing that knows what it has attached to it, and it is the 
store (in configuration) of that information.  If I add a new compute node then 
it's my responsibility to configure it correctly on attachment, but when I add 
a compute node (when I'm setting the cluster up, or sometime later on) then 
it's at that precise point that I know how I've attached it and what hardware 
it's got on it.  Also, it's at that point in time that I write out the 
configuration file (not by hand, note; there's almost certainly automation when 
configuring hundreds of nodes so arguments that 'if I'm writing hundreds of 
config files one will be wrong' are moot).

I'm also not sure there's much reason to change the available devices 
dynamically after that, since that's normally an activity that results from 
changing the physical setup of the machine, which implies that you're actually 
going to have access to the config and be able to change it as you do so.  John 
did come up with one case where you might be trying to remove old GPUs from 
circulation, but it's a very uncommon case that doesn't seem worth coding for, 
and it's still achievable by changing the config and restarting the compute 
processes.
[yjiang5] I totally agree with you that the whitelist is statically defined at 
provisioning time. I just want to separate the 'provider network' information 
into another piece of configuration (extra information about the device). The 
whitelist is just a whitelist: it decides which devices are assignable. The 
provider network is information about the device; it's not in the scope of the 
whitelist.
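
A minimal sketch of that separation, assuming a hypothetical split into two 
options (both names invented here):

    # Hypothetical: the whitelist only says what is assignable...
    pci_whitelist = [{"address": "0000:06:00.*"}]
    # ...while extra, per-device information such as the provider network
    # it is wired to lives in a separate piece of configuration:
    pci_extra_info = {"0000:06:00.*": {"physical_network": "physnet1"}}
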
This also reduces the autonomy of the compute node in favour of centralised 
tracking, which goes against the 'distributed where possible' philosophy of 
OpenStack.
Finally, you're not actually removing configuration from the compute node.  You 
still have to configure a whitelist there; in the grouping design you also have 
to configure the grouping (flavouring) on the control node.  The groups 
proposal adds one extra piece of information to the whitelists that are already 
there to mark groups, not a whole new set of config lines.
[yjiang5] Still, the whitelist is there to decide which devices are 
assignable, not to provide device information. We shouldn't mix functionality 
into the configuration. If it's OK, I'd simply like to discard the 'group' 
term :) The Nova PCI flow is simple: the compute node provides PCI devices 
(based on the whitelist), the scheduler tracks the PCI device information 
(abstracted as pci_stats for performance reasons), and the API provides a way 
for the user to specify the devices they want (the PCI flavor). The current 
implementation needs enhancement at each step of the flow, but I really see no 
reason to have the "group". Yes, the PCI flavor in fact creates groups based 
on PCI properties, but it's better expressed as a flavor.
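
A rough, self-contained sketch of that flow (all names here are illustrative, 
not Nova's real internals):

    # Illustrative sketch only; names do not correspond to Nova's real code.
    host_devices = [{"address": "0000:06:00.1",
                     "vendor_id": "8086", "product_id": "10ed"}]
    whitelist = [{"vendor_id": "8086"}]

    def matches(dev, specs):
        # a device matches if every field of any one spec agrees with it
        return any(all(dev.get(k) == v for k, v in s.items()) for s in specs)

    # 1. compute node: the whitelist decides which devices are assignable
    assignable = [d for d in host_devices if matches(d, whitelist)]
    # 2. scheduler: tracks a summary (pci_stats) of the assignable devices
    # 3. API: the user asks for devices through a PCI flavor, e.g.
    flavor = {"vendor_id": "8086", "product_id": "10ed"}
    can_satisfy = any(matches(d, [flavor]) for d in assignable)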

To compare scheduling behaviour:

If I need 4G of RAM, each compute node has reported its summary of free RAM to 
the scheduler.  I look for a compute node with 4G free and filter the list of 
compute nodes down.  This is a query on n records, n being the number of 
compute nodes.  I schedule to the compute node, which then confirms it does 
still have 4G free and runs the VM or rejects the request.

If I need 3 PCI devices and use the current system, each machine has reported 
its device allocations to the scheduler.  With SRIOV multiplying up the number 
of available devices, it's reporting back hundreds of records per compute node 
to the schedulers, and the filtering activity is 3 queries over n * (PCI 
devices per node) records, which could easily end up in the tens or even 
hundreds of thousands of records for a moderately sized cloud.  The compute 
node also has a record of its device allocations, which is also checked and 
updated before the final request is run.

If I need 3 PCI devices and use the groups system, each machine has reported 
its device *summary* to the scheduler.  With SRIOV multiplying up the number of 
available devices, it's still reporting one or a small number of categories, 
i.e. { net: 100 }.  The difficulty of scheduling is a query on (num groups * n) 
records - fewer, in fact, if some machines have no passthrough devices.
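
To put rough, invented numbers on the difference:

    # Illustrative arithmetic only; the figures are made up.
    nodes = 500                 # compute nodes in the cloud
    vfs_per_node = 126          # SR-IOV VFs exposed per node
    groups_per_node = 2         # summary categories, e.g. { net: 100, ... }

    per_device_records = nodes * vfs_per_node    # 63000 rows for the filter
    per_group_records = nodes * groups_per_node  # 1000 rows for the filter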

[yjiang5] That's the reason we have pci_stats. pci_stats is a *summary* of the 
PCI device information, based on *selected* PCI properties like vendor_id and 
device_id. If we assume all the VFs have the same vendor_id/device_id, it in 
fact becomes only one entry in pci_stats! However, we still keep the detailed 
information like vendor_id/device_id in the scheduler for decision making, 
instead of an opaque 'group name'.  And with a configuration option to select 
which properties are used for pci_stats, like 'vendor_id' only, or 
'vendor_id/device_id', it's much more flexible.  And if we extend Nova PCI to 
support user-defined properties, you can simply add a property like 'net' to 
all your assignable devices and then configure 'net' as the only property used 
for pci_stats; that's exactly the implementation of your idea!
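
A minimal sketch of that idea, assuming the pool keys are themselves 
selectable (names hypothetical, not Nova's actual pci_stats code):

    # Hypothetical sketch: the properties that key the pci_stats pools are
    # configurable, so a 'net'-only key collapses everything into one
    # { net: 100 }-style entry, while vendor_id/product_id keeps more detail.
    def pci_stats(devices, keys):
        pools = {}
        for dev in devices:
            key = tuple(dev.get(k) for k in keys)
            pools[key] = pools.get(key, 0) + 1
        return pools

    devices = [{"vendor_id": "8086", "product_id": "10ed",
                "net": "physnet1"}] * 100
    print(pci_stats(devices, ["net"]))                      # {('physnet1',): 100}
    print(pci_stats(devices, ["vendor_id", "product_id"]))  # {('8086', '10ed'): 100}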

You can see that there's quite a cost to be paid for having those flexible 
alias APIs.
4)       IMHO, the core of Nova PCI support is the *PCI property*. Properties 
include not only generic PCI properties like vendor id, device id and device 
type, and compute-specific properties like the BDF address or the adjacent 
switch IP address, but also user-defined properties like Neutron's physical 
network name, etc. And then it's about how to get these properties, how to 
select/group devices based on them, and how to store/fetch them.
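
For instance, a device record could carry something like this (values purely 
illustrative):

    # Purely illustrative values; just the kinds of properties meant above.
    device = {
        # generic PCI properties
        "vendor_id": "8086", "product_id": "10ed", "device_type": "VF",
        # compute-specific properties
        "address": "0000:06:10.1",        # BDF on this compute node
        "switch_ip": "10.1.1.5",          # adjacent switch (hypothetical)
        # user-defined properties
        "physical_network": "physnet1",   # Neutron provider network name
    }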

The thing about this is that you don't always, or even often, want to select by 
property.  Some of these properties are just things that you need to tell 
Neutron; they're not usually keys for scheduling.
[yjiang5] Yes, that's the reason for pci_stats, which uses only the selected 
properties for scheduling. But I don't want to fix the selected properties to 
be only the 'group name'!

Thanks
--jyh
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
