Hi Jonathan,

If I understand correctly, that bug is about multiple scheduler instances (processes) doing scheduling at the same time. When a compute node finds itself unable to fulfill a create_instance request, it sends the request back to the scheduler (scheduler_max_attempts is there to avoid endless retries). From your description, I only see one scheduler. And you are right: even if the memory accounting has some issue, cpu_allocation_ratio should have kept the scheduler from placing more vCPUs on a node than it has pCPUs. I've put two rough sketches of what I mean below.

What OpenStack package are you using?
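First, the retry path. Roughly, it works like the following (a simplified sketch, not the actual nova code; the host names and the build_succeeds() stub are made up for illustration):

    # Rough sketch of the reschedule loop, NOT nova's actual code.
    # When a build fails, the compute node casts the request back to
    # the scheduler; RetryFilter then skips hosts that already failed.
    def schedule_with_retries(hosts, build_succeeds, max_attempts=30):
        retry = {'num_attempts': 0, 'hosts': []}
        while retry['num_attempts'] < max_attempts:
            retry['num_attempts'] += 1
            # RetryFilter: drop hosts that already failed this request.
            candidates = [h for h in hosts if h not in retry['hosts']]
            if not candidates:
                break
            host = candidates[0]  # stand-in for the real weighing step
            if build_succeeds(host):
                return host
            retry['hosts'].append(host)  # remember the failure, retry
        raise RuntimeError('NoValidHost: ran out of scheduling attempts')

    # Toy run: the first two hosts reject the build, the third accepts.
    print(schedule_with_retries(['nova-21', 'nova-22', 'nova-23'],
                                build_succeeds=lambda h: h == 'nova-23'))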
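Second, the capacity check that cpu_allocation_ratio drives. Conceptually it is something like this (again a simplified sketch, not nova's CoreFilter itself; the argument names are illustrative):

    # Simplified CoreFilter-style check, NOT the actual nova code.
    def host_passes(total_pcpus, vcpus_used, requested,
                    cpu_allocation_ratio=1.0):
        vcpu_limit = total_pcpus * cpu_allocation_ratio
        return vcpus_used + requested <= vcpu_limit

    # With your settings, 24 pCPUs * 1.0 gives a 24 vCPU cap, so a
    # host reporting 107 vCPUs used should never pass -- unless the
    # scheduler decides faster than the usage data is updated.
    print(host_passes(total_pcpus=24, vcpus_used=23, requested=1))   # True
    print(host_passes(total_pcpus=24, vcpus_used=107, requested=1))  # False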
On Wed, Oct 31, 2012 at 11:41 PM, Jonathan Proulx <j...@jonproulx.com> wrote:
> Hi All,
>
> While the RetryScheduler may not have been designed specifically to
> fix this issue, https://bugs.launchpad.net/nova/+bug/1011852 suggests
> that it is meant to fix it, well, if "it" is a scheduler race
> condition, which is my suspicion.
>
> This is my current scheduler config, which gives the failure mode I
> describe:
>
> scheduler_available_filters=nova.scheduler.filters.standard_filters
> scheduler_default_filters=AvailabilityZoneFilter,RamFilter,CoreFilter,ComputeFilter,RetryFilter
> scheduler_max_attempts=30
> least_cost_functions=nova.scheduler.least_cost.compute_fill_first_cost_fn
> compute_fill_first_cost_fn_weight=1.0
> cpu_allocation_ratio=1.0
> ram_allocation_ratio=1.0
>
> I'm running the scheduler and API server on a single controller host,
> and it's pretty consistent about scheduling more than a hundred
> instances per node at first, then iteratively rescheduling them
> elsewhere, when presented with either a single API request to start
> many instances (using euca2ools) or a shell loop around nova boot to
> generate one API request per server.
>
> The cpu_allocation_ratio should limit the scheduler to 24 instances
> per compute node regardless of how it's calculating memory, so while
> I talked a lot about memory allocation as a motivation, it is more
> often CPU that is actually the limiting factor in my deployment, and
> it certainly should be here.
>
> And yet, after attempting to launch 200 m1.tiny instances:
>
> root@nimbus-0:~# nova-manage service describe_resource nova-23
> 2012-10-31 11:17:56
> HOST     PROJECT                           cpu  mem(mb)  hdd
> nova-23  (total)                            24    48295  882
> nova-23  (used_now)                        107    56832   30
> nova-23  (used_max)                        107    56320   30
> nova-23  98333a1a28e746fa8c629c83a818ad57  106    54272    0
> nova-23  3008a142e9524f7295b06ea811908f93    1     2048   30
>
> Eventually those bleed off to other systems, though not entirely:
>
> 2012-10-31 11:29:41
> HOST     PROJECT                           cpu  mem(mb)  hdd
> nova-23  (total)                            24    48295  882
> nova-23  (used_now)                         43    24064   30
> nova-23  (used_max)                         43    23552   30
> nova-23  98333a1a28e746fa8c629c83a818ad57   42    21504    0
> nova-23  3008a142e9524f7295b06ea811908f93    1     2048   30
>
> At this point, 12 minutes later, out of 200 instances 168 are active,
> 22 are errored, and 10 are still "building". Notably, only 23 actual
> VMs are running on nova-23:
>
> root@nova-23:~# virsh list | grep instance | wc -l
> 23
>
> So that's what I see; perhaps my assumptions about why I'm seeing it
> are incorrect.
>
> Thanks,
> -Jon

--
Regards
Huang Zhiteng