Well (not in any kind of priority order), the issues/ideas I've heard are:

1) Don't send data to the scheduler; have the scheduler poll each compute node on each schedule request. I'm pretty sure that, in a large cloud with 100s if not 1000s of compute nodes, this would add too much latency to the schedule request.

2) Fan-out data is too laggy. I don't understand why the fan-out messages should be any laggier than updating the DB. In both cases a message has to be sent to a remote server (fan-out messages to the scheduler, DB updates to the DB server). Given that the bulk of the lag should be in getting the message to the remote server, I would expect the lags to be approximately equivalent.

3) Fan-out messages are too infrequent. Currently the fan-out messages only go out on a periodic basis (right now every 60 seconds), which does lead to stale data. I believe the DB is being updated on every state change, so I would suggest that we instead send a fan-out message on every state change rather than updating the DB (see the sketch after this list). If my point 2) is correct, this should incur the same overhead and so shouldn't be a problem.

4) One suggestion was to re-architect the scheduler to `remember' resources between requests. This seems like a major effort, it potentially raises coherency issues (as was pointed out), and if we do fan-outs on every state change it isn't needed.
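To make 3) concrete, here's a rough sketch of what pushing an update on every state change might look like. This is illustrative Python only, not actual nova code; `fanout_update' stands in for whatever RPC fan-out call we'd actually use, and the stat names are made up:

import time

class ComputeNodeStats(object):
    """Local resource usage that fans out an update on every change."""

    def __init__(self, host, fanout_update):
        self.host = host
        # fanout_update: placeholder for the real RPC fan-out call
        self.fanout_update = fanout_update
        self.stats = {"memory_mb_used": 0, "vcpus_used": 0}

    def consume(self, memory_mb, vcpus):
        # An instance landed on this host.
        self.stats["memory_mb_used"] += memory_mb
        self.stats["vcpus_used"] += vcpus
        self._publish()

    def release(self, memory_mb, vcpus):
        # An instance was deleted or migrated away.
        self.stats["memory_mb_used"] -= memory_mb
        self.stats["vcpus_used"] -= vcpus
        self._publish()

    def _publish(self):
        # One small message per state change -- the same per-event cost
        # as the DB write it would replace, but no 60-second staleness.
        msg = dict(self.stats)
        msg["host"] = self.host
        msg["updated_at"] = time.time()
        self.fanout_update(msg)

The point is just that each state change produces one small message, which is the same per-event cost as the DB update it would replace.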
--
Don Dugger
"Censeo Toto nos in Kansa esse decisse." - D. Gale
Ph: 303/443-3786

-----Original Message-----
From: Wang, Shane [mailto:shane.w...@intel.com]
Sent: Tuesday, June 18, 2013 6:13 AM
To: OpenStack Development Mailing List
Subject: Re: [openstack-dev] Compute node stats sent to the scheduler

Hi, I am new in this area. I have an idea but don't know whether it works.

Fanout_cast is expensive and the DB could be a burden. Can we maintain the stat data at the nodes, and, only when a scheduler needs to do any scheduling, have the scheduler proactively ask the nodes for their stats? The assumption is that scheduling doesn't happen frequently compared with the frequency of fanout_cast.

Best Regards.
--
Shane

Brian Elliott wrote on 2013-06-18:
>
> On Jun 17, 2013, at 3:50 PM, Chris Behrens <cbehr...@codestud.com> wrote:
>
>> On Jun 17, 2013, at 7:49 AM, Russell Bryant <rbry...@redhat.com> wrote:
>>
>>> On 06/16/2013 11:25 PM, Dugger, Donald D wrote:
>>>> Looking into the scheduler a bit, there's an issue of duplicated effort that is a little puzzling. The database table `compute_nodes' is being updated periodically with data about capabilities and resources used (memory, vcpus, ...) while at the same time a periodic RPC call is being made to the scheduler sending pretty much the same data.
>>>>
>>>> Does anyone know why we are updating the same data in two different places using two different mechanisms? Also, assuming we were to remove one of these updates, which one should go? (I thought at one point in time there was a goal to create a database-free compute node, which would imply we should remove the DB update.)
>>>
>>> Have you looked around to see if any code is using the data from the db?
>>>
>>> Having schedulers hit the db for the current state of all compute nodes all of the time would be a large additional db burden that I think we should avoid. So, it makes sense to keep the rpc fanout_cast of current stats to schedulers.
>>
>> This is actually what the scheduler uses. :) The fanout messages are too infrequent and can be too laggy, so the scheduler was moved to using the DB a long, long time ago. It was very inefficient at first because it looped through all instances.
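[Illustrative aside: the single-query pattern Chris describes just below -- pull the hosts and their stats together rather than looping over all instances -- might look roughly like the following. The table and column names are guesses for illustration, not the actual nova schema.]

def get_all_host_states(conn):
    """Pull every compute node plus its stats in one query.

    conn is a sqlite3-style DB-API connection; returns {host: stats}.
    """
    rows = conn.execute(
        "SELECT cn.hypervisor_hostname, cn.memory_mb, cn.vcpus, "
        "       s.key, s.value "
        "FROM compute_nodes cn "
        "LEFT JOIN compute_node_stats s ON s.compute_node_id = cn.id")
    hosts = {}
    for hostname, memory_mb, vcpus, key, value in rows:
        entry = hosts.setdefault(
            hostname, {"memory_mb": memory_mb, "vcpus": vcpus})
        if key is not None:
            # stats modeled here as key/value rows per compute node
            entry[key] = value
    return hosts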
>> So we added the things we needed into compute_node and compute_node_stats so we only had to look at the hosts. You have to pull the hosts anyway, so we pull the stats at the same time.
>>
>> The problem is, when we stopped using certain data from the fanout messages, we never removed it. We should AT LEAST do this. But... (see below).
>>
>>> The scheduler also does a fanout_cast to all compute nodes when it starts up to trigger the compute nodes to populate the cache in the scheduler. It would be nice to never fanout_cast to all compute nodes (given that there may be a *lot* of them). We could replace this with having the scheduler populate its cache from the database.
>>
>> I think we should audit the remaining things that the scheduler uses from these messages and move them to the DB. I believe it's limited to the hypervisor capabilities to compare against aggregates or some such. I believe it's things that change very rarely, so an alternative can be to only send fanout messages when capabilities change! We could always do that as a first step.
>>
>>> Removing the db usage completely would be nice if nothing is actually using it, but we'd have to look into an alternative solution for removing the scheduler fanout_cast to compute.
>>
>> Relying on anything but the DB for current memory free, etc., is just too laggy, so we need to stick with it, IMO.
>>
>> - Chris
>
> As Chris said, the reason it ended up this way, using the DB, is to quickly get up-to-date usage on hosts to the scheduler. I certainly understand the point that it's a whole lot of increased load on the DB, but the RPC data was quite stale. If there is interest in moving away from the DB updates, I think we have to either:
>
> 1) Send RPC updates to the scheduler on essentially every state change during a build.
>
> or
>
> 2) Change the scheduler architecture so there is some "memory" of resources consumed between requests. The scheduler would have to remember which hosts recent builds were assigned to. This could be a bit of a data synchronization problem if you're talking about using multiple scheduler instances.
>
> Brian
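[Illustrative aside: a skeleton of Brian's option 2 -- a scheduler that `remembers' consumed resources between requests. All names are made up; it also shows exactly where the multi-scheduler coherency problem from point 4) lives.]

class HostState(object):
    """The scheduler's in-memory view of one host's free resources."""

    def __init__(self, host, free_ram_mb, free_vcpus):
        self.host = host
        self.free_ram_mb = free_ram_mb
        self.free_vcpus = free_vcpus

    def consume(self, ram_mb, vcpus):
        self.free_ram_mb -= ram_mb
        self.free_vcpus -= vcpus

class RememberingScheduler(object):
    """Remembers resources consumed by recent builds between requests."""

    def __init__(self, host_states):
        # host_states: {host: HostState}, seeded from the DB or fan-outs
        self.host_states = host_states

    def select_host(self, ram_mb, vcpus):
        fits = [hs for hs in self.host_states.values()
                if hs.free_ram_mb >= ram_mb and hs.free_vcpus >= vcpus]
        if not fits:
            raise RuntimeError("no valid host found")
        best = max(fits, key=lambda hs: hs.free_ram_mb)
        # Claim the resources locally so the next request sees them.
        # With multiple scheduler instances, each one only sees its OWN
        # claims -- exactly the data synchronization problem above.
        best.consume(ram_mb, vcpus)
        return best.host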