Good news!

Changing the Inspector to do the reset server state calls in parallel (threads) 
did the trick, and now we are back in business ☺
The Doctor test case now goes through with any decent number of VMs on the 
failing host:
With 20VMs where 10 VMs on failing host: 340ms
(This used to be: With 20VMs where 10 VMs on failing host: 1540ms to 1780ms)
It also works with a larger number of VMs:
With 100VMs where 50 VMs on failing host: 550ms
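
For reference, a minimal sketch of the kind of parallelization now used in the 
Inspector, assuming python-novaclient and a simple thread pool (the helper name, 
pool size and error state are illustrative, not the actual Doctor code):

from concurrent.futures import ThreadPoolExecutor

def reset_servers_parallel(nova, instance_ids, max_workers=8):
    # Fire one reset-state call per VM concurrently instead of looping
    # sequentially, so the total time is roughly that of the slowest single
    # call rather than the sum of all calls.
    def _reset(uuid):
        nova.servers.reset_state(uuid, state='error')
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(_reset, instance_ids))

With the ~125 ms measured per reset earlier in this thread (1250ms for 10 
instances), ten sequential calls cost over a second, while ten parallel calls 
stay close to the single-call latency, which matches the numbers above.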

Have a nice weekend.

Br,
Tomi

From: opnfv-tech-discuss-boun...@lists.opnfv.org 
[mailto:opnfv-tech-discuss-boun...@lists.opnfv.org] On Behalf Of Juvonen, Tomi 
(Nokia - FI/Espoo)
Sent: Wednesday, October 05, 2016 10:27 AM
To: 'Ryota Mibu' <r-m...@cq.jp.nec.com>; 'Yujun Zhang' 
<zhangyujun+...@gmail.com>; 'opnfv-tech-discuss@lists.opnfv.org' 
<opnfv-tech-discuss@lists.opnfv.org>
Subject: Suspected SPAM - Re: [opnfv-tech-discuss] [Doctor] Reset Server State 
and alarms in general

Hi,

Furthermore, it seems that building the notification payload in Nova is the pain 
point, and I do not think there is much to optimize there given how notifications 
are used in general. This takes ~115ms per VM in the Nokia POD.
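(For example, with 10 VMs on the failing host that is already roughly 
10 × 115 ms ≈ 1.15 s spent just forming the notifications sequentially, which on 
its own exceeds the 1-second Doctor requirement.)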

That would leave these options:

- Can notifications / reset server state be done in parallel (a quick and dirty 
  fix, if even possible)?

- Can there be a new notification?

  - Can there be a new tenant-specific notification instead of a notification 
    for each VM? Meaning that if a tenant has 10 VMs on a failing host, there 
    would only be one tenant-specific notification covering all of their VMs. 
    The payload should not be as heavy as it is currently, since this would feed 
    the tenant-specific alarm for the tenant's VMs on the failing host. Also, 
    with far fewer notifications, there should be no cumulative time wasted 
    forming notification after notification. One could also subscribe to this 
    alarm per tenant, instead of having to subscribe to alarms per VM id as now. 
    Alarming only a single VM failure would still remain a downside of this 
    approach, as a tenant-level alarm subscription is not very convenient for 
    that.

    - This can be achieved by having the new notification sent by the force-down 
      API, and the unwanted reset server state could be removed.

    - This can be achieved by having the new notification sent by the Inspector, 
      and the unwanted reset server state could be removed (faster, and it can 
      run in parallel with the force-down call).

  - The needed state information is already handled by the force-down API as of 
    the implementation of the "get valid server state" BP in Nova, so no reset 
    server state is needed at all; the Inspector should send the needed 
    notification itself (see the sketch after this list). This would have the 
    fastest execution time, since the information does not need to flow through 
    Nova to the notifier, and it is easy to run the notification in parallel 
    with the force-down API call. This is the right thing in the long run 
    (while, strictly speaking of "Telco grade", there would be even more to 
    enhance to make things as fast as possible, e.g. subscribing to alarms 
    directly from the Inspector; but we cannot achieve everything that easily 
    and that is far in the future).
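
To make the last option concrete, here is a rough sketch of how an Inspector 
could publish one tenant-specific notification itself through oslo.messaging 
while the force-down call runs in parallel. The event type, payload fields and 
publisher id are only illustrative, nothing here is an agreed format:

from oslo_config import cfg
import oslo_messaging

def notify_tenant_vms_down(tenant_id, host, instance_ids):
    # One notification per tenant, listing all of the tenant's VMs on the
    # failed host, instead of one notification per VM.
    transport = oslo_messaging.get_notification_transport(cfg.CONF)
    notifier = oslo_messaging.Notifier(transport,
                                       publisher_id='doctor.inspector',
                                       driver='messagingv2',
                                       topics=['notifications'])
    payload = {'tenant_id': tenant_id,
               'host': host,
               'instance_ids': instance_ids,
               'state': 'error'}
    notifier.info({}, 'doctor.host.failure', payload)

The call could be run in its own thread so that it does not wait for the 
force-down API round trip.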

Br,
Tomi


From: Juvonen, Tomi (Nokia - FI/Espoo)
Sent: Tuesday, October 04, 2016 3:08 PM
To: Ryota Mibu <r-m...@cq.jp.nec.com>; Yujun Zhang <zhangyujun+...@gmail.com>; 
opnfv-tech-discuss@lists.opnfv.org
Subject: Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in 
general

Hi,

I also modified the test so that it does not do the reset server state at all, 
but instead just sends the "reset server state to error" notification for each 
instance when the force-down API is called:
for instance in instances:
    # emit the same state-change notification that reset server state would,
    # without touching the DB
    notifications.send_update_with_states(context, instance,
        instance.vm_state, vm_states.ERROR, instance.task_state,
        None, service="compute", host=host, verify_states=False)

This had the same result as going through instance.save(), which also changes 
the DB, so it did not make things any better.

Br,
Tomi

From: opnfv-tech-discuss-boun...@lists.opnfv.org 
[mailto:opnfv-tech-discuss-boun...@lists.opnfv.org] On Behalf Of Juvonen, Tomi 
(Nokia - FI/Espoo)
Sent: Tuesday, October 04, 2016 12:30 PM
To: Ryota Mibu <r-m...@cq.jp.nec.com>; Yujun Zhang <zhangyujun+...@gmail.com>; 
opnfv-tech-discuss@lists.opnfv.org
Subject: Suspected SPAM - Re: [opnfv-tech-discuss] [Doctor] Reset Server State 
and alarms in general

Hi,


1.  Tried token_cache_time=300, but that is also the default, so no difference.

2.  Then I modified the force-down API so that it internally does the reset 
server state for all the instances on the host (so it is the only API called 
from the Inspector), and there was no difference:
With 10VMs where 5 VMs on failing host: 1000ms
With 10VMs where 5 VMs on failing host: 1040ms
With 20VMs where 10 VMs on failing host: 1540ms to 1780ms


3.  Then I added debug prints around this code in the modified force-down API, 
which gets the servers and does the reset state for each.
Ran the Doctor test case:
With 20VMs where 10 VMs on failing host: 1540ms
In Nova code:
Getting instances:
instances = self.host_api.instance_get_all_by_host(context, host)
Took: 32ms
Looping over the 10 instances to set each server state to error:
for instance in instances:
    instance.vm_state = vm_states.ERROR
    instance.task_state = None
    instance.save(admin_state_reset=True)
Took: 1250ms
From the log one can also pick up the whole time the API call took:
2016-10-04 09:05:46.075 5029 INFO nova.osapi_compute.wsgi.server 
[req-368d7fa5-dad6-4805-b9ed-535bf05fff06 b175813579a14b5d9eafe759a1d3e392 
1dedc52c8caa42b8aea83b913035f5d9 - - -] 192.0.2.6 "PUT 
/v2.1/1dedc52c8caa42b8aea83b913035f5d9/os-services/force-down HTTP/1.1" status: 
200 len: 354 time: 1.4085381

So using reset server state is currently not feasible (and, as indicated before, 
it should not even be used).

Br,
Tomi

From: Ryota Mibu [mailto:r-m...@cq.jp.nec.com]
Sent: Saturday, October 01, 2016 8:54 AM
To: Yujun Zhang <zhangyujun+...@gmail.com>; Juvonen, Tomi (Nokia - FI/Espoo) 
<tomi.juvo...@nokia.com>; opnfv-tech-discuss@lists.opnfv.org
Subject: RE: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in 
general

Hi,


That’s interesting evaluation!

Yes, this should be an important issue.

I suspect that token validation in Keystone might take a major part of the 
processing time. If so, we should consider using a Keystone trust, which can 
skip the validation, or using token caches. Tomi, can you try the same 
evaluation with token caches enabled in the client (by --os-cache) and in Nova 
([keystone_authtoken] token_cache_time=300)?
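
For clarity, this is roughly what the suggested settings look like (the value 
300 is also the keystonemiddleware default, and OS_CACHE is the environment 
variable behind --os-cache; shown only as an example):

# nova.conf on the API node: let keystonemiddleware cache validated tokens
[keystone_authtoken]
token_cache_time = 300

# client side: reuse the cached token instead of re-authenticating per request
export OS_CACHE=1        # same effect as passing --os-cache to the nova CLI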

Or maybe we can check how many HTTP messages and DB queries happen per VM reset?


Thanks,
Ryota

From: opnfv-tech-discuss-boun...@lists.opnfv.org 
[mailto:opnfv-tech-discuss-boun...@lists.opnfv.org] On Behalf Of Yujun Zhang
Sent: Friday, September 30, 2016 5:53 PM
To: Juvonen, Tomi (Nokia - FI/Espoo) <tomi.juvo...@nokia.com>; 
opnfv-tech-discuss@lists.opnfv.org
Subject: Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in 
general

It is almost linear in the number of VMs, since the requests are sent one by 
one, and I think we should raise the priority of this issue.

But I wonder how it will perform if the requests are sent simultaneously with 
async calls. How will Nova deal with that?
On Fri, Sep 30, 2016 at 4:25 PM Juvonen, Tomi (Nokia - FI/Espoo) 
<tomi.juvo...@nokia.com> wrote:
Hi,
I ran the Doctor test case in the Nokia POD with the APEX installer and 
state-of-the-art Airframe HW. I modified the Doctor test case so that I can run 
several VMs and the consumer can receive alarms for them. I am measuring whether 
it is possible to stay within the Doctor requirement of under 1 second from 
recognizing the fault to the consumer having the alarm. This way I can see how 
much overhead comes when there are more VMs on the failing host (the overhead 
comes from calling the reset server state API for each VM on the failing host; a 
rough sketch of this flow follows below).
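
As a rough sketch, this is the sequence the Inspector drives per failing host, 
assuming python-novaclient with a microversion that supports force-down (e.g. 
2.11); the helper name and the ERROR state are illustrative:

def handle_host_failure(nova, host):
    # 1. mark the compute service as forced down
    nova.services.force_down(host, 'nova-compute', True)
    # 2. reset every VM on that host to ERROR, one blocking API call per VM,
    #    so the time spent here grows linearly with the VM count
    servers = nova.servers.list(search_opts={'host': host, 'all_tenants': True})
    for server in servers:
        nova.servers.reset_state(server.id, state='error')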

Here is how many milliseconds it took to get the scenario through:
With 1 VM on failing host: 180ms
With 10VMs where 5 VMs on failing host: 800ms to 1040ms
With 20VMs where 12 VMs on failing host: 2410ms
With 20VMs where 13 VMs on failing host: 2010ms
With 20VMs where 11 VMs on failing host: 2380ms
With 50VMs where 27 VMs on failing host: 5060ms
With 100VMs where 49 VMs on failing host: 8180ms

Conclusion: in an ideal environment one can run 5 VMs on a host and still 
fulfill the Doctor requirement, so this needs to be enhanced.
_______________________________________________
opnfv-tech-discuss mailing list
opnfv-tech-discuss@lists.opnfv.org
https://lists.opnfv.org/mailman/listinfo/opnfv-tech-discuss
