Hi Andrei,

You indeed need to build CloudStack for this to work. You can create packages with the ./packaging/package.sh script in the source tree.
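For completeness, a rough sketch of what that build looks like on a CentOS build host; the branch name and the -d/--distribution value below are only examples, so check the usage text at the top of packaging/package.sh in your checkout for the exact options:

    # Grab the source and switch to the branch the PR was made against
    git clone https://github.com/apache/cloudstack.git
    cd cloudstack
    git checkout 4.7

    # Build the RPMs; adjust -d to your distribution (e.g. centos63 or centos7).
    # The packages typically end up under dist/rpmbuild/RPMS/.
    ./packaging/package.sh -d centos7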
The PR is against 4.7, and when you create RPMs those will be 4.7.1-SNAPSHOT. I do run this in production and it resolved the issue. Let me know if it works for you too.

Regards,
Remi

On 05/01/16 10:07, "Andrei Mikhailovsky" <and...@arhont.com> wrote:

>Hi Remi,
>
>I've not tried the patch. I've missed it. Do I need to rebuild ACS to
>apply the patch, or would making changes to the two files suffice?
>
>Thanks
>
>Andrei
>----- Original Message -----
>> From: "Remi Bergsma" <rberg...@schubergphilis.com>
>> To: "dev" <dev@cloudstack.apache.org>
>> Sent: Tuesday, 5 January, 2016 05:49:05
>> Subject: Re: upgrading 4.5.2 -> 4.6.0 virtualrouter upgrade timeout
>
>> Hi Andrei,
>>
>> Did you try it in combination with the patch I created (PR 1291)? You need
>> both changes.
>>
>> Regards, Remi
>>
>> Sent from my iPhone
>>
>>> On 04 Jan 2016, at 22:17, Andrei Mikhailovsky <and...@arhont.com> wrote:
>>>
>>> Hi Remi,
>>>
>>> Thanks for your reply. However, your suggestion of increasing
>>> router.aggregation.command.each.timeout didn't help. I've tried setting the
>>> value to 120, to no avail. It still fails with the same error.
>>>
>>> Andrei
>>>
>>> ----- Original Message -----
>>>> From: "Remi Bergsma" <rberg...@schubergphilis.com>
>>>> To: "dev" <dev@cloudstack.apache.org>
>>>> Sent: Monday, 4 January, 2016 10:44:43
>>>> Subject: Re: upgrading 4.5.2 -> 4.6.0 virtualrouter upgrade timeout
>>>
>>>> Hi Andrei,
>>>>
>>>> Missed that mail, sorry. I created a PR that allows for longer timeouts [1].
>>>>
>>>> Also, you can bump the router.aggregation.command.each.timeout global
>>>> setting to, say, 15-30 so the router is allowed to boot.
>>>>
>>>> Next, we need to find out why it takes so long in the first place. In our
>>>> environment it at least starts now.
>>>>
>>>> Regards,
>>>> Remi
>>>>
>>>> [1] https://github.com/apache/cloudstack/pull/1291
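(For reference: that global setting can be changed in the UI under Global Settings, or with CloudMonkey roughly as below; the value 30 is simply the upper end of the 15-30 range suggested above, and some global settings only take effect after a management-server restart.)

    # Raise the per-command timeout used while the VR applies its config, then verify it
    cloudmonkey update configuration name=router.aggregation.command.each.timeout value=30
    cloudmonkey list configurations name=router.aggregation.command.each.timeout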
>>>>
>>>>> On 04/01/16 11:41, "Andrei Mikhailovsky" <and...@arhont.com> wrote:
>>>>>
>>>>> Hello guys,
>>>>>
>>>>> Tried the users mailing list without any luck. Perhaps the dev guys know if
>>>>> this issue is being looked at for the next release?
>>>>>
>>>>> I've just upgraded to 4.6.2 and have similar issues with three virtual routers
>>>>> out of 22 in total. They are all failing in exactly the same way as described
>>>>> here.
>>>>>
>>>>> Has anyone found a permanent workaround for this issue?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Andrei
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Stephan Seitz" <s.se...@secretresearchfacility.com>
>>>>>> To: "users" <us...@cloudstack.apache.org>
>>>>>> Sent: Monday, 30 November, 2015 19:53:57
>>>>>> Subject: Re: upgrading 4.5.2 -> 4.6.0 virtualrouter upgrade timeout
>>>>>
>>>>>> Does anybody else experience problems due to (very) slow deployment of VRs?
>>>>>>
>>>>>>
>>>>>> On Tuesday, 24.11.2015, 16:31 +0100, Stephan Seitz wrote:
>>>>>>> Update / FYI:
>>>>>>> After faking the particular VR in SQL, I tried to restart that network,
>>>>>>> and it always fails. To me it looks like update_config.py - which takes
>>>>>>> almost all CPU resources - runs way longer than any watchdog will accept.
>>>>>>>
>>>>>>> I'm able to mitigate that with some very nasty workarounds:
>>>>>>> a) start the router
>>>>>>> b) wait until it's provisioned
>>>>>>> c) restart cloudstack-management
>>>>>>> d) update vm_instance
>>>>>>>    set state='Running',
>>>>>>>    power_state='PowerOn' where name = 'r-XXX-VM';
>>>>>>> e) once: update domain_router
>>>>>>>    set template_version="Cloudstack Release 4.6.0 Wed Nov 4 08:22:47 UTC 2015",
>>>>>>>    scripts_version="546c9e7ac38e0aa16ecc498899dac8e2"
>>>>>>>    where id=XXX;
>>>>>>> f) wait until update_config.py finishes (for me that's about 15 minutes)
>>>>>>>
>>>>>>> Since I expect the need for VR restarts in the future, this behaviour is
>>>>>>> somewhat unsatisfying. It needs a lot of error-prone intervention.
>>>>>>>
>>>>>>> I'm quite unsure whether this was introduced with the update, or whether
>>>>>>> the particular VR simply had not been restarted after getting configured
>>>>>>> with lots of IPs and rules.
>>>>>>>
>>>>>>>
>>>>>>> On Tuesday, 24.11.2015, 12:29 +0100, Stephan Seitz wrote:
>>>>>>>> Hi List!
>>>>>>>>
>>>>>>>> After upgrading from 4.5.2 to 4.6.0 I faced a problem with one virtual
>>>>>>>> router. This particular VR has about 10 IPs with LB and FW rules
>>>>>>>> defined. During the upgrade process, after about 4-5 minutes, a
>>>>>>>> watchdog kicks in and kills the respective VR due to no response.
>>>>>>>>
>>>>>>>> So far I didn't find any timeout value in the global settings.
>>>>>>>> Temporarily setting network.router.EnableServiceMonitoring to false
>>>>>>>> doesn't change the behaviour.
>>>>>>>>
>>>>>>>> Any help on how to mitigate that nasty timeout would be really
>>>>>>>> appreciated :)
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>>
>>>>>>>> Stephan
>>>>>>>>
>>>>>>>> From within the VR, the logs show:
>>>>>>>>
>>>>>>>> 2015-11-24 11:24:33,807 CsFile.py search:123 Searching for dhcp-range=interface:eth0,set:interface and replacing with dhcp-range=interface:eth0,set:interface-eth0,10.10.22.1,static
>>>>>>>> 2015-11-24 11:24:33,808 merge.py load:56 Creating data bag type guestnetwork
>>>>>>>> 2015-11-24 11:24:33,808 CsFile.py search:123 Searching for dhcp-option=tag:interface-eth0,15 and replacing with dhcp-option=tag:interface-eth0,15,heinlein.cloudservice
>>>>>>>> 2015-11-24 11:24:33,808 CsFile.py search:123 Searching for dhcp-option=tag:interface-eth0,6 and replacing with dhcp-option=tag:interface-eth0,6,10.10.22.1,195.10.208.2,91.198.250.2
>>>>>>>> 2015-11-24 11:24:33,809 CsFile.py search:123 Searching for dhcp-option=tag:interface-eth0,3, and replacing with dhcp-option=tag:interface-eth0,3,10.10.22.1
>>>>>>>> 2015-11-24 11:24:33,809 CsFile.py search:123 Searching for dhcp-option=tag:interface-eth0,1, and replacing with dhcp-option=tag:interface-eth0,1,255.255.255.0
>>>>>>>> 2015-11-24 11:24:33,810 CsHelper.py execute:160 Executing: service dnsmasq restart
>>>>>>>>
>>>>>>>> ==> /var/log/messages <==
>>>>>>>> Nov 24 11:24:34 r-504-VM shutdown[6752]: shutting down for system halt
>>>>>>>>
>>>>>>>> Broadcast message from root@r-504-VM (Tue Nov 24 11:24:34 2015):
>>>>>>>>
>>>>>>>> The system is going down for system halt NOW!
>>>>>>>> Nov 24 11:24:35 r-504-VM KVP: KVP starting; pid is:6844
>>>>>>>>
>>>>>>>> ==> /var/log/cloud.log <==
>>>>>>>> /opt/cloud/bin/vr_cfg.sh: line 60:  6603 Killed    /opt/cloud/bin/update_config.py vm_dhcp_entry.json
>>>>>>>>
>>>>>>>> ==> /var/log/messages <==
>>>>>>>> Nov 24 11:24:35 r-504-VM cloud: VR config: executing failed: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
>>>>>>>>
>>>>>>>> ==> /var/log/cloud.log <==
>>>>>>>> Tue Nov 24 11:24:35 UTC 2015 : VR config: executing failed: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
>>>>>>>> Connection to 169.254.2.192 closed by remote host.
>>>>>>>> Connection to 169.254.2.192 closed.
>>>>>>>>
>>>>>>>>
>>>>>>>> The management-server.log shows:
>>>>>>>>
>>>>>>>> 2015-11-24 12:24:43,015 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (Work-Job-Executor-1:ctx-ad9e4658 job-5163/job-5164) Done executing com.cloud.vm.VmWorkStart for job-5164
>>>>>>>> 2015-11-24 12:24:43,017 INFO  [o.a.c.f.j.i.AsyncJobMonitor] (Work-Job-Executor-1:ctx-ad9e4658 job-5163/job-5164) Remove job-5164 from job monitoring
>>>>>>>> 2015-11-24 12:24:43,114 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-1:ctx-760da779 job-5163) Unexpected exception while executing org.apache.cloudstack.api.command.admin.router.StartRouterCmd
>>>>>>>> com.cloud.exception.AgentUnavailableException: Resource [Host:1] is unreachable: Host 1: Unable to start instance due to Unable to start VM[DomainRouter|r-504-VM] due to error in finalizeStart, not retrying
>>>>>>>>     at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:1121)
>>>>>>>>     at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:4580)
>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>>>     at com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:107)
>>>>>>>>     at com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachineManagerImpl.java:4736)
>>>>>>>>     at com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:102)
>>>>>>>>     at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:537)
>>>>>>>>     at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>>>>>>>>     at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>>>>>>>>     at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>>>>>>>>     at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>>>>>>>>     at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>>>>>>>>     at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:494)
>>>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>>> Caused by: com.cloud.utils.exception.ExecutionException: Unable to start VM[DomainRouter|r-504-VM] due to error in finalizeStart, not retrying
>>>>>>>>     at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:1085)
>>>>>>>>     at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:4580)
>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>     ... 18 more
>>>>>>>> 2015-11-24 12:24:43,115 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-1:ctx-760da779 job-5163) Complete async job-5163, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Resource [Host:1] is unreachable: Host 1: Unable to start instance due to Unable to start VM[DomainRouter|r-504-VM] due to error in finalizeStart, not retrying"}