[ https://issues.apache.org/jira/browse/CLOUDSTACK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Francois Nadeau updated CLOUDSTACK-10397:
----------------------------------------------

Description:

Under CentOS 7.x with KVM and NFS as primary storage, we expect to tolerate and recover from a temporary disconnection from primary storage. We simulate this from the KVM host with iptables, using DROP rules in the INPUT and OUTPUT chains for the NFS server's IP (see the sketch below).

The observation under 4.11.2 is that an NFS disconnection of more than 5 minutes leads to the following:

With VM HA enabled and host HA disabled: the CloudStack agent will often block while refreshing primary storage and go into the Down state from the controller's perspective. The controller will then restart the VMs on other hosts, creating duplicate VMs on the network and possibly corrupting VM root disks if the transient issue clears while the first KVM host is still active.

With VM HA enabled and host HA enabled: the same agent issue can cause it to block, ending in either the Disconnected or Down state. The host HA framework will then reset the KVM hosts after kvm.ha.degraded.max.period. The problem here is that, while host HA does ensure we don't get duplicate VMs, at scale it would also provoke a large number of KVM host resets (if not all of them).

On 4.9.3 the CloudStack agent will simply "hang" in there and the controller will not see the KVM host as down (at least for 60 minutes). When the network issue blocking NFS access is resolved, all KVM hosts and VMs just resume working, with no large-scale fencing.

The same resilience is expected on 4.11.x. This is a blocker for an upgrade from 4.9, considering we are more at risk on 4.11 with VM HA enabled, regardless of whether host HA is enabled.
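For reproduction, here is a minimal sketch of that simulation as a Python wrapper around the iptables commands the report describes. It assumes root on the KVM host with iptables available; NFS_SERVER_IP and OUTAGE_SECONDS are placeholders to adjust for the environment, and the helper names are illustrative, not part of the original report:

{code:python}
#!/usr/bin/env python3
"""Simulate a transient NFS primary-storage outage on a KVM host.

Assumptions: run as root, iptables present, NFS_SERVER_IP is the
address of the NFS primary storage server (placeholder below).
"""
import subprocess
import time

NFS_SERVER_IP = "10.0.0.50"  # hypothetical NFS server address; adjust
OUTAGE_SECONDS = 6 * 60      # > 5 minutes, enough to trigger the behaviour

def iptables(*args):
    """Run a single iptables command, raising on failure."""
    subprocess.run(["iptables", *args], check=True)

def block_nfs():
    # Drop all traffic to/from the NFS server, as described in the report.
    iptables("-I", "INPUT", "-s", NFS_SERVER_IP, "-j", "DROP")
    iptables("-I", "OUTPUT", "-d", NFS_SERVER_IP, "-j", "DROP")

def unblock_nfs():
    # Remove the DROP rules to end the simulated outage.
    iptables("-D", "INPUT", "-s", NFS_SERVER_IP, "-j", "DROP")
    iptables("-D", "OUTPUT", "-d", NFS_SERVER_IP, "-j", "DROP")

if __name__ == "__main__":
    block_nfs()
    try:
        print(f"NFS access to {NFS_SERVER_IP} blocked for {OUTAGE_SECONDS}s; "
              "watch the host/agent state on the management server.")
        time.sleep(OUTAGE_SECONDS)
    finally:
        unblock_nfs()  # restore connectivity even if interrupted
        print("NFS access restored.")
{code}

Keeping the block in place longer than kvm.ha.degraded.max.period (with host HA enabled), or long enough for the agent's storage refresh to block (host HA disabled), should reproduce the Down/Disconnected transitions described above.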
> Transient NFS access issues should not result in duplicate VMs or KVM host
> resets
> ----------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-10397
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10397
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.)
>          Components: cloudstack-agent, Hypervisor Controller
>    Affects Versions: 4.11.1.1
>            Reporter: Jean-Francois Nadeau
>            Priority: Blocker

--
This message was sent by Atlassian JIRA (v7.6.3#76005)