[ https://issues.apache.org/jira/browse/CLOUDSTACK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742042#comment-13742042 ]
venkata swamybabu budumuru edited comment on CLOUDSTACK-4199 at 8/18/13 3:51 AM:
---------------------------------------------------------------------------------

I have also seen this issue every time during failover. The steps to reproduce are below:

1. 1 advanced zone with a KVM cluster (2 KVM hosts)

2. Create an offering with RVR enabled.

    id: 15
    name: RVR
    uuid: 4e91c49f-5870-43e1-9865-0a84cd7b72ae
    unique_name: RVR
    display_text: RVR
    nw_rate: NULL
    mc_rate: 10
    traffic_type: Guest
    tags: NULL
    system_only: 0
    specify_vlan: 0
    service_offering_id: NULL
    conserve_mode: 1
    created: 2013-08-16 05:05:34
    removed: NULL
    default: 0
    availability: Optional
    dedicated_lb_service: 1
    shared_source_nat_service: 0
    sort_key: 0
    redundant_router_service: 1 =========> RVR is enabled
    state: Enabled
    guest_type: Isolated
    elastic_ip_service: 0
    eip_associate_public_ip: 0
    elastic_lb_service: 0
    specify_ip_ranges: 0
    inline: 0
    is_persistent: 1 =====> Persistent is enabled.
    internal_lb: 0
    public_lb: 1
    egress_default_policy: 1
    concurrent_connections: NULL
    15 rows in set (0.00 sec)

3. As a non-ROOT domain user, try to deploy a VM using the above network offering.

    non-ROOT domain user info :
    username : dom1User1
    password : password
    domain : dom1

    id: 220
    name: swamyRVRNetwork
    uuid: 215f3f85-dca2-45e4-9cab-607654677575
    display_text: swamyRVRNetwork
    traffic_type: Guest
    broadcast_domain_type: Vlan
    broadcast_uri: vlan://908
    gateway: 10.1.1.1
    cidr: 10.1.1.0/24
    mode: Dhcp
    network_offering_id: 15
    physical_network_id: 200
    data_center_id: 1
    guru_name: ExternalGuestNetworkGuru
    state: Implemented
    related: 220
    domain_id: 2
    account_id: 3
    dns1: NULL
    dns2: NULL
    guru_data: NULL
    set_fields: 0
    acl_type: Account
    network_domain: cs3auto.advanced
    reservation_id: c81b7838-db46-4d54-a5ed-4f6261802fb6
    guest_type: Isolated
    restart_required: 0
    created: 2013-08-16 07:30:48
    removed: NULL
    specify_ip_ranges: 0
    vpc_id: NULL
    ip6_gateway: NULL
    ip6_cidr: NULL
    network_cidr: NULL
    display_network: 1
    network_acl_id: NULL

    id: 48
    name: VM1Swamy
    uuid: 6bfe2221-74b7-4de6-9b46-ae2f5ea1a661
    instance_name: i-3-48-QA
    state: Running
    vm_template_id: 202
    guest_os_id: 112
    private_mac_address: 02:00:68:99:00:03
    private_ip_address: 10.1.1.23
    pod_id: 1
    data_center_id: 1
    host_id: 2
    last_host_id: 2
    proxy_id: NULL
    proxy_assign_time: NULL
    vnc_password: WFdUuz6e2W97XHGv7YnHc/8b0BH/HqK3eWpX3zxP97U=
    ha_enabled: 0
    limit_cpu_use: 0
    update_count: 3
    update_time: 2013-08-16 07:35:17
    created: 2013-08-16 07:33:25
    removed: NULL
    type: User
    vm_type: User
    account_id: 3
    domain_id: 2
    service_offering_id: 2
    reservation_id: 3baf28f3-745b-4dad-8fe9-8bab92bec033
    hypervisor_type: KVM
    disk_offering_id: NULL
    cpu: NULL
    ram: NULL
    owner: 3
    speed: 1000
    host_name: VM1Swamy
    display_name: VM1Swamy
    desired_state: NULL
    dynamically_scalable: 0
    display_vm: 1

4. The above steps deployed the RVR routers without any issues.

    id: 46
    name: r-46-QA =====================================> This became MASTER
    uuid: d044fae3-316e-4546-b832-ab9e12b074a3
    instance_name: r-46-QA
    state: Stopped
    vm_template_id: 3
    guest_os_id: 15
    private_mac_address: 0e:00:a9:fe:01:69
    private_ip_address: 169.254.1.105
    pod_id: 1
    data_center_id: 1
    host_id: NULL
    last_host_id: 3
    proxy_id: NULL
    proxy_assign_time: NULL
    vnc_password: eMTnIdbchG5GWMGzs5awGTGs4M7LuYjmLBlmCMMBLSw=
    ha_enabled: 0
    limit_cpu_use: 0
    update_count: 5
    update_time: 2013-08-16 07:41:43
    created: 2013-08-16 07:30:48
    removed: NULL
    type: DomainRouter
    vm_type: DomainRouter
    account_id: 3
    domain_id: 2
    service_offering_id: 7
    reservation_id: c70dbe54-8f26-40c0-a111-720b77d4a2c1
    hypervisor_type: KVM
    disk_offering_id: NULL
    cpu: NULL
    ram: NULL
    owner: NULL
    speed: NULL
    host_name: NULL
    display_name: NULL
    desired_state: NULL
    dynamically_scalable: 0
    display_vm: 1

    id: 47
    name: r-47-QA =====================================> This became BACKUP
    uuid: 49080daa-3a00-4967-94cf-594b42375e6e
    instance_name: r-47-QA
    state: Running
    vm_template_id: 3
    guest_os_id: 15
    private_mac_address: 0e:00:a9:fe:03:8d
    private_ip_address: 169.254.3.141
    pod_id: 1
    data_center_id: 1
    host_id: 2
    last_host_id: 2
    proxy_id: NULL
    proxy_assign_time: NULL
    vnc_password: UZ483zh1Nq/Ydq2mg1/v4I7mRqaSShk6vd6tWx84rQI=
    ha_enabled: 0
    limit_cpu_use: 0
    update_count: 7
    update_time: 2013-08-16 07:33:24
    created: 2013-08-16 07:30:49
    removed: NULL
    type: DomainRouter
    vm_type: DomainRouter
    account_id: 3
    domain_id: 2
    service_offering_id: 7
    reservation_id: 4ad79ebb-7c77-43b4-add2-fd3669d94d2f
    hypervisor_type: KVM
    disk_offering_id: NULL
    cpu: NULL
    ram: NULL
    owner: NULL
    speed: NULL
    host_name: NULL
    display_name: NULL
    desired_state: NULL
    dynamically_scalable: 0
    display_vm: 1

5. Stop the MASTER VR from CloudStack.

Observations:

i. The MASTER router went into the Stopped state successfully, but the BACKUP router is stuck in the "FAULT" state forever.
Here is the snippet of keepalived.log for the FAULT router:

    root@r-47-QA:~# cat /ramdisk/rrouter/keepalived.log
    To backup called
    Disable public ip 0
    Password server is not running
    Stopping DNS forwarder and DHCP server: dnsmasq(not running) ... (warning).
    cache internal:
    current active connections: 0
    connections created: 0 failed: 0
    connections updated: 0 failed: 0
    connections destroyed: 0 failed: 0
    cache external:
    current active connections: 0
    connections created: 0 failed: 0
    connections updated: 0 failed: 0
    connections destroyed: 0 failed: 0
    traffic processed:
    0 Bytes 0 Pckts
    multicast traffic (active device=eth0):
    8 Bytes sent 0 Bytes recv
    1 Pckts sent 0 Pckts recv
    0 Error send 0 Error recv
    message tracking:
    0 Malformed msgs 0 Lost msgs
    Conntrackd switch to backup done
    Switch conntrackd mode backup 0
    Status: BACKUP
    To master called
    ifdown: interface eth2 not configured
    RTNETLINK answers: File exists
    Failed to bring up eth2.
    RTNETLINK answers: No such process
    Enable public ip returned 2
    Fail to enable public ip!
    Password server is not running
    Stopping DNS forwarder and DHCP server: dnsmasq(not running) ... (warning).
    Stopping keepalived: keepalived.
    Stopping conntrackd.
    Status: FAULT (RTNETLINK answers: No such process)

Attaching the following logs to the bug along with mgmt server db dump.

a. mgmt server log
b. db dump
c. MASTER (before reboot logs)
   * ifconfig output
   * ifconfig -a output
   * /ramdisk/rrouter/keepalived.log
   * checkrouter.sh output
d. BACKUP (before and after failover)
   * ifconfig output
   * ifconfig -a output
   * /ramdisk/rrouter/keepalived.log
   * checkrouter.sh output

> Redundant Virtual Router - no failover occur
> --------------------------------------------
>
>                 Key: CLOUDSTACK-4199
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4199
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>          Components: Management Server
>    Affects Versions: 4.2.0
>         Environment: MS ACS 4.2 campo internal build 341
>                      host XS 6.2
>            Reporter: angeline shen
>            Priority: Critical
>             Fix For: 4.2.0
>
>         Attachments: FAULT_logs.tgz, logs.tgz, management-server.log.gz, MASTER_logs.tgz, Screenshot-CloudPlatform™ - Mozilla Firefox-3.png, Screenshot-CloudPlatform™ - Mozilla Firefox-4.png
>
>
> 1. create network offering 'egallowrvrnw1' with egress firewall policy : allow , redundant router
>    advance zone. create network of this offering. create guest VMs
>    Verify ssh to VMs. VMs can ping other VMs in this network & reach internet
> 2. RVR MASTER r-37-VM
>    RVR BACKUP r-38-VM
>    stop r-37-VM
>    Result: r-37-VM state becomes UNKNOWN
>            r-38-VM state becomes FAULT
>            no failover occur
>            Cannot ssh to existing VMs
> 3. start r-37-VM.
>    Result: r-37-VM state becomes MASTER
>            r-38-VM state remains FAULT
>            VMs can reach other VMs in same network.
>            VMs cannot reach internet
> 4. stop r-37-VM
>    r-37-VM state becomes UNKNOWN
>    r-38-VM state becomes FAULT
>    no failover occur
>    Cannot ssh to existing VMs
>
> r.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:null) Found 1 networks to update RvR status.
> 2013-08-08 19:22:44,763 INFO [network.router.VirtualNetworkApplianceManagerImpl] (RedundantRouterStatusMonitor-6:null) Redundant virtual router (name: r-37-VM, id: 37) just switch from MASTER to UNKNOWN
> 2013-08-08 19:22:44,768 DEBUG [agent.transport.Request] (RedundantRouterStatusMonitor-6:null) Seq 1-2062888873: Sending { Cmd , MgmtId: 7343890761426, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckRouterCommand":{"accessDetails":{"router.ip":"169.254.3.245","router.name":"r-38-VM"},"wait":30}}] }
> 2013-08-08 19:22:44,769 DEBUG [agent.transport.Request] (RedundantRouterStatusMonitor-6:null) Seq 1-2062888873: Executing: { Cmd , MgmtId: 7343890761426, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckRouterCommand":
> 2013-08-08 19:22:45,116 INFO [network.router.VirtualNetworkApplianceManagerImpl] (RedundantRouterStatusMonitor-6:null) Redundant virtual router (name: r-38-VM, id: 38) just switch from BACKUP to FAULT
> 2013-08-08 19:22:45,344 DEBUG [agent.manager.DirectAgentAttache] (DirectAgent-270:null) Seq 1-2062888874: Response Received:
> 2013-08-08 19:22:45,345 DEBUG [agent.transport.Request] (DirectAgent-270:null) Seq 1-2062888874: Processing: { Ans: , MgmtId: 7343890761426, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"FAULT","isBumped":false,"result":true,"details":"Status: FAULT (RTNETLINK answers: No such process)&Bumped: NO","wait":0}}] }
> 2013-08-08 19:22:45,345 DEBUG [agent.transport.Request] (RedundantRouterStatusMonitor-6:null) Seq 1-2062888874: Received: { Ans: , MgmtId: 7343890761426, via: 1, Ver: v1, Flags: 10, { CheckRouterAnswer } }
> 2013-08-08 19:22:45,345 DEBUG [agent.manager.AgentManagerImpl] (RedundantRouterStatusMonitor-6:null) Details from executing class com.cloud.agent.api.CheckRouterCommand: Status: FAULT (RTNETLINK answers: No such process)&Bumped: NO
> 2013-08-08 19:22:45,349 INFO [network.router.VirtualNetworkApplianceManagerImpl] (RedundantRouterStatusMonitor-6:null) Redundant virtual router (name: r-38-VM, id: 38) just switch from BACKUP to FAULT
> 2013-08-08 19:22:46,781 DEBUG [agent.manager.AgentManagerImpl] (AgentManager-Handler-13:null) Ping from 2
> 2013-08-08 19:22:47,125 DEBUG [agent.manager.AgentManagerImpl] (AgentManager-Handler-12:null) Ping from 3
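
For anyone triaging the attached logs, here is a minimal sketch that pulls the router state out of the two text formats quoted in this thread: the `details` string of a CheckRouterAnswer ("Status: FAULT (RTNETLINK answers: No such process)&Bumped: NO") and the "Status: ..." lines in keepalived.log. It is not CloudStack code. The names `RouterCheck`, `parse_details` and `status_history` are hypothetical, and treating "YES" as the affirmative value of the Bumped field is an assumption, since only "NO" appears in the logs above.

```python
import re
from typing import List, NamedTuple, Optional


class RouterCheck(NamedTuple):
    """Hypothetical container for one parsed router status report."""
    state: str                # e.g. "MASTER", "BACKUP", "FAULT"
    reason: Optional[str]     # text inside the parentheses, if any
    bumped: Optional[bool]    # None when no "Bumped:" field is present


# Matches "Status: FAULT (some reason)" as well as a bare "Status: BACKUP".
_STATUS_RE = re.compile(r"Status:\s*(\w+)(?:\s*\((.*?)\))?")


def parse_details(details: str) -> RouterCheck:
    """Parse a details string shaped like the one in the logs above."""
    status_part, _, bumped_part = details.partition("&")
    match = _STATUS_RE.search(status_part)
    if match is None:
        raise ValueError("no 'Status:' field in %r" % details)
    bumped = None
    if bumped_part.startswith("Bumped:"):
        # Assumption: the affirmative value is "YES"; the logs only show "NO".
        bumped = bumped_part.split(":", 1)[1].strip().upper() == "YES"
    return RouterCheck(match.group(1), match.group(2), bumped)


def status_history(log_text: str) -> List[str]:
    """Return every state named by a 'Status:' entry, in order of appearance."""
    return [m.group(1) for m in _STATUS_RE.finditer(log_text)]


if __name__ == "__main__":
    print(parse_details(
        "Status: FAULT (RTNETLINK answers: No such process)&Bumped: NO"))
    print(status_history(
        "Status: BACKUP\nTo master called\n"
        "Status: FAULT (RTNETLINK answers: No such process)\n"))
```

Run against the BACKUP router's keepalived.log, `status_history` makes the BACKUP to FAULT transition visible without reading the conntrackd statistics in between.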