Hello,

That bug seems to assume tunnelled migrations. The live_migration_flag option was removed in
later releases, but it is still there in Mitaka.

Do you have the VIR_MIGRATE_TUNNELLED flag set for [libvirt]live_migration_flag 
in nova.conf?


It might be a long shot, but I've removed VIR_MIGRATE_TUNNELLED in our clouds.
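Roughly what that looks like (a minimal sketch; the commented-out line is the old
Mitaka-era default flag list, so check what your nova.conf actually has before copying it):

  # /etc/nova/nova.conf on the compute nodes
  [libvirt]
  # old value, with tunnelling through libvirtd:
  #live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_TUNNELLED
  # what we run, with VIR_MIGRATE_TUNNELLED dropped:
  live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE

nova-compute needs a restart after changing it.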

Best regards

On 11/26/2017 01:01 PM, Sean Redmond wrote:
Hi,

I think it may be related to this:

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389

Thanks

On Thu, Nov 23, 2017 at 6:20 PM, Joe Topjian <j...@topjian.net> wrote:
OK, thanks. We'll definitely look at downgrading in a test environment.

To add some further info to this problem, here are some log entries. When an 
instance fails to snapshot or fails to migrate, we see:

libvirtd[27939]: Cannot start job (modify, none) for domain instance-00004fe4; 
current job is (modify, none) owned by (27942 
remoteDispatchDomainBlockJobAbort, 0 <null>) for (69116s, 0s)

libvirtd[27939]: Cannot start job (none, migration out) for domain 
instance-00004fe4; current job is (modify, none) owned by (27942 
remoteDispatchDomainBlockJobAbort, 0 <null>) for (69361s, 0s)


The one piece of this that I'm currently fixated on is the length of time it 
takes libvirt to start. I'm not sure if it's causing the above, though. When 
starting libvirt through systemd, it takes much longer to process the iptables 
and ebtables rules than if we start libvirtd on the command-line directly.

virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L 
libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L 
libvirt-P-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F 
libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X 
libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F 
libvirt-P-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X 
libvirt-P-vnet5'

We're talking about a difference between 5 minutes and 5 seconds depending on how 
libvirt was started. This doesn't seem normal to me.

In general, is anyone aware of systemd imposing restrictions of some kind on 
processes that spawn subprocesses? Or something like that? I've tried comparing 
cgroups and the various systemd limits between my shell session and the 
libvirt-bin.service session and can't find anything obviously different. Maybe 
it's AppArmor?
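For what it's worth, this is roughly how I've been comparing them (a sketch; the unit
name is libvirt-bin.service on our 16.04 nodes, and <manual-libvirtd-pid> is a
placeholder for a libvirtd started from my shell):

  # service-started libvirtd: slice, task limit, resource limits, cgroups
  systemctl show libvirt-bin.service -p Slice -p TasksMax -p LimitNOFILE -p LimitNPROC
  cat /proc/$(pidof libvirtd)/cgroup
  cat /proc/$(pidof libvirtd)/limits

  # same checks against a libvirtd started manually (user slice)
  cat /proc/<manual-libvirtd-pid>/cgroup
  cat /proc/<manual-libvirtd-pid>/limits

  # and whether AppArmor has a libvirt profile loaded in enforce mode
  aa-status | grep -i libvirt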

Thanks,
Joe

On Thu, Nov 23, 2017 at 11:03 AM, Chris Sarginson <csarg...@gmail.com> wrote:
I think we may have pinned libvirt-bin as well (1.3.1), but I can't guarantee 
that, sorry. I would suggest it's worth pinning both initially.

Chris

On Thu, 23 Nov 2017 at 17:42 Joe Topjian <j...@topjian.net> wrote:
Hi Chris,

Thanks - we will definitely look into this. To confirm: did you also downgrade 
libvirt as well or was it all qemu?

Thanks,
Joe

On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <csarg...@gmail.com> wrote:
I suspect we hit the same issue a while back, which we seemed to resolve by 
pinning QEMU and related packages to the following version (you might need to 
hunt down the debs manually):

1:2.5+dfsg-5ubuntu10.5

I'm certain there's a Launchpad bug for Ubuntu's qemu regarding this, but I don't 
have it to hand.
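Holding things at that version looked something like this for us (from memory and
untested; the exact package set on your nodes may differ, and an /etc/apt/preferences
pin works just as well):

  # install the known-good versions, then hold them so they don't upgrade
  apt-get install qemu-system-x86=1:2.5+dfsg-5ubuntu10.5 \
                  qemu-utils=1:2.5+dfsg-5ubuntu10.5 \
                  qemu-block-extra=1:2.5+dfsg-5ubuntu10.5
  apt-mark hold qemu-system-x86 qemu-utils qemu-block-extra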

Hope this helps,
Chris

On Thu, 23 Nov 2017 at 15:33 Joe Topjian <j...@topjian.net> wrote:
Hi all,

We're seeing some strange libvirt issues in an Ubuntu 16.04 environment. It's 
running Mitaka, but I don't think this is a problem with OpenStack itself.

We're in the process of upgrading this environment from Ubuntu 14.04 with the 
Mitaka cloud archive to 16.04. Instances are being live migrated (NFS share) to 
a new 16.04 compute node (fresh install), so there's a change between libvirt 
versions (1.2.2 to 1.3.1). The problem we're seeing is only happening on the 
16.04/1.3.1 nodes.

We're getting occasional reports of instances that can't be snapshotted. Upon 
investigation, the snapshot process quits early with a libvirt/qemu lock 
timeout error. We then see that the instance's XML file has disappeared from 
/etc/libvirt/qemu, and we have to restart libvirt and hard-reboot the instance to 
get things back to a normal state. Trying to live-migrate the instance to another 
node causes the same thing to happen.
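For reference, getting a stuck instance back currently boils down to something like
this for us (the UUID is a placeholder):

  # on the affected compute node
  systemctl restart libvirt-bin
  # then hard-reboot the instance from the API
  nova reboot --hard <instance-uuid>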

However, at some random time, either the snapshot or the migration will work 
without error. I haven't been able to reproduce this issue on my own and 
haven't been able to figure out the root cause by inspecting instances reported 
to me.

One thing that has stood out is the length of time it takes for libvirt to 
start. If I run "/etc/init.d/libvirt-bin start", it takes at least 5 minutes 
before a simple "virsh list" will work; until then the command just hangs. If I 
increase libvirt's logging level, I can see that during this period libvirt is 
working through iptables and ebtables rules (it looks like it's shelling out to 
run the commands).

But if I run "libvirtd -l" straight on the command line, all of this completes 
within 5 seconds (including all of the shelling out).
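In case anyone wants to compare, this is roughly what I've been doing (a sketch; the
log settings go in /etc/libvirt/libvirtd.conf and level 1 is debug):

  # /etc/libvirt/libvirtd.conf
  log_level = 1
  log_outputs = "1:file:/var/log/libvirt/libvirtd-debug.log"

  # start through systemd and time how long until virsh responds
  time sh -c 'systemctl start libvirt-bin && virsh list'

  # versus running it directly in the foreground (with the service stopped)
  systemctl stop libvirt-bin
  libvirtd -l    # the ebtables/iptables runs in the debug log finish in seconds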

My initial thought is that systemd is doing some type of throttling between the 
system and user slice, but I've tried comparing slice attributes and, probably 
due to my lack of understanding of systemd, can't find anything to prove this.

Is anyone else running into this problem? Does anyone know what might be the 
cause?

Thanks,
Joe
_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
