We think we've tracked the qemu errors down to a mismatched group ID on a handful of compute nodes.
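
For reference, a rough way to spot that kind of mismatch (a sketch only: it assumes the usual libvirt-qemu/kvm/libvirtd groups and the default /var/lib/nova/instances path, and the compute host names below are placeholders) is to compare the group databases and the numeric ownership on the shared storage across nodes:

    # Compare the GIDs each compute node has for the groups qemu runs under
    for host in compute01 compute02 compute03; do
        echo "== $host =="
        ssh "$host" 'getent group libvirt-qemu kvm libvirtd'
    done

    # On a suspect node, check the numeric owner/group of the instance files on the NFS share
    ls -ln /var/lib/nova/instances | head

If the same GID maps to different group names from node to node (or to no name at all), that's the kind of mismatch we mean.
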
The slow systemd/libvirt startup is still unsolved, but at the moment it does not appear to be the cause of the qemu errors.

On Mon, Nov 27, 2017 at 8:04 AM, Joe Topjian <j...@topjian.net> wrote:

> Hi all,
>
> To my knowledge, we don't use tunneled migrations. This issue is also happening with snapshots, so it's not restricted to just migrations.
>
> I haven't yet tried the apparmor patches that George mentioned. I plan on applying them once I get another report of a problematic instance.
>
> Thank you for the suggestions, though :)
> Joe
>
> On Mon, Nov 27, 2017 at 2:10 AM, Tobias Urdin <tobias.ur...@crystone.com> wrote:
>
>> Hello,
>>
>> This seems to assume tunnelled migrations; the live_migration_flag option is removed in later versions but is there in Mitaka.
>>
>> Do you have the VIR_MIGRATE_TUNNELLED flag set for [libvirt]live_migration_flag in nova.conf?
>>
>> Might be a long shot, but I've removed VIR_MIGRATE_TUNNELLED in our clouds.
>>
>> Best regards
>>
>> On 11/26/2017 01:01 PM, Sean Redmond wrote:
>>
>> Hi,
>>
>> I think it may be related to this:
>>
>> https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
>>
>> Thanks
>>
>> On Thu, Nov 23, 2017 at 6:20 PM, Joe Topjian <j...@topjian.net> wrote:
>>
>>> OK, thanks. We'll definitely look at downgrading in a test environment.
>>>
>>> To add some further info to this problem, here are some log entries. When an instance fails to snapshot or fails to migrate, we see:
>>>
>>> libvirtd[27939]: Cannot start job (modify, none) for domain instance-00004fe4; current job is (modify, none) owned by (27942 remoteDispatchDomainBlockJobAbort, 0 <null>) for (69116s, 0s)
>>>
>>> libvirtd[27939]: Cannot start job (none, migration out) for domain instance-00004fe4; current job is (modify, none) owned by (27942 remoteDispatchDomainBlockJobAbort, 0 <null>) for (69361s, 0s)
>>>
>>> The one piece of this that I'm currently fixated on is the length of time it takes libvirt to start. I'm not sure if it's causing the above, though. When starting libvirt through systemd, it takes much longer to process the iptables and ebtables rules than if we start libvirtd on the command line directly.
>>>
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L libvirt-J-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L libvirt-P-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F libvirt-J-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X libvirt-J-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F libvirt-P-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X libvirt-P-vnet5'
>>>
>>> We're talking about a difference between 5 minutes and 5 seconds depending on where libvirt was started. This doesn't seem normal to me.
>>>
>>> In general, is anyone aware of systemd performing restrictions of some kind on processes which create subprocesses? Or something like that? I've tried comparing cgroups and the various limits within systemd between my shell session and the libvirt-bin.service session and can't find anything immediately noticeable. Maybe it's apparmor?
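
(A note on the cgroup/limits comparison above, in case it is useful to anyone else digging into this: the commands below are only a rough sketch of how to dump what systemd is imposing on the daemon versus what an interactive shell gets. The libvirt-bin.service unit name is taken from the thread; prlimit and aa-status are assumed to be installed.)

    # What systemd says it is imposing on the service
    systemctl show libvirt-bin.service -p Slice -p TasksMax -p LimitNOFILE -p LimitNPROC

    # Which cgroups the running daemon and the current shell land in
    cat /proc/$(pidof -s libvirtd)/cgroup
    cat /proc/self/cgroup

    # Resource limits as seen by the running daemon
    prlimit --pid "$(pidof -s libvirtd)"

    # Whether an apparmor profile is applied to libvirtd
    aa-status | grep -i libvirt

If the slice, TasksMax, or limit values differ between the two environments, that would be the first place to dig.
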
>>>
>>> Thanks,
>>> Joe
>>>
>>> On Thu, Nov 23, 2017 at 11:03 AM, Chris Sarginson <csarg...@gmail.com> wrote:
>>>
>>>> I think we may have pinned libvirt-bin as well (1.3.1), but I can't guarantee that, sorry - I would suggest it's worth trying pinning both initially.
>>>>
>>>> Chris
>>>>
>>>> On Thu, 23 Nov 2017 at 17:42 Joe Topjian <j...@topjian.net> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> Thanks - we will definitely look into this. To confirm: did you also downgrade libvirt, or was it all qemu?
>>>>>
>>>>> Thanks,
>>>>> Joe
>>>>>
>>>>> On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <csarg...@gmail.com> wrote:
>>>>>
>>>>>> We hit the same issue a while back (I suspect), which we seemed to resolve by pinning QEMU and related packages at the following version (you might need to hunt down the debs manually):
>>>>>>
>>>>>> 1:2.5+dfsg-5ubuntu10.5
>>>>>>
>>>>>> I'm certain there's a launchpad bug for Ubuntu qemu regarding this, but I don't have it to hand.
>>>>>>
>>>>>> Hope this helps,
>>>>>> Chris
>>>>>>
>>>>>> On Thu, 23 Nov 2017 at 15:33 Joe Topjian <j...@topjian.net> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We're seeing some strange libvirt issues in an Ubuntu 16.04 environment. It's running Mitaka, but I don't think this is a problem with OpenStack itself.
>>>>>>>
>>>>>>> We're in the process of upgrading this environment from Ubuntu 14.04 with the Mitaka cloud archive to 16.04. Instances are being live migrated (NFS share) to a new 16.04 compute node (fresh install), so there's a change between libvirt versions (1.2.2 to 1.3.1). The problem we're seeing is only happening on the 16.04/1.3.1 nodes.
>>>>>>>
>>>>>>> We're getting occasional reports of instances not able to be snapshotted. Upon investigation, the snapshot process quits early with a libvirt/qemu lock timeout error. We then see that the instance's xml file has disappeared from /etc/libvirt/qemu and must restart libvirt and hard-reboot the instance to get things back to a normal state. Trying to live-migrate the instance to another node causes the same thing to happen.
>>>>>>>
>>>>>>> However, at some random time, either the snapshot or the migration will work without error. I haven't been able to reproduce this issue on my own and haven't been able to figure out the root cause by inspecting instances reported to me.
>>>>>>>
>>>>>>> One thing that has stood out is the length of time it takes for libvirt to start. If I run "/etc/init.d/libvirt-bin start", it takes at least 5 minutes before a simple "virsh list" will work. The command will hang otherwise. If I increase libvirt's logging level, I can see that during this period of time, libvirt is working on iptables and ebtables (looks like it's shelling out commands).
>>>>>>>
>>>>>>> But if I run "libvirtd -l" straight on the command line, all of this completes within 5 seconds (including all of the shelling out).
>>>>>>>
>>>>>>> My initial thought is that systemd is doing some type of throttling between the system and user slice, but I've tried comparing slice attributes and, probably due to my lack of understanding of systemd, can't find anything to prove this.
>>>>>>>
>>>>>>> Is anyone else running into this problem? Does anyone know what might be the cause?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Joe
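
P.S. For anyone who wants to try the version pin Chris mentioned further up the thread, on Ubuntu that can be expressed as an apt preferences file along these lines. The package list here is only a guess (match it to whatever qemu/libvirt packages are actually installed on your compute nodes), and the libvirt entry is optional, since Chris wasn't sure whether libvirt-bin was pinned as well:

    Explanation: keep qemu at the version suggested in this thread (sketch only)
    Package: qemu-kvm qemu-utils qemu-system-x86 qemu-block-extra
    Pin: version 1:2.5+dfsg-5ubuntu10.5
    Pin-Priority: 1001

    Explanation: optionally hold libvirt at the 1.3.1 series
    Package: libvirt-bin libvirt0
    Pin: version 1.3.1*
    Pin-Priority: 1001

Dropping that into /etc/apt/preferences.d/ and then installing the older packages explicitly (apt-get install <package>=<version>) should keep them from being upgraded again; apt-mark hold is a simpler alternative if the right versions are already installed.
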
_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators