Hi all,

To my knowledge, we don't use tunneled migrations. This issue is also happening with snapshots, so it's not restricted to just migrations.
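(In case it helps anyone reading along, a check along these lines on a compute node will show whether the flag is set. This is just a sketch: it assumes the stock /etc/nova/nova.conf location, and tunnelling is only in play if VIR_MIGRATE_TUNNELLED shows up in the value.)

    # Print the Mitaka-era live_migration_flag setting, if one is set explicitly.
    # If nothing is printed, the release default applies -- check the Mitaka
    # nova.conf reference rather than assuming either way.
    grep -i live_migration_flag /etc/nova/nova.conf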
I haven't yet tried the apparmor patches that George mentioned. I plan on applying them once I get another report of a problematic instance. Thank you for the suggestions, though :)

(For anyone searching the archives later, a rough sketch of the qemu pin Chris described is appended below the quoted thread.)

Joe

On Mon, Nov 27, 2017 at 2:10 AM, Tobias Urdin <tobias.ur...@crystone.com> wrote:

> Hello,
>
> This seems to assume tunnelled migrations; the live_migration_flag option is removed in later versions but is there in Mitaka.
>
> Do you have the VIR_MIGRATE_TUNNELLED flag set for [libvirt]live_migration_flag in nova.conf?
>
> Might be a long shot, but I've removed VIR_MIGRATE_TUNNELLED in our clouds.
>
> Best regards
>
> On 11/26/2017 01:01 PM, Sean Redmond wrote:
>
> Hi,
>
> I think it may be related to this:
>
> https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
>
> Thanks
>
> On Thu, Nov 23, 2017 at 6:20 PM, Joe Topjian <j...@topjian.net> wrote:
>
>> OK, thanks. We'll definitely look at downgrading in a test environment.
>>
>> To add some further info to this problem, here are some log entries. When an instance fails to snapshot or fails to migrate, we see:
>>
>> libvirtd[27939]: Cannot start job (modify, none) for domain instance-00004fe4; current job is (modify, none) owned by (27942 remoteDispatchDomainBlockJobAbort, 0 <null>) for (69116s, 0s)
>>
>> libvirtd[27939]: Cannot start job (none, migration out) for domain instance-00004fe4; current job is (modify, none) owned by (27942 remoteDispatchDomainBlockJobAbort, 0 <null>) for (69361s, 0s)
>>
>> The one piece of this that I'm currently fixated on is the length of time it takes libvirt to start. I'm not sure if it's causing the above, though. When starting libvirt through systemd, it takes much longer to process the iptables and ebtables rules than if we start libvirtd on the command line directly.
>>
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L libvirt-J-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L libvirt-P-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F libvirt-J-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X libvirt-J-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F libvirt-P-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X libvirt-P-vnet5'
>>
>> We're talking about a difference between 5 minutes and 5 seconds depending on how libvirt was started. This doesn't seem normal to me.
>>
>> In general, is anyone aware of systemd performing restrictions of some kind on processes which create subprocesses? Or something like that? I've tried comparing cgroups and the various limits within systemd between my shell session and the libvirt-bin.service session and can't find anything immediately noticeable. Maybe it's apparmor?
>>
>> Thanks,
>> Joe
>>
>> On Thu, Nov 23, 2017 at 11:03 AM, Chris Sarginson <csarg...@gmail.com> wrote:
>>
>>> I think we may have pinned libvirt-bin as well (1.3.1), but I can't guarantee that, sorry - I would suggest it's worth trying pinning both initially.
>>>
>>> Chris
>>>
>>> On Thu, 23 Nov 2017 at 17:42 Joe Topjian <j...@topjian.net> wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> Thanks - we will definitely look into this. To confirm: did you also downgrade libvirt, or was it all qemu?
>>>>
>>>> Thanks,
>>>> Joe
>>>>
>>>> On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <csarg...@gmail.com> wrote:
>>>>
>>>>> We hit the same issue a while back (I suspect), which we seemed to resolve by pinning QEMU and related packages at the following version (you might need to hunt down the debs manually):
>>>>>
>>>>> 1:2.5+dfsg-5ubuntu10.5
>>>>>
>>>>> I'm certain there's a launchpad bug for Ubuntu qemu regarding this, but don't have it to hand.
>>>>>
>>>>> Hope this helps,
>>>>> Chris
>>>>>
>>>>> On Thu, 23 Nov 2017 at 15:33 Joe Topjian <j...@topjian.net> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We're seeing some strange libvirt issues in an Ubuntu 16.04 environment. It's running Mitaka, but I don't think this is a problem with OpenStack itself.
>>>>>>
>>>>>> We're in the process of upgrading this environment from Ubuntu 14.04 with the Mitaka cloud archive to 16.04. Instances are being live migrated (NFS share) to a new 16.04 compute node (fresh install), so there's a change in libvirt versions (1.2.2 to 1.3.1). The problem we're seeing is only happening on the 16.04/1.3.1 nodes.
>>>>>>
>>>>>> We're getting occasional reports of instances that can't be snapshotted. Upon investigation, the snapshot process quits early with a libvirt/qemu lock timeout error. We then see that the instance's XML file has disappeared from /etc/libvirt/qemu and must restart libvirt and hard-reboot the instance to get things back to a normal state. Trying to live-migrate the instance to another node causes the same thing to happen.
>>>>>>
>>>>>> However, at some random time, either the snapshot or the migration will work without error. I haven't been able to reproduce this issue on my own and haven't been able to figure out the root cause by inspecting instances reported to me.
>>>>>>
>>>>>> One thing that has stood out is the length of time it takes for libvirt to start. If I run "/etc/init.d/libvirt-bin start", it takes at least 5 minutes before a simple "virsh list" will work. The command will hang otherwise. If I increase libvirt's logging level, I can see that during this period of time, libvirt is working on iptables and ebtables rules (it looks like it's shelling out commands).
>>>>>>
>>>>>> But if I run "libvirtd -l" straight on the command line, all of this completes within 5 seconds (including all of the shelling out).
>>>>>>
>>>>>> My initial thought is that systemd is doing some type of throttling between the system and user slices, but I've tried comparing slice attributes and, probably due to my lack of understanding of systemd, can't find anything to prove this.
>>>>>>
>>>>>> Is anyone else running into this problem? Does anyone know what might be the cause?
>>>>>>
>>>>>> Thanks,
>>>>>> Joe
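P.S. For anyone who later wants to try the downgrade Chris describes in the quoted thread, the rough shape of it on a 16.04 compute node would be something like the commands below. This is only a sketch: the exact set of qemu packages to pin depends on what's installed (as Chris says, you may need to hunt down the debs manually), the package names here are just the usual Ubuntu 16.04 ones, and we haven't verified ourselves yet that this version resolves the problem.

    # Downgrade the qemu packages to the version Chris mentioned
    # (add or remove package names to match what's installed on the node).
    apt-get install qemu-kvm=1:2.5+dfsg-5ubuntu10.5 \
        qemu-system-x86=1:2.5+dfsg-5ubuntu10.5 \
        qemu-utils=1:2.5+dfsg-5ubuntu10.5

    # Hold them so routine upgrades don't pull them forward again.
    apt-mark hold qemu-kvm qemu-system-x86 qemu-utils

A proper pin stanza under /etc/apt/preferences.d would accomplish the same thing more permanently; the hold is just the quicker way to test.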