Hi,

I think it may be related to this:
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
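For anyone else comparing notes: a quick way to check which qemu build a compute node is actually running, and to hold it at the version Chris mentions further down the thread, is an apt pin. This is only a sketch - the package names are examples (match them to what dpkg shows on your nodes), and the pin only helps if the older debs are still reachable from a repository or installed by hand:

  # see what is installed and which versions apt can see
  dpkg -l 'qemu*'
  apt-cache policy qemu-system-x86

  # /etc/apt/preferences.d/qemu-pin (example file name)
  Package: qemu*
  Pin: version 1:2.5+dfsg-5ubuntu10.5
  Pin-Priority: 1001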
Thanks

On Thu, Nov 23, 2017 at 6:20 PM, Joe Topjian <j...@topjian.net> wrote:

> OK, thanks. We'll definitely look at downgrading in a test environment.
>
> To add some further info to this problem, here are some log entries. When
> an instance fails to snapshot or fails to migrate, we see:
>
> libvirtd[27939]: Cannot start job (modify, none) for domain
> instance-00004fe4; current job is (modify, none) owned by (27942
> remoteDispatchDomainBlockJobAbort, 0 <null>) for (69116s, 0s)
>
> libvirtd[27939]: Cannot start job (none, migration out) for domain
> instance-00004fe4; current job is (modify, none) owned by (27942
> remoteDispatchDomainBlockJobAbort, 0 <null>) for (69361s, 0s)
>
> The one piece of this that I'm currently fixated on is the length of time
> it takes libvirt to start. I'm not sure if it's causing the above, though.
> When starting libvirt through systemd, it takes much longer to process the
> iptables and ebtables rules than if we start libvirtd on the command line
> directly.
>
> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
> nat -L libvirt-J-vnet5'
> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
> nat -L libvirt-P-vnet5'
> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
> nat -F libvirt-J-vnet5'
> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
> nat -X libvirt-J-vnet5'
> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
> nat -F libvirt-P-vnet5'
> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
> nat -X libvirt-P-vnet5'
>
> We're talking about a difference between 5 minutes and 5 seconds depending
> on where libvirt was started. This doesn't seem normal to me.
>
> In general, is anyone aware of systemd performing restrictions of some
> kind on processes which create subprocesses? Or something like that? I've
> tried comparing cgroups and the various limits within systemd between my
> shell session and the libvirt-bin.service session and can't find anything
> immediately noticeable. Maybe it's apparmor?
>
> Thanks,
> Joe
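A side note in case it helps narrow the systemd question above: the unit's environment and a root shell can be compared fairly directly. The unit name below is the one used in this thread; everything else is stock systemd/util-linux/apparmor tooling, so treat it as a rough checklist rather than a recipe:

  # limits systemd applies to the service
  systemctl show libvirt-bin.service -p Slice -p TasksMax -p LimitNOFILE -p LimitNPROC

  # cgroups and rlimits of the running daemon vs. the current shell
  cat /proc/$(pidof libvirtd)/cgroup /proc/$$/cgroup
  prlimit --pid $(pidof libvirtd)
  prlimit --pid $$

  # whether apparmor is confining the daemon
  cat /proc/$(pidof libvirtd)/attr/current
  aa-status | grep -i libvirt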
>
> On Thu, Nov 23, 2017 at 11:03 AM, Chris Sarginson <csarg...@gmail.com>
> wrote:
>
>> I think we may have pinned libvirt-bin as well (1.3.1), but I can't
>> guarantee that, sorry - I would suggest it's worth trying pinning both
>> initially.
>>
>> Chris
>>
>> On Thu, 23 Nov 2017 at 17:42 Joe Topjian <j...@topjian.net> wrote:
>>
>>> Hi Chris,
>>>
>>> Thanks - we will definitely look into this. To confirm: did you also
>>> downgrade libvirt, or was it all qemu?
>>>
>>> Thanks,
>>> Joe
>>>
>>> On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <csarg...@gmail.com>
>>> wrote:
>>>
>>>> We hit the same issue a while back (I suspect), which we seemed to
>>>> resolve by pinning QEMU and related packages at the following version
>>>> (you might need to hunt down the debs manually):
>>>>
>>>> 1:2.5+dfsg-5ubuntu10.5
>>>>
>>>> I'm certain there's a launchpad bug for Ubuntu qemu regarding this, but
>>>> I don't have it to hand.
>>>>
>>>> Hope this helps,
>>>> Chris
>>>>
>>>> On Thu, 23 Nov 2017 at 15:33 Joe Topjian <j...@topjian.net> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We're seeing some strange libvirt issues in an Ubuntu 16.04
>>>>> environment. It's running Mitaka, but I don't think this is a problem
>>>>> with OpenStack itself.
>>>>>
>>>>> We're in the process of upgrading this environment from Ubuntu 14.04
>>>>> with the Mitaka cloud archive to 16.04. Instances are being live
>>>>> migrated (NFS share) to a new 16.04 compute node (fresh install), so
>>>>> there's a change between libvirt versions (1.2.2 to 1.3.1). The
>>>>> problem we're seeing is only happening on the 16.04/1.3.1 nodes.
>>>>>
>>>>> We're getting occasional reports of instances that cannot be
>>>>> snapshotted. Upon investigation, the snapshot process quits early with
>>>>> a libvirt/qemu lock timeout error. We then see that the instance's XML
>>>>> file has disappeared from /etc/libvirt/qemu, and we must restart
>>>>> libvirt and hard-reboot the instance to get things back to a normal
>>>>> state. Trying to live-migrate the instance to another node causes the
>>>>> same thing to happen.
>>>>>
>>>>> However, at some random time, either the snapshot or the migration
>>>>> will work without error. I haven't been able to reproduce this issue
>>>>> on my own and haven't been able to figure out the root cause by
>>>>> inspecting instances reported to me.
>>>>>
>>>>> One thing that has stood out is the length of time it takes for
>>>>> libvirt to start. If I run "/etc/init.d/libvirt-bin start", it takes
>>>>> at least 5 minutes before a simple "virsh list" will work. The command
>>>>> will hang otherwise. If I increase libvirt's logging level, I can see
>>>>> that during this period of time, libvirt is working on iptables and
>>>>> ebtables (it looks like it's shelling out commands).
>>>>>
>>>>> But if I run "libvirtd -l" straight on the command line, all of this
>>>>> completes within 5 seconds (including all of the shelling out).
>>>>>
>>>>> My initial thought is that systemd is doing some type of throttling
>>>>> between the system and user slices, but I've tried comparing slice
>>>>> attributes and, probably due to my lack of understanding of systemd,
>>>>> can't find anything to prove this.
>>>>>
>>>>> Is anyone else running into this problem? Does anyone know what might
>>>>> be the cause?
>>>>>
>>>>> Thanks,
>>>>> Joe
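Two things that may be worth capturing while an instance is stuck, before restarting libvirt. The config keys are standard libvirtd.conf settings for the 1.3.x daemon; the domain name is the one from Joe's logs and the disk target is just an example - take the real one from domblklist:

  # /etc/libvirt/libvirtd.conf - more verbose daemon logging, then restart libvirtd
  log_level = 1
  log_outputs = "1:file:/var/log/libvirt/libvirtd.log"

  # check whether a stale block job is still holding the domain job lock,
  # and whether it can be aborted cleanly
  virsh domblklist instance-00004fe4
  virsh blockjob instance-00004fe4 vda --info
  virsh blockjob instance-00004fe4 vda --abort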
_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators