Sending another update on this to just the OpenPOWER cluster users. Unfortunately, this problem is still happening, even on the nodes that have been rebooted.

To reboot the nodes, I need to live migrate all of the VMs onto other nodes. Normally this isn't an issue, but I noticed yesterday that some instances were failing during the migration. Upon further investigation, I found that some of these VMs were running on both the old and the new node at the same time, which is not a good thing for their file systems. I've already fixed a few VMs that were in this state, but I'm still working through others that might be in a bad state. Before booting those systems, I'm doing a forced filesystem check in rescue mode. I'll let you know once I'm done going through all the VMs, in case I missed anything. In the meantime, I'm not doing any more live migrations until this is resolved.
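In case it's useful, the check I'm running is roughly along these lines (a rough sketch only: the hostnames are placeholders, and virsh over SSH stands in for our actual tooling):

    #!/usr/bin/env python3
    # Sketch: flag instances that virsh reports as running on more than
    # one hypervisor. Hostnames below are illustrative placeholders.
    import subprocess
    from collections import defaultdict

    HYPERVISORS = ["openpower1.example.org", "openpower2.example.org"]

    def running_domains(host):
        """Return the set of running libvirt domains on a hypervisor via SSH."""
        out = subprocess.check_output(
            ["ssh", host, "virsh", "list", "--name", "--state-running"],
            text=True)
        return {name for name in out.splitlines() if name.strip()}

    seen = defaultdict(list)
    for host in HYPERVISORS:
        for dom in running_domains(host):
            seen[dom].append(host)

    for dom, hosts in sorted(seen.items()):
        if len(hosts) > 1:
            # The same instance active on two nodes means both are writing
            # to the same disk, which is what corrupts the filesystem.
            print(f"DUPLICATE: {dom} running on {', '.join(hosts)}")

Once the stray copy is shut down, the forced check itself is just an fsck -f against the guest's filesystems from rescue media before booting it normally.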
On the original issue, I unfortunately haven't made any progress narrowing down the cause. One option is to go ahead with the Queens upgrade and see whether the problem persists, but I'd feel much better getting this fixed before we attempt the upgrade. I'll continue looking into it this week. Thanks for your patience.

On Tue, May 21, 2019 at 12:16 PM Lance Albertson <la...@osuosl.org> wrote:

> All,
>
> I wanted to send you an update on where we are on this issue. So far
> I've narrowed the problem down to happening when a VM using a private
> network is removed, which causes certain iptables rules on the hypervisor
> to fall out of order. It only seems to affect inbound connections to the
> VM; outbound traffic appears to still work. Unfortunately, I haven't been
> able to reproduce the issue easily, which makes it difficult to
> troubleshoot. I've looked through the source code and searched online to
> see whether anyone else has run into this, without success.
>
> I've rebooted all of the hypervisors on our x86 cluster and two on our
> ppc cluster (which was needed for the MDS updates). So far we haven't
> seen any issues on the rebooted nodes, but I need to let them run for a
> few days to verify that theory. These machines were also due for a reboot
> because of the CentOS 7.5 -> 7.6 upgrade, so perhaps it's related to that.
>
> At any rate, I've deployed a temporary cronjob on the nodes that haven't
> been rebooted which should "fix" the networking issue. It runs every
> minute so that any downtime should be minimal.
>
> I'll send another update as soon as I have one.
>
> Thanks-
>
> On Thu, May 16, 2019 at 8:58 AM Lance Albertson <la...@osuosl.org> wrote:
>
>> All,
>>
>> Since the upgrade to Pike we've noticed virtual machines suddenly
>> losing network connectivity. The issue sometimes fixes itself, or clears
>> when we restart the neutron-linuxbridge-agent service on the
>> hypervisors. We're doing our best to track down why this is happening
>> and how to fix it. Since we're not monitoring every host on the cluster,
>> it's difficult for us to know when it happens, so if you do have a
>> problem with one of your VMs, please let us know either via IRC in
>> #osuosl on Freenode or via a support email.
>>
>> I'll send further updates as we have them.
>>
>> Thanks for your patience!
>>
>> --
>> Lance Albertson
>> Director
>> Oregon State University | Open Source Lab
>
>
> --
> Lance Albertson
> Director
> Oregon State University | Open Source Lab

--
Lance Albertson
Director
Oregon State University | Open Source Lab
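P.S. For reference, the temporary cronjob from the May 21 update quoted above amounts, in spirit, to something like this (a hedged sketch, not the exact script we deployed: the canary address is a placeholder, and probing with ping then restarting neutron-linuxbridge-agent is simply the minimal version of the behavior described in this thread):

    #!/usr/bin/env python3
    # Sketch of the cron-driven workaround: probe a guest that should
    # always answer, and restart the Neutron agent if it doesn't. The
    # probe address and the ping check are illustrative assumptions.
    import subprocess
    import sys

    CANARY_IP = "10.0.0.10"  # placeholder: a VM on the affected private network

    def reachable(ip):
        """One ICMP echo with a short timeout; True if the guest answers."""
        return subprocess.call(
            ["ping", "-c", "1", "-W", "2", ip],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

    if not reachable(CANARY_IP):
        # Restarting the agent rebuilds its iptables rules, which is what
        # restored inbound connectivity when this happened before.
        subprocess.check_call(
            ["systemctl", "restart", "neutron-linuxbridge-agent"])
        print("neutron-linuxbridge-agent restarted", file=sys.stderr)

Cron invokes it every minute, matching the interval mentioned above, so any window of lost inbound connectivity stays short.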