Could it be faulty RAM? Do you use ECC RAM?

Best Regards,
Strahil Nikolov
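P.S. If you are not sure, you can check from a running host with something like this (just a sketch; it assumes dmidecode is installed, and the grep pattern is only an example):

# dmidecode -t memory | grep -i 'error correction'

Any corrected/uncorrected memory errors usually also show up from the EDAC driver in dmesg.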
On Tuesday, December 1, 2020, 06:17:10 GMT+2, Vinícius Ferrão via Users <[email protected]> wrote:

Hi again, I had to shut down everything because of a power outage in the office. When trying to get the infra up again, even the Engine has become corrupted:

[  772.466982] XFS (dm-4): Invalid superblock magic number
mount: /var: wrong fs type, bad option, bad superblock on /dev/mapper/ovirt-var, missing codepage or helper program, or other error.
[  772.472885] XFS (dm-3): Mounting V5 Filesystem
[  773.629700] XFS (dm-3): Starting recovery (logdev: internal)
[  773.731104] XFS (dm-3): Metadata CRC error detected at xfs_agfl_read_verify+0xa1/0xf0 [xfs], xfs_agfl block 0xf00003
[  773.734352] XFS (dm-3): Unmount and run xfs_repair
[  773.736216] XFS (dm-3): First 128 bytes of corrupted metadata buffer:
[  773.738458] 00000000: 23 31 31 35 36 35 35 34 29 00 2d 20 52 65 62 75  #1156554).- Rebu
[  773.741044] 00000010: 69 6c 74 20 66 6f 72 20 68 74 74 70 73 3a 2f 2f  ilt for https://
[  773.743636] 00000020: 66 65 64 6f 72 61 70 72 6f 6a 65 63 74 2e 6f 72  fedoraproject.or
[  773.746191] 00000030: 67 2f 77 69 6b 69 2f 46 65 64 6f 72 61 5f 32 33  g/wiki/Fedora_23
[  773.748818] 00000040: 5f 4d 61 73 73 5f 52 65 62 75 69 6c 64 00 2d 20  _Mass_Rebuild.-
[  773.751399] 00000050: 44 72 6f 70 20 6f 62 73 6f 6c 65 74 65 20 64 65  Drop obsolete de
[  773.753933] 00000060: 66 61 74 74 72 20 73 74 61 6e 7a 61 73 20 28 23  fattr stanzas (#
[  773.756428] 00000070: 31 30 34 37 30 33 31 29 00 2d 20 49 6e 73 74 61  1047031).- Insta
[  773.758873] XFS (dm-3): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0xf00003 len 1 error 74
[  773.763756] XFS (dm-3): xfs_do_force_shutdown(0x8) called from line 446 of file fs/xfs/libxfs/xfs_defer.c. Return address = 00000000962bd5ee
[  773.769363] XFS (dm-3): Corruption of in-memory data detected. Shutting down filesystem
[  773.772643] XFS (dm-3): Please unmount the filesystem and rectify the problem(s)
[  773.776079] XFS (dm-3): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[  773.779113] XFS (dm-3): xlog_recover_clear_agi_bucket: failed to clear agi 3. Continuing.
[  773.783039] XFS (dm-3): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[  773.785698] XFS (dm-3): xlog_recover_clear_agi_bucket: failed to clear agi 3. Continuing.
[  773.790023] XFS (dm-3): Ending recovery (logdev: internal)
[  773.792489] XFS (dm-3): Error -5 recovering leftover CoW allocations.
mount: /var/log: can't read superblock on /dev/mapper/ovirt-log.
mount: /var/log/audit: mount point does not exist.

/var seems to be completely trashed. The only time that I've seen something like this was with faulty hardware. But nothing shows up in the logs, as far as I know.

After forcing repairs with -L I got other issues:

mount -a
[  326.170941] XFS (dm-4): Mounting V5 Filesystem
[  326.404788] XFS (dm-4): Ending clean mount
[  326.415291] XFS (dm-3): Mounting V5 Filesystem
[  326.611673] XFS (dm-3): Ending clean mount
[  326.621705] XFS (dm-2): Mounting V5 Filesystem
[  326.784067] XFS (dm-2): Starting recovery (logdev: internal)
[  326.792083] XFS (dm-2): Metadata CRC error detected at xfs_agi_read_verify+0xc7/0xf0 [xfs], xfs_agi block 0x2
[  326.794445] XFS (dm-2): Unmount and run xfs_repair
[  326.795557] XFS (dm-2): First 128 bytes of corrupted metadata buffer:
[  326.797055] 00000000: 4d 33 44 34 39 56 00 00 80 00 00 00 f0 cf 00 00  M3D49V..........
[  326.799685] 00000010: 00 00 00 00 02 00 00 00 23 10 00 00 3d 08 01 08  ........#...=...
[  326.802290] 00000020: 21 27 44 34 39 56 00 00 00 d0 00 00 01 00 00 00  !'D49V..........
[  326.804748] 00000030: 50 00 00 00 00 00 00 00 23 10 00 00 41 01 08 08  P.......#...A...
[  326.807296] 00000040: 21 27 44 34 39 56 00 00 10 d0 00 00 02 00 00 00  !'D49V..........
[  326.809883] 00000050: 60 00 00 00 00 00 00 00 23 10 00 00 41 01 08 08  `.......#...A...
[  326.812345] 00000060: 61 2f 44 34 39 56 00 00 00 00 00 00 00 00 00 00  a/D49V..........
[  326.814831] 00000070: 50 34 00 00 00 00 00 00 23 10 00 00 82 08 08 04  P4......#.......
[  326.817237] XFS (dm-2): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x2 len 1 error 74
mount: /var/log/audit: mount(2) system call failed: Structure needs cleaning.

But after more xfs_repair -L the engine is up… Now I need to scavenge the other VMs and do the same thing.

That's it. Thanks all,
V.

PS: For those interested, there's a paste of the fixes: https://pastebin.com/jsMguw6j
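In short, for each filesystem that refused to mount, the recovery was roughly the sequence below (a sketch; ovirt-var and ovirt-log are the LVs from my layout, repeat for whatever else fails, and keep in mind that xfs_repair -L zeroes the XFS metadata log, so it's a last resort that can lose the most recent metadata changes):

umount /var/log /var    # make sure the damaged filesystems are not mounted
xfs_repair -L /dev/mapper/ovirt-var
xfs_repair -L /dev/mapper/ovirt-log
mount -a                # then re-check dmesg for new XFS errors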
> On 29 Nov 2020, at 17:03, Strahil Nikolov <[email protected]> wrote:
>
> Damn...
> You are using EFI boot. Does this happen only to EFI machines?
> Did you notice if only EL 8 is affected?
>
> Best Regards,
> Strahil Nikolov
>
> On Sunday, November 29, 2020, 19:36:09 GMT+2, Vinícius Ferrão <[email protected]> wrote:
>
> Yes!
>
> I have a live VM right now that will be dead on a reboot:
>
> [root@kontainerscomk ~]# cat /etc/*release
> NAME="Red Hat Enterprise Linux"
> VERSION="8.3 (Ootpa)"
> ID="rhel"
> ID_LIKE="fedora"
> VERSION_ID="8.3"
> PLATFORM_ID="platform:el8"
> PRETTY_NAME="Red Hat Enterprise Linux 8.3 (Ootpa)"
> ANSI_COLOR="0;31"
> CPE_NAME="cpe:/o:redhat:enterprise_linux:8.3:GA"
> HOME_URL="https://www.redhat.com/"
> BUG_REPORT_URL="https://bugzilla.redhat.com/"
> REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
> REDHAT_BUGZILLA_PRODUCT_VERSION=8.3
> REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
> REDHAT_SUPPORT_PRODUCT_VERSION="8.3"
> Red Hat Enterprise Linux release 8.3 (Ootpa)
> Red Hat Enterprise Linux release 8.3 (Ootpa)
>
> [root@kontainerscomk ~]# sysctl -a | grep dirty
> vm.dirty_background_bytes = 0
> vm.dirty_background_ratio = 10
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 30
> vm.dirty_writeback_centisecs = 500
> vm.dirtytime_expire_seconds = 43200
>
> [root@kontainerscomk ~]# xfs_db -r /dev/dm-0
> xfs_db: /dev/dm-0 is not a valid XFS filesystem (unexpected SB magic number 0xa82a0000)
> Use -F to force a read attempt.
> [root@kontainerscomk ~]# xfs_db -r /dev/dm-0 -F
> xfs_db: /dev/dm-0 is not a valid XFS filesystem (unexpected SB magic number 0xa82a0000)
> xfs_db: size check failed
> xfs_db: V1 inodes unsupported. Please try an older xfsprogs.
>
> [root@kontainerscomk ~]# cat /etc/fstab
> #
> # /etc/fstab
> # Created by anaconda on Thu Nov 19 22:40:39 2020
> #
> # Accessible filesystems, by reference, are maintained under '/dev/disk/'.
> # See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info.
> #
> # After editing this file, run 'systemctl daemon-reload' to update systemd
> # units generated from this file.
> #
> /dev/mapper/rhel-root                      /          xfs   defaults                    0 0
> UUID=ad84d1ea-c9cc-4b22-8338-d1a6b2c7d27e  /boot      xfs   defaults                    0 0
> UUID=4642-2FF6                             /boot/efi  vfat  umask=0077,shortname=winnt  0 2
> /dev/mapper/rhel-swap                      none       swap  defaults                    0 0
>
> Thanks,
>
> -----Original Message-----
> From: Strahil Nikolov <[email protected]>
> Sent: Sunday, November 29, 2020 2:33 PM
> To: Vinícius Ferrão <[email protected]>
> Cc: users <[email protected]>
> Subject: Re: [ovirt-users] Re: Constantly XFS in memory corruption inside VMs
>
> Can you check the output on the VM that was affected:
> # cat /etc/*release
> # sysctl -a | grep dirty
>
> Best Regards,
> Strahil Nikolov
>
> On Sunday, November 29, 2020, 19:07:48 GMT+2, Vinícius Ferrão via Users <[email protected]> wrote:
>
> Hi Strahil.
>
> I'm not using barrier options on mount. It's the default settings from the CentOS install.
>
> I have some additional findings: there's a big number of discarded packets on the switch, on the hypervisor interfaces.
>
> Discards should be OK as far as I know; I hope TCP handles this and does the proper retransmissions, but I wonder whether this may be related or not. Our storage is over NFS. My general expertise is with iSCSI, and I've never seen this kind of issue with iSCSI, not that I'm aware of.
>
> In other clusters I've seen a high number of discards with iSCSI on XenServer 7.2, but there's no corruption on the VMs there...
>
> Thanks,
>
> Sent from my iPhone
>
>> On 29 Nov 2020, at 04:00, Strahil Nikolov <[email protected]> wrote:
>>
>> Are you using the "nobarrier" mount option in the VM?
>>
>> If yes, can you try to remove the "nobarrier" option?
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On Saturday, November 28, 2020, 19:25:48 GMT+2, Vinícius Ferrão <[email protected]> wrote:
>>
>> Hi Strahil,
>>
>> I moved a running VM to the other host, rebooted, and no corruption was found; if there's any corruption it may be silent corruption... I've seen cases where the VM was new, just installed, I ran dnf -y update to get the updated packages, rebooted, and boom: XFS corruption. So perhaps the migration process isn't the one to blame.
>>
>> But, in fact, I remember a VM that went down during a migration, and when I rebooted it, it was corrupted. This may not be related, though; it was perhaps already in an inconsistent state.
>>
>> Anyway, here are the mount options:
>>
>> Host1:
>> 192.168.10.14:/mnt/pool0/ovirt/vm on /rhev/data-center/mnt/192.168.10.14:_mnt_pool0_ovirt_vm type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,soft,nosharecache,proto=tcp,timeo=100,retrans=3,sec=sys,clientaddr=192.168.10.1,local_lock=none,addr=192.168.10.14)
>>
>> Host2:
>> 192.168.10.14:/mnt/pool0/ovirt/vm on /rhev/data-center/mnt/192.168.10.14:_mnt_pool0_ovirt_vm type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,soft,nosharecache,proto=tcp,timeo=100,retrans=3,sec=sys,clientaddr=192.168.10.1,local_lock=none,addr=192.168.10.14)
>>
>> The options are the default ones; I haven't changed anything when configuring this cluster.
>>
>> Thanks.
>>
>> -----Original Message-----
>> From: Strahil Nikolov <[email protected]>
>> Sent: Saturday, November 28, 2020 1:54 PM
>> To: users <[email protected]>; Vinícius Ferrão <[email protected]>
>> Subject: Re: [ovirt-users] Constantly XFS in memory corruption inside VMs
>>
>> Can you try with a test VM, whether this happens after a Virtual Machine migration?
>>
>> What are your mount options for the storage domain?
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On Saturday, November 28, 2020, 18:25:15 GMT+2, Vinícius Ferrão via Users <[email protected]> wrote:
>>
>> Hello,
>>
>> I'm trying to discover why an oVirt 4.4.3 cluster with two hosts and NFS shared storage on TrueNAS 12.0 is constantly getting XFS corruption inside the VMs.
>>
>> For no apparent reason VMs get corrupted: sometimes they halt, sometimes they are silently corrupted, and after a reboot the system is unable to boot due to "corruption of in-memory data detected". Sometimes the corrupted data is "all zeroes", sometimes there's data there. In extreme cases XFS superblock 0 gets corrupted and the system cannot even detect an XFS partition anymore, since the XFS magic number is gone from the first blocks of the virtual disk.
>>
>> This has been happening for a month now. We had to roll back some backups, and I don't trust the state of the VMs anymore.
>>
>> Using xfs_db I can see that some VMs have corrupted superblocks while the VM is up. One in particular was running with sb0 corrupted, so I knew that when a reboot kicked in the machine would be gone, and that's exactly what happened.
>>
>> Another day I was just installing a new CentOS 8 VM, and after running dnf -y update and a reboot the VM was corrupted and needed XFS repair. That was an extreme case.
>>
>> So, I've looked at the TrueNAS logs, and there's apparently nothing wrong with the system. No errors logged in dmesg, nothing in /var/log/messages, and no errors on the zpools, not even after scrub operations. On the switch, a Catalyst 2960X, we've been monitoring all its interfaces. There are no "up and down" events and zero errors on all interfaces (we have a 4x port LACP on the TrueNAS side and a 2x port LACP on each host); everything seems to be fine. The only metric that I was unable to get is "dropped packets", but I don't know if this can be an issue or not.
>>
>> Finally, on oVirt, I can't find anything either. I looked at /var/log/messages and /var/log/sanlock.log but found nothing suspicious.
>>
>> Is anyone out there experiencing this? Our VMs are mainly CentOS 7/8 with XFS; there are 3 Windows VMs that do not seem to be affected, but everything else is affected.
>>
>> Thanks all.

