Could it be faulty RAM? Do you use ECC RAM?

Best Regards,
Strahil Nikolov
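P.S. If you are not sure, you can check from a running host with something like this (just a sketch; it assumes dmidecode is installed, and the grep pattern is only an example):

# dmidecode -t memory | grep -i 'error correction'

Any corrected/uncorrected memory errors usually also show up from the EDAC driver in dmesg.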
On Tuesday, December 1, 2020, 06:17:10 GMT+2, Vinícius Ferrão via Users <[email protected]> wrote:

Hi again, I had to shut down everything because of a power outage in the office. When trying to get the infra up again, even the Engine has become corrupted:

[  772.466982] XFS (dm-4): Invalid superblock magic number
mount: /var: wrong fs type, bad option, bad superblock on /dev/mapper/ovirt-var, missing codepage or helper program, or other error.
[  772.472885] XFS (dm-3): Mounting V5 Filesystem
[  773.629700] XFS (dm-3): Starting recovery (logdev: internal)
[  773.731104] XFS (dm-3): Metadata CRC error detected at xfs_agfl_read_verify+0xa1/0xf0 [xfs], xfs_agfl block 0xf00003
[  773.734352] XFS (dm-3): Unmount and run xfs_repair
[  773.736216] XFS (dm-3): First 128 bytes of corrupted metadata buffer:
[  773.738458] 00000000: 23 31 31 35 36 35 35 34 29 00 2d 20 52 65 62 75  #1156554).- Rebu
[  773.741044] 00000010: 69 6c 74 20 66 6f 72 20 68 74 74 70 73 3a 2f 2f  ilt for https://
[  773.743636] 00000020: 66 65 64 6f 72 61 70 72 6f 6a 65 63 74 2e 6f 72  fedoraproject.or
[  773.746191] 00000030: 67 2f 77 69 6b 69 2f 46 65 64 6f 72 61 5f 32 33  g/wiki/Fedora_23
[  773.748818] 00000040: 5f 4d 61 73 73 5f 52 65 62 75 69 6c 64 00 2d 20  _Mass_Rebuild.-
[  773.751399] 00000050: 44 72 6f 70 20 6f 62 73 6f 6c 65 74 65 20 64 65  Drop obsolete de
[  773.753933] 00000060: 66 61 74 74 72 20 73 74 61 6e 7a 61 73 20 28 23  fattr stanzas (#
[  773.756428] 00000070: 31 30 34 37 30 33 31 29 00 2d 20 49 6e 73 74 61  1047031).- Insta
[  773.758873] XFS (dm-3): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0xf00003 len 1 error 74
[  773.763756] XFS (dm-3): xfs_do_force_shutdown(0x8) called from line 446 of file fs/xfs/libxfs/xfs_defer.c. Return address = 00000000962bd5ee
[  773.769363] XFS (dm-3): Corruption of in-memory data detected. Shutting down filesystem
[  773.772643] XFS (dm-3): Please unmount the filesystem and rectify the problem(s)
[  773.776079] XFS (dm-3): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[  773.779113] XFS (dm-3): xlog_recover_clear_agi_bucket: failed to clear agi 3. Continuing.
[  773.783039] XFS (dm-3): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[  773.785698] XFS (dm-3): xlog_recover_clear_agi_bucket: failed to clear agi 3. Continuing.
[  773.790023] XFS (dm-3): Ending recovery (logdev: internal)
[  773.792489] XFS (dm-3): Error -5 recovering leftover CoW allocations.
mount: /var/log: can't read superblock on /dev/mapper/ovirt-log.
mount: /var/log/audit: mount point does not exist.

/var seems to be completely trashed. The only time that I've seen something like this was with faulty hardware. But nothing shows up in the logs, as far as I know.

After forcing repairs with -L I got other issues:

mount -a
[  326.170941] XFS (dm-4): Mounting V5 Filesystem
[  326.404788] XFS (dm-4): Ending clean mount
[  326.415291] XFS (dm-3): Mounting V5 Filesystem
[  326.611673] XFS (dm-3): Ending clean mount
[  326.621705] XFS (dm-2): Mounting V5 Filesystem
[  326.784067] XFS (dm-2): Starting recovery (logdev: internal)
[  326.792083] XFS (dm-2): Metadata CRC error detected at xfs_agi_read_verify+0xc7/0xf0 [xfs], xfs_agi block 0x2
[  326.794445] XFS (dm-2): Unmount and run xfs_repair
[  326.795557] XFS (dm-2): First 128 bytes of corrupted metadata buffer:
[  326.797055] 00000000: 4d 33 44 34 39 56 00 00 80 00 00 00 f0 cf 00 00  M3D49V..........
[  326.799685] 00000010: 00 00 00 00 02 00 00 00 23 10 00 00 3d 08 01 08  ........#...=...
[  326.802290] 00000020: 21 27 44 34 39 56 00 00 00 d0 00 00 01 00 00 00  !'D49V..........
[  326.804748] 00000030: 50 00 00 00 00 00 00 00 23 10 00 00 41 01 08 08  P.......#...A...
[  326.807296] 00000040: 21 27 44 34 39 56 00 00 10 d0 00 00 02 00 00 00  !'D49V..........
[  326.809883] 00000050: 60 00 00 00 00 00 00 00 23 10 00 00 41 01 08 08  `.......#...A...
[  326.812345] 00000060: 61 2f 44 34 39 56 00 00 00 00 00 00 00 00 00 00  a/D49V..........
[  326.814831] 00000070: 50 34 00 00 00 00 00 00 23 10 00 00 82 08 08 04  P4......#.......
[  326.817237] XFS (dm-2): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x2 len 1 error 74
mount: /var/log/audit: mount(2) system call failed: Structure needs cleaning.

But after more xfs_repair -L the engine is up… Now I need to scavenge the other VMs and do the same thing.

That's it. Thanks all,
V.

PS: For those interested, there's a paste of the fixes: https://pastebin.com/jsMguw6j
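In short, for each filesystem that refused to mount, the recovery was roughly the sequence below (a sketch; ovirt-var and ovirt-log are the LVs from my layout, repeat for whatever else fails, and keep in mind that xfs_repair -L zeroes the XFS metadata log, so it's a last resort that can lose the most recent metadata changes):

umount /var/log /var    # make sure the damaged filesystems are not mounted
xfs_repair -L /dev/mapper/ovirt-var
xfs_repair -L /dev/mapper/ovirt-log
mount -a                # then re-check dmesg for new XFS errors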
> On 29 Nov 2020, at 17:03, Strahil Nikolov <[email protected]> wrote:
>
> Damn...
> You are using EFI boot. Does this happen only to EFI machines?
> Did you notice if only EL 8 is affected?
>
> Best Regards,
> Strahil Nikolov
>
> On Sunday, November 29, 2020, 19:36:09 GMT+2, Vinícius Ferrão <[email protected]> wrote:
>
> Yes!
>
> I have a live VM right now that will be dead on a reboot:
>
> [root@kontainerscomk ~]# cat /etc/*release
> NAME="Red Hat Enterprise Linux"
> VERSION="8.3 (Ootpa)"
> ID="rhel"
> ID_LIKE="fedora"
> VERSION_ID="8.3"
> PLATFORM_ID="platform:el8"
> PRETTY_NAME="Red Hat Enterprise Linux 8.3 (Ootpa)"
> ANSI_COLOR="0;31"
> CPE_NAME="cpe:/o:redhat:enterprise_linux:8.3:GA"
> HOME_URL="https://www.redhat.com/"
> BUG_REPORT_URL="https://bugzilla.redhat.com/"
> REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
> REDHAT_BUGZILLA_PRODUCT_VERSION=8.3
> REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
> REDHAT_SUPPORT_PRODUCT_VERSION="8.3"
> Red Hat Enterprise Linux release 8.3 (Ootpa)
> Red Hat Enterprise Linux release 8.3 (Ootpa)
>
> [root@kontainerscomk ~]# sysctl -a | grep dirty
> vm.dirty_background_bytes = 0
> vm.dirty_background_ratio = 10
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 30
> vm.dirty_writeback_centisecs = 500
> vm.dirtytime_expire_seconds = 43200
>
> [root@kontainerscomk ~]# xfs_db -r /dev/dm-0
> xfs_db: /dev/dm-0 is not a valid XFS filesystem (unexpected SB magic number 0xa82a0000)
> Use -F to force a read attempt.
> [root@kontainerscomk ~]# xfs_db -r /dev/dm-0 -F
> xfs_db: /dev/dm-0 is not a valid XFS filesystem (unexpected SB magic number 0xa82a0000)
> xfs_db: size check failed
> xfs_db: V1 inodes unsupported. Please try an older xfsprogs.
>
> [root@kontainerscomk ~]# cat /etc/fstab
> #
> # /etc/fstab
> # Created by anaconda on Thu Nov 19 22:40:39 2020
> #
> # Accessible filesystems, by reference, are maintained under '/dev/disk/'.
> # See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info.
> #
> # After editing this file, run 'systemctl daemon-reload' to update systemd
> # units generated from this file.
> #
> /dev/mapper/rhel-root                      /          xfs   defaults                    0 0
> UUID=ad84d1ea-c9cc-4b22-8338-d1a6b2c7d27e  /boot      xfs   defaults                    0 0
> UUID=4642-2FF6                             /boot/efi  vfat  umask=0077,shortname=winnt  0 2
> /dev/mapper/rhel-swap                      none       swap  defaults                    0 0
>
> Thanks,
>
> -----Original Message-----
> From: Strahil Nikolov <[email protected]>
> Sent: Sunday, November 29, 2020 2:33 PM
> To: Vinícius Ferrão <[email protected]>
> Cc: users <[email protected]>
> Subject: Re: [ovirt-users] Re: Constantly XFS in memory corruption inside VMs
>
> Can you check the output on the VM that was affected:
> # cat /etc/*release
> # sysctl -a | grep dirty
>
> Best Regards,
> Strahil Nikolov
>
> On Sunday, November 29, 2020, 19:07:48 GMT+2, Vinícius Ferrão via Users <[email protected]> wrote:
>
> Hi Strahil.
>
> I'm not using barrier options on mount. It's the default settings from the CentOS install.
>
> I have some additional findings: there's a big number of discarded packets on the switch, on the hypervisor interfaces.
>
> Discards should be OK as far as I know; I hope TCP handles this and does the proper retransmissions, but I wonder whether this may be related or not. Our storage is over NFS. My general expertise is with iSCSI, and I've never seen this kind of issue with iSCSI, not that I'm aware of.
>
> In other clusters I've seen a high number of discards with iSCSI on XenServer 7.2, but there's no corruption on the VMs there...
>
> Thanks,
>
> Sent from my iPhone
>
>> On 29 Nov 2020, at 04:00, Strahil Nikolov <[email protected]> wrote:
>>
>> Are you using the "nobarrier" mount option in the VM?
>>
>> If yes, can you try to remove the "nobarrier" option?
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On Saturday, November 28, 2020, 19:25:48 GMT+2, Vinícius Ferrão <[email protected]> wrote:
>>
>> Hi Strahil,
>>
>> I moved a running VM to the other host, rebooted, and no corruption was found; if there's any corruption it may be silent corruption... I've seen cases where the VM was new, just installed, I ran dnf -y update to get the updated packages, rebooted, and boom: XFS corruption. So perhaps the migration process isn't the one to blame.
>>
>> But, in fact, I remember a VM that went down during a migration, and when I rebooted it, it was corrupted. This may not be related, though; it was perhaps already in an inconsistent state.
>>
>> Anyway, here are the mount options:
>>
>> Host1:
>> 192.168.10.14:/mnt/pool0/ovirt/vm on /rhev/data-center/mnt/192.168.10.14:_mnt_pool0_ovirt_vm type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,soft,nosharecache,proto=tcp,timeo=100,retrans=3,sec=sys,clientaddr=192.168.10.1,local_lock=none,addr=192.168.10.14)
>>
>> Host2:
>> 192.168.10.14:/mnt/pool0/ovirt/vm on /rhev/data-center/mnt/192.168.10.14:_mnt_pool0_ovirt_vm type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,soft,nosharecache,proto=tcp,timeo=100,retrans=3,sec=sys,clientaddr=192.168.10.1,local_lock=none,addr=192.168.10.14)
>>
>> The options are the default ones; I haven't changed anything when configuring this cluster.
>>
>> Thanks.
>>
>> -----Original Message-----
>> From: Strahil Nikolov <[email protected]>
>> Sent: Saturday, November 28, 2020 1:54 PM
>> To: users <[email protected]>; Vinícius Ferrão <[email protected]>
>> Subject: Re: [ovirt-users] Constantly XFS in memory corruption inside VMs
>>
>> Can you try with a test VM, whether this happens after a Virtual Machine migration?
>>
>> What are your mount options for the storage domain?
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On Saturday, November 28, 2020, 18:25:15 GMT+2, Vinícius Ferrão via Users <[email protected]> wrote:
>>
>> Hello,
>>
>> I'm trying to discover why an oVirt 4.4.3 cluster with two hosts and NFS shared storage on TrueNAS 12.0 is constantly getting XFS corruption inside the VMs.
>>
>> For no apparent reason VMs get corrupted: sometimes they halt, sometimes they are silently corrupted, and after a reboot the system is unable to boot due to "corruption of in-memory data detected". Sometimes the corrupted data is "all zeroes", sometimes there's data there. In extreme cases XFS superblock 0 gets corrupted and the system cannot even detect an XFS partition anymore, since the XFS magic number is gone from the first blocks of the virtual disk.
>>
>> This has been happening for a month now. We had to roll back some backups, and I don't trust the state of the VMs anymore.
>>
>> Using xfs_db I can see that some VMs have corrupted superblocks while the VM is up. One in particular was running with sb0 corrupted, so I knew that when a reboot kicked in the machine would be gone, and that's exactly what happened.
>>
>> Another day I was just installing a new CentOS 8 VM, and after running dnf -y update and a reboot the VM was corrupted and needed XFS repair. That was an extreme case.
>>
>> So, I've looked at the TrueNAS logs, and there's apparently nothing wrong with the system. No errors logged in dmesg, nothing in /var/log/messages, and no errors on the zpools, not even after scrub operations. On the switch, a Catalyst 2960X, we've been monitoring all its interfaces. There are no "up and down" events and zero errors on all interfaces (we have a 4x port LACP on the TrueNAS side and a 2x port LACP on each host); everything seems to be fine. The only metric that I was unable to get is "dropped packets", but I don't know if this can be an issue or not.
>>
>> Finally, on oVirt, I can't find anything either. I looked at /var/log/messages and /var/log/sanlock.log but found nothing suspicious.
>>
>> Is anyone out there experiencing this? Our VMs are mainly CentOS 7/8 with XFS; there are 3 Windows VMs that do not seem to be affected, but everything else is affected.
>>
>> Thanks all.

