Hello,
Thanks to all who have replied to this thread so far - my apologies for
taking so long to get back to you all.
In terms of where I'm seeing the EXT4 errors, they are showing up in the
kernel log on the node itself, so the output of 'dmesg' regularly shows
entries such as these:
[375095.199203] EXT4-fs (ploop43209p1): Remounting filesystem read-only
[375095.199267] EXT4-fs error (device ploop43209p1) in ext4_ext_remove_space:3073: IO failure
[375095.199400] EXT4-fs error (device ploop43209p1) in ext4_ext_truncate:4692: IO failure
[375095.199517] EXT4-fs error (device ploop43209p1) in ext4_reserve_inode_write:5358: Journal has aborted
[375095.199637] EXT4-fs error (device ploop43209p1) in ext4_truncate:4145: Journal has aborted
[375095.199779] EXT4-fs error (device ploop43209p1) in ext4_reserve_inode_write:5358: Journal has aborted
[375095.199957] EXT4-fs error (device ploop43209p1) in ext4_orphan_del:2731: Journal has aborted
[375095.200138] EXT4-fs error (device ploop43209p1) in ext4_reserve_inode_write:5358: Journal has aborted
[461642.709690] EXT4-fs (ploop43209p1): error count since last fsck: 8
[461642.709702] EXT4-fs (ploop43209p1): initial error at time 1593576601: ext4_ext_remove_space:3000: inode 136354
[461642.709708] EXT4-fs (ploop43209p1): last error at time 1593576601: ext4_reserve_inode_write:5358: inode 136354
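For anyone wanting to line these errors up against a migration timeline, the
"initial error at time" and "last error at time" values can be turned into
readable dates. The following is only a minimal sketch: it assumes the values
are seconds since the Unix epoch, and the names ERROR_TIME and error_times
are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Matches the "initial error" / "last error" summary lines shown above.
ERROR_TIME = re.compile(
    r"EXT4-fs \((?P<device>[^)]+)\): "
    r"(?P<which>initial|last) error at time (?P<epoch>\d+):"
)


def error_times(dmesg_text):
    """Return (device, 'initial' or 'last', UTC datetime) per summary line."""
    results = []
    for match in ERROR_TIME.finditer(dmesg_text):
        # Assumption: the logged value is seconds since the Unix epoch.
        when = datetime.fromtimestamp(int(match.group("epoch")), tz=timezone.utc)
        results.append((match.group("device"), match.group("which"), when))
    return results


# Two of the lines quoted above, with the mail client's line wrapping undone.
sample = (
    "[461642.709702] EXT4-fs (ploop43209p1): initial error at time 1593576601: "
    "ext4_ext_remove_space:3000: inode 136354\n"
    "[461642.709708] EXT4-fs (ploop43209p1): last error at time 1593576601: "
    "ext4_reserve_inode_write:5358: inode 136354\n"
)

for device, which, when in error_times(sample):
    print(device, which, when.isoformat())
```

Under that assumption, 1593576601 corresponds to 2020-07-01 04:10:01 UTC, which
can then be compared against the times the containers were migrated.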
Inside the container itself, not much is being logged, since the affected
container in this particular instance is indeed (as per the errors above)
mounted read-only due to the errors its root.hdd filesystem is
experiencing.
Having dug a bit more into what happened here, I suspect that this
corruption may have come about while the containers were being moved
between the live node and the standby node (in one direction or the
other), but I can't be 100% sure of that.
The picture is further muddied in that the standby node (the node that we
used for evacuating containers from the node to be updated) was itself
initially updated to 7.0.14 (135). However, the live node (which was
updated a short time after the standby node) appears to have got 7.0.14
(136). So I don't know if the issue was in fact with 7.0.14 (135) (which
was on the standby node, where the containers would have been moved to,
and moved back from), or with 7.0.14 (136) on the live node. Were there any
known issues with 7.0.14 (135) that might correlate with what I'm seeing
above?
Anyway, once again, thanks to everyone who has replied so far. If anyone
has any further questions or would like any further information, please
let me know and I will be happy to assist.
Thank you,
Kevin Drysdale.
On Thu, 2 Jul 2020, Jehan PROCACCIA wrote:
Yes, you are right. I do get the same virtuozzo-release as mentioned in the
initial subject; sorry for the noise.
# cat /etc/virtuozzo-release
OpenVZ release 7.0.14 (136)
But anyway, I don't see any ploop / fsck errors in the host's
/var/log/vzctl.log or inside the CT. Where did you see those errors?
Jehan.
_____________________________________________________________________________________________________________________________________________________
From: "jjs - mainphrame" <j...@mainphrame.com>
To: "OpenVZ users" <users@openvz.org>
Sent: Thursday 2 July 2020 19:33:23
Subject: Re: [Users] Issues after updating to 7.0.14 (136)
Thanks for that sanity check, the conundrum is resolved. vzlinux-release and
virtuozzo-release are indeed different things.
Jake
On Thu, Jul 2, 2020 at 10:27 AM Jonathan Wright <jonat...@knownhost.com> wrote:
/etc/redhat-release and /etc/virtuozzo-release are two different things.
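Since the confusion in this thread came from comparing version strings taken
from different release files, here is a small sketch of parsing the
"<product> release <version> (<build>)" lines quoted in these messages so that
only like-for-like products get compared. The helper name parse_release is
hypothetical, and the format is inferred purely from the strings shown in this
thread.

```python
import re

# Format inferred from lines such as "OpenVZ release 7.0.14 (136)".
RELEASE = re.compile(
    r"^(?P<product>.+?) release (?P<version>\d+(?:\.\d+)*) \((?P<build>\d+)\)$"
)


def parse_release(line):
    """Split a release line into (product, version tuple, build number)."""
    match = RELEASE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised release line: {line!r}")
    version = tuple(int(part) for part in match.group("version").split("."))
    return match.group("product"), version, int(match.group("build"))


openvz = parse_release("OpenVZ release 7.0.14 (136)")
vzlinux = parse_release("Virtuozzo Linux release 7.8.0 (609)")

# Version numbers are only meaningful to compare within the same product.
if openvz[0] != vzlinux[0]:
    print("different products, versions are not comparable:", openvz[0], "vs", vzlinux[0])
```

Comparing the tuples from two "OpenVZ release" lines (for example builds 135
and 136 of 7.0.14) is meaningful; comparing an OpenVZ line with a Virtuozzo
Linux line is not.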
On 7/2/20 12:16 PM, jjs - mainphrame wrote:
Jehan -
I get the same output here -
[root@annie ~]# yum repolist |grep virt
virtuozzolinux-base      VirtuozzoLinux Base       15,415+189
virtuozzolinux-updates   VirtuozzoLinux Updates    0
I'm baffled as to how you're on 7.8.0 while I'm at 7.0.15 even though I'm
fully up to date.
# uname -a
Linux annie.ufcfan.org 3.10.0-1127.8.2.vz7.151.10 #1 SMP Mon Jun 1 19:05:52 MSK 2020 x86_64 x86_64 x86_64 GNU/Linux
Jake
On Thu, Jul 2, 2020 at 10:08 AM Jehan PROCACCIA <jehan.procac...@imtbs-tsp.eu>
wrote:
No factory, just the virtuozzolinux-base and openvz-os repos:
# yum repolist |grep virt
virtuozzolinux-base VirtuozzoLinux Base 15 415+189
virtuozzolinux-updates VirtuozzoLinux Updates 0
Jehan.
_____________________________________________________________________________________________________________________________________________________
From: "jjs - mainphrame" <j...@mainphrame.com>
To: "OpenVZ users" <users@openvz.org>
Cc: "Kevin Drysdale" <kevin.drysd...@iomart.com>
Sent: Thursday 2 July 2020 18:22:33
Subject: Re: [Users] Issues after updating to 7.0.14 (136)
Jehan, are you running factory?
My ovz hosts are up to date, and I see:
[root@annie ~]# cat /etc/virtuozzo-release
OpenVZ release 7.0.15 (222)
Jake
On Thu, Jul 2, 2020 at 9:08 AM Jehan Procaccia IMT
<jehan.procac...@imtbs-tsp.eu> wrote:
"updating to 7.0.14 (136)" !?
I did an update yesterday, and I am far beyond that version:
# cat /etc/vzlinux-release
Virtuozzo Linux release 7.8.0 (609)
# uname -a
Linux localhost 3.10.0-1127.8.2.vz7.151.14 #1 SMP Tue Jun 9 12:58:54 MSK 2020 x86_64 x86_64 x86_64 GNU/Linux
Why don't you try updating to the latest version?
On 29/06/2020 at 12:30, Kevin Drysdale wrote:
Hello,
After updating one of our OpenVZ VPS hosting nodes at the end of last
week, we've started to have issues with corruption apparently occurring
inside containers. Issues of this nature have never affected the node
previously, and there do not appear to be any hardware issues that could
explain this.
Specifically, a few hours after updating, we began to see containers
experiencing errors such as this in the logs:
[90471.678994] EXT4-fs (ploop35454p1): error count since last fsck: 25
[90471.679022] EXT4-fs (ploop35454p1): initial error at time 1593205255: ext4_ext_find_extent:904: inode 136399
[90471.679030] EXT4-fs (ploop35454p1): last error at time 1593232922: ext4_ext_find_extent:904: inode 136399
[95189.954569] EXT4-fs (ploop42983p1): error count since last fsck: 67
[95189.954582] EXT4-fs (ploop42983p1): initial error at time 1593210174: htree_dirblock_to_tree:918: inode 926441: block 3683060
[95189.954589] EXT4-fs (ploop42983p1): last error at time 1593276902: ext4_iget:4435: inode 1849777
[95714.207432] EXT4-fs (ploop60706p1): error count since last fsck: 42
[95714.207447] EXT4-fs (ploop60706p1): initial error at time 1593210489: ext4_ext_find_extent:904: inode 136272
[95714.207452] EXT4-fs (ploop60706p1): last error at time 1593231063: ext4_ext_find_extent:904: inode 136272
Shutting the containers down and manually mounting and e2fsck'ing their
filesystems did clear these errors, but each of the containers (which were
mostly used for running Plesk) had widespread issues with corrupt or
missing files after the fsck's completed, necessitating their being
restored from backup.
Concurrently, we also began to see messages like this appearing in
/var/log/vzctl.log, which again have never appeared at any point prior to
this update being installed:
/var/log/vzctl.log:2020-06-26T21:05:19+0100 : Error in fill_hole (check.c:240): Warning: ploop image '/vz/private/8288448/root.hdd/root.hds' is sparse
/var/log/vzctl.log:2020-06-26T21:09:41+0100 : Error in fill_hole (check.c:240): Warning: ploop image '/vz/private/8288450/root.hdd/root.hds' is sparse
/var/log/vzctl.log:2020-06-26T21:16:22+0100 : Error in fill_hole (check.c:240): Warning: ploop image '/vz/private/8288451/root.hdd/root.hds' is sparse
/var/log/vzctl.log:2020-06-26T21:19:57+0100 : Error in fill_hole (check.c:240): Warning: ploop image '/vz/private/8288452/root.hdd/root.hds' is sparse
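To get a quick list of which images have triggered this warning, and when, the
log can be filtered with something like the sketch below. The pattern is
derived only from the four lines quoted above, so it may not cover other
variants of the message; SPARSE_WARNING and sparse_warnings are names invented
for this example.

```python
import re

# Pattern taken from the vzctl.log lines quoted above; other message variants
# may exist and would not be matched.
SPARSE_WARNING = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\S+) : Error in fill_hole \(check\.c:\d+\): "
    r"Warning: ploop image '(?P<image>[^']+)' is sparse"
)


def sparse_warnings(log_text):
    """Return (timestamp, image path) for each sparse-image warning found."""
    return [
        (match.group("timestamp"), match.group("image"))
        for match in SPARSE_WARNING.finditer(log_text)
    ]


# Two of the lines quoted above, with the mail client's line wrapping undone.
sample = (
    "/var/log/vzctl.log:2020-06-26T21:05:19+0100 : Error in fill_hole "
    "(check.c:240): Warning: ploop image "
    "'/vz/private/8288448/root.hdd/root.hds' is sparse\n"
    "/var/log/vzctl.log:2020-06-26T21:09:41+0100 : Error in fill_hole "
    "(check.c:240): Warning: ploop image "
    "'/vz/private/8288450/root.hdd/root.hds' is sparse\n"
)

for timestamp, image in sparse_warnings(sample):
    print(timestamp, image)
```

The same function works on the raw contents of /var/log/vzctl.log, since the
pattern does not depend on the "/var/log/vzctl.log:" prefix that grep adds.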
The basic procedure we follow when updating our nodes is as follows:
1. Update the standby node we keep spare for this process
2. vzmigrate all containers from the live node being updated to the
standby node
3. Update the live node
4. Reboot the live node
5. vzmigrate the containers from the standby node back to the live node
they originally came from
So the only tool which has been used to affect these containers is
'vzmigrate' itself, and I'm at something of a loss as to how to explain
the root.hdd images for these containers containing sparse gaps. We have
never created sparse images ourselves, as we have always been aware that
OpenVZ does not support their use inside a container's hard drive image.
And the fact that these images have suddenly become sparse at the same
time they have started to exhibit filesystem corruption is somewhat
concerning.
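As a rough, independent cross-check of whether an image file has holes, one
can compare its apparent size with the space actually allocated on disk. This
is only a heuristic sketch for Linux, and is not how ploop's own check works
(I have not looked at what fill_hole does); the function names are made up
for this example, and filesystems with compression would give misleading
results.

```python
import os


def allocated_bytes(stat_result):
    """Bytes actually allocated on disk for a file."""
    # On Linux, st_blocks is counted in 512-byte units regardless of the
    # filesystem's own block size.
    return stat_result.st_blocks * 512


def looks_sparse(size_bytes, allocated):
    """Heuristic: a file with less space allocated than its size has holes."""
    return allocated < size_bytes


def image_looks_sparse(path):
    """Apply the heuristic to a file on disk, e.g. a root.hds image."""
    st = os.stat(path)
    return looks_sparse(st.st_size, allocated_bytes(st))


# A 1 MiB file with only 4 KiB allocated would be reported as sparse,
# while a fully allocated 4 KiB file would not.
print(looks_sparse(1024 * 1024, 4096))
print(looks_sparse(4096, 4096))
```

Running image_looks_sparse() against an image before and after a migration
would show whether the holes appear as a result of the move, although that is
a suggestion for investigation rather than something I have verified.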
We can restore all affected containers from backups, but I wanted to get
in touch with the list to see if anyone else at any other site has
experienced these or similar issues after applying the 7.0.14 (136)
update.
Thank you,
Kevin Drysdale.
_______________________________________________
Users mailing list
Users@openvz.org
https://lists.openvz.org/mailman/listinfo/users
--
Jonathan Wright
KnownHost, LLC
https://www.knownhost.com