On Sun, 26 Dec 2021, Philip Guenther wrote:
> Installed snap from Friday on my X1 extreme and it's no longer able to
> resume from hibernation, even when hibernation was done right after
> boot+login, showing 16 "hibernate_block_io open failed" before showing
> unhibernating @ block 47965181 length 6263601556133MB
>
> Unable to resume hibernated image
>
> That length seems completely bogus, of course.
>
> If I'm reading my /var/log/daemon and /var/log/messages correctly, my
> successful resume on Dec 25th was with a kernel I built on Dec 3rd. :-/
Okay, figured it out: it _was_ caused by the change to not attach various
devices when unhibernating. That change altered the device name under which
softraid reattached my encrypted boot volume, from sd3 to sd2:
Dec 31 14:04:36 bleys /bsd: softraid0 at root
Dec 31 14:04:36 bleys /bsd: scsibus4 at softraid0: 256 targets
Dec 31 14:04:36 bleys /bsd: sd2 at scsibus4 targ 1 lun 0: <OPENBSD, SR CRYPTO, 006>
Dec 31 14:04:36 bleys /bsd: sd2: 244197MB, 512 bytes/sector, 500116577 sectors
Dec 31 14:04:36 bleys /bsd: softraid0: volume sd2 is roaming, it used to be sd3, updating metadata
Dec 31 14:04:36 bleys /bsd: root on sd2a (8ddcca7f6e4dca69.a) swap on sd2b dump on sd2b
The bug is that the hibernate resume logic read the signature from the
correct device, but then used the device number recorded in the signature
to try to read the rest of the image:
Dec 31 14:04:36 bleys /bsd: hibernate_block_io open failed
Dec 31 14:04:36 bleys last message repeated 15 times
Dec 31 14:04:36 bleys /bsd: unhibernating @ block 47965181 length 6263601556133MB
Dec 31 14:04:36 bleys /bsd: unhibernating @ block 47965181 length 6263601556133MB
Dec 31 14:04:36 bleys /bsd: Unable to resume hibernated image
The fix is a literal one-liner: use the device we read the signature from
for the entire resume.
Index: kern/subr_hibernate.c
===================================================================
RCS file: /data/src/openbsd/src/sys/kern/subr_hibernate.c,v
retrieving revision 1.129
diff -u -p -r1.129 subr_hibernate.c
--- kern/subr_hibernate.c 31 Aug 2021 14:45:25 -0000 1.129
+++ kern/subr_hibernate.c 1 Jan 2022 05:18:21 -0000
@@ -1173,6 +1173,7 @@ hibernate_resume(void)
 		splx(s);
 		return;
 	}
+	disk_hib.dev = hib.dev;
 
 #ifdef MULTIPROCESSOR
 	/* XXX - if we fail later, we may need to rehatch APs on some archs */
Resume works with that. Well, 'mostly': I've seen a couple of "freed pool
modified" panics during resume, after it's back on the resumed kernel, and
it actually drops into ddb. The second time I at least noted the pool:
dma32768... which makes me think some device isn't being handled correctly
after the "don't attach everyone on unhibernate" change. :-|
I'll try to gather more data, but at least the change above seems clearly
correct.
Philip Guenther