On Sun, 26 Dec 2021, Philip Guenther wrote:
> Installed snap from Friday on my X1 extreme and it's no longer able to
> resume from hibernation, even when hibernation was done right after
> boot+login, showing 16 "hibernate_block_io open failed" before showing
> unhibernating @ block 47965181 length 6263601556133MB
>
> Unable to resume hibernated image
>
> That length seems completely bogus, of course.
>
> If I'm reading my /var/log/daemon and /var/log/messages correctly, my
> successful resume on Dec 25th was with a kernel I built on Dec 3rd. :-/
Okay, figured it out: it _was_ caused by the change to not attach various
devices when unhibernating. That change altered the device name under which
softraid reattached my encrypted boot volume, from sd3 to sd2:
Dec 31 14:04:36 bleys /bsd: softraid0 at root
Dec 31 14:04:36 bleys /bsd: scsibus4 at softraid0: 256 targets
Dec 31 14:04:36 bleys /bsd: sd2 at scsibus4 targ 1 lun 0: <OPENBSD, SR CRYPTO, 006>
Dec 31 14:04:36 bleys /bsd: sd2: 244197MB, 512 bytes/sector, 500116577 sectors
Dec 31 14:04:36 bleys /bsd: softraid0: volume sd2 is roaming, it used to be sd3, updating metadata
Dec 31 14:04:36 bleys /bsd: root on sd2a (8ddcca7f6e4dca69.a) swap on sd2b dump on sd2b
The bug is that the hibernate resume logic read the signature from the
correct device, but then used the device number recorded in the signature
to try to read the rest of the image:
Dec 31 14:04:36 bleys /bsd: hibernate_block_io open failed
Dec 31 14:04:36 bleys last message repeated 15 times
Dec 31 14:04:36 bleys /bsd: unhibernating @ block 47965181 length 6263601556133MB
Dec 31 14:04:36 bleys /bsd: unhibernating @ block 47965181 length 6263601556133MB
Dec 31 14:04:36 bleys /bsd: Unable to resume hibernated image
The fix is a literal one-liner: use the device we read the signature from
for the entire resume.
Index: kern/subr_hibernate.c
===================================================================
RCS file: /data/src/openbsd/src/sys/kern/subr_hibernate.c,v
retrieving revision 1.129
diff -u -p -r1.129 subr_hibernate.c
--- kern/subr_hibernate.c 31 Aug 2021 14:45:25 -0000 1.129
+++ kern/subr_hibernate.c 1 Jan 2022 05:18:21 -0000
@@ -1173,6 +1173,7 @@ hibernate_resume(void)
 		splx(s);
 		return;
 	}
+	disk_hib.dev = hib.dev;
 
 #ifdef MULTIPROCESSOR
 	/* XXX - if we fail later, we may need to rehatch APs on some archs */
Resume works with that. Well, 'mostly': I've seen a couple of "freed pool
modified" panics during resume, after it's back on the resumed kernel, and
it actually drops into ddb. The second time I at least noted the pool:
dma32768... which makes me think some device isn't being handled correctly
after the "don't attach everyone on unhibernate" change. :-|
I'll try to gather more data, but at least the change above seems clearly
correct.
Philip Guenther