Hi again,

I attempted to get the debug messages to print by doing:
cd /usr/src/sys/arch/amd64/conf, copying GENERIC.MP to
GENERIC.MP.BIODEBUG, and making this change:

--- GENERIC.MP Wed Feb 5 23:54:35 2025
+++ GENERIC.MP.BIODEBUG Wed Feb 5 15:38:10 2025
@@ -6,4 +6,6 @@
 #option MP_LOCKDEBUG
 #option WITNESS
 
+option SR_DEBUG
+
 cpu* at mainbus?

Then I recompiled the kernel and rebooted. I didn't see any debug
messages related to softraid, even though my system partitions are on a
RAID 1C device.

# bioctl -vi softraid0
Volume      Status               Size Device
softraid0 0 Online      1999861775872 sd5     RAID1C
          0 Online      1999861775872 0:0.0   noencl <sd2a>
                                              'unknown serial'
          1 Online      1999861775872 0:1.0   noencl <sd3a>
                                              'unknown serial'

So I checked out the source in /usr/src/sys/dev. It seems that several
of the RAID disciplines have #ifdef blocks to handle SR_DEBUG, but not
the RAID 1C discipline:

# grep SR_DEBUG softraid*c |sort |uniq
softraid.c:#endif /* SR_DEBUG */
softraid.c:#ifdef SR_DEBUG
softraid_crypto.c:#endif /* SR_DEBUG */
softraid_crypto.c:#ifdef SR_DEBUG0
softraid_raid1.c:#ifdef SR_DEBUG
softraid_raid5.c:#ifdef SR_DEBUG
# grep DEBUG softraid_raid1c.c
#

So I'm thinking that I set the option correctly, but perhaps the
debugging isn't available for RAID 1C?

Also, I got hold of another drive that I can copy the entire original
16TB partition to and run tests against. Is there a procedure where I
could copy out the correct byte range to my new drive with dd and try to
mount it using the simple CRYPTO discipline (bioctl -c C instead of
bioctl -c 1C)?

Thank you,
--James

On Sat, 25 Jan 2025, Stefan Sperling wrote:

> Date: Sat, 25 Jan 2025 23:12:01 +0100
> From: Stefan Sperling <s...@stsp.name>
> To: James Boyle <jbo...@canonic.net>
> Cc: misc@openbsd.org
> Subject: Re: softraid, bioctl -c 1C failed array question
>
> On Fri, Jan 24, 2025 at 02:53:06PM -0500, James Boyle wrote:
> > Hello,
> >
> > I was hoping to get a little help with bioctl and the 1C raid mode after a
> > drive failure. The most recent error message I'm getting when trying to
> > start the array in a degraded mode is:
> > # bioctl -c 1C -l /dev/sd0a softraid0
> > softraid0: RAID 1C requires two or more chunks
> >
> > Previously, the array had two identical Toshiba 16TB drives as sd0 and
> > sd1. The array used partitions sd0a and sd1a.
> > One of those drives, sd1,
> > failed before Christmas. I was able to run the degraded array without
> > issue. After replacing the failed drive, I kicked off a rebuild using
> > bioctl -R. The array came back to the optimal "Online" state. Just a few
> > days ago, the second drive of the original pair failed. I was able to
> > again start the array with only one working drive (sd0 is the failed
> > drive, sd1 is the new drive, sd2 & sd3 are part of another array):
> >
> > # for X in sd{0,1,2,3,4,5,6} ; do bioctl -v ${X} ; done
> > sd0: <ATA, TOSHIBA MG08ACA1, 0102>, serial 71H0A3SWFVGG
> > sd1: <ATA, TOSHIBA MG08ACA1, 0103>, serial 44M0A008FVGG
> > sd2: <ATA, WDC WD2000F9YZ-0, 01.0>, serial WD-WMC160D3WKSS
> > sd3: <ATA, TOSHIBA HDWE150, FP2A>, serial 38EBK7BTF57D
> > Volume      Status               Size Device
> > softraid0 0 Online      1999861775872 sd4     RAID1C
> >           0 Online      1999861775872 0:0.0   noencl <sd2a>
> >                                               'unknown serial'
> >           1 Online      1999861775872 0:1.0   noencl <sd3a>
> >                                               'unknown serial'
> > Volume      Status               Size Device
> > softraid0 1 Degraded   16000895729664 sd5     RAID1C
> >           0 Offline    16000895729664 1:0.0   noencl <sd0a>
> >                                               'unknown serial'
> >           1 Online     16000895729664 1:1.0   noencl <sd1a>
> >                                               'unknown serial'
> >
> > After that I shut the system down, removed the failed drive. When the
> > system started again, what was previously sd1 had been initialized as sd0.
> > The other (boot/system) array started fine. I was unable to start the
> > degraded array. I got the error messages:
> >
> > softraid0: trying to bring up sd5 degraded
> > softraid0: trying to bring up sd5 degraded
> > softraid0: sd5 is offline, will not be brought online
> > softraid0: trying to bring up sd5 degraded
> > softraid0: trying to bring up sd5 degraded
> > softraid0: sd5 is offline, will not be brought online
> > softraid0: RAID 1C requires two or more chunks
> > softraid0: RAID 1C requires two or more chunks
> >
> > At one point I put the failed drive back in to see if it could start.
> > I'm afraid that may have been the wrong thing to do.
>
> Before you removed the above sd0 drive, the state of the working drive
> (then sd1) was "Online".
>
> What is the current state of this working drive? Is it still Online now?
> It doesn't sound like it is. Maybe it's now also in degraded state, for
> example due to a transient write error?
> If it is still in Online state then the above errors look like a bug.
>
> You will not be able to use bioctl to see the current state while the
> volume isn't assembled. But there is the SR_DEBUG kernel option. A kernel
> compiled with this option enabled should eventually print the state into
> dmesg on a line which contains "scm_status".
>
> The volume state values are defined in sys/dev/biovar.h:
>
> #define BIOC_SDONLINE 0x00
> #define BIOC_SDONLINE_S "Online"
> etc.
>
> The on-disk metadata structures can be found in sys/dev/softraidvar.h.
>
> > Is there a way to troubleshoot and restart the array with just the single
> > working drive as a degraded array again?
>
> You'll need at least one chunk in Online state to perform a rebuild and
> rescue the array. Otherwise, it seems the only officially supported way
> out would be to create a fresh volume and restore the data from backup.
>
> If your working drive is really still working, it should be possible
> to extract the data somehow using raw disk reads to obtain an image of
> the filesystem without the softraid metadata headers, and mounting that
> image on a vnd(4) device with vnconfig(8) and then copying the files out
> to a new array. I've never had to try that myself yet, fortunately.
>
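
A rough sketch of the offset arithmetic behind the raw-read idea above,
not a tested recovery procedure. It assumes 512-byte sectors and a
softraid data offset of 528 sectors (SR_DATA_OFFSET as I remember it from
sys/dev/softraidvar.h), both of which should be checked against the
source tree. The device names are hypothetical. The script only prints
the dd command; it does not run it.

```shell
#!/bin/sh
# ASSUMPTIONS (not stated in this thread, verify in sys/dev/softraidvar.h):
#   512-byte sectors, and chunk data starting 528 sectors into the chunk.
sector_size=512
data_offset_sectors=528

# Size bioctl reported for the degraded volume and its chunks.
volume_bytes=16000895729664

# Bytes of softraid metadata and boot area to skip at the start of the chunk.
skip_bytes=$((data_offset_sectors * sector_size))
# Number of sectors in the data area to copy.
volume_sectors=$((volume_bytes / sector_size))

# HYPOTHETICAL device names: surviving chunk and scratch drive.
src=/dev/rsd0a
dst=/dev/rsd6c

echo "skip:  ${skip_bytes} bytes (${data_offset_sectors} sectors)"
echo "count: ${volume_sectors} sectors"
# Print the command rather than running it, so nothing is overwritten.
echo "dd if=${src} of=${dst} bs=${sector_size} skip=${data_offset_sectors} count=${volume_sectors}"
```

With bs=512 a copy of this size would be slow; a larger block size needs
skip and count rescaled to match. Since the volume is RAID 1C, the copied
data area would still be encrypted, so the result is not directly
mountable on vnd(4).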