Another feature to look for is spin down of the dedicated hot spare.

Go Vikings :)

Patrick
> On Feb 21, 2016, at 7:23 AM, Marcus MERIGHI <mcmer-open...@tor.at> wrote:
>
> ti...@openmailbox.org (Tinker), 2016.02.20 (Sat) 21:05 (CET):
>> So glad to understand better what's in the box.
>>
>> Also please note that I'm not trying to suggest to implement lots of
>> crap, am perfectly clear that high security is correlated with low
>> complexity.
>>
>> On 2016-02-21 00:29, Marcus MERIGHI wrote:
>>> ti...@openmailbox.org (Tinker), 2016.02.20 (Sat) 16:43 (CET):
>> ..
>>> You appear to mean bioctl(8). That's the only place I could find the word
>>> 'patrol'. bioctl(8) can control more than softraid(4) devices.
>>>
>>> bio(4):
>>>      The following device drivers register with bio for volume
>>>      management:
>>>
>>>            ami(4)        American Megatrends Inc. MegaRAID
>>>                          PATA/SATA/SCSI RAID controller
>>>            arc(4)        Areca Technology Corporation SAS/SATA RAID
>>>                          controller
>>>            cac(4)        Compaq Smart Array 2/3/4 SCSI RAID controller
>>>            ciss(4)       Compaq Smart Array SAS/SATA/SCSI RAID
>>>                          controller
>>>            ips(4)        IBM SATA/SCSI ServeRAID controller
>>>            mfi(4)        LSI Logic & Dell MegaRAID SAS RAID controller
>>>            mpi(4)        LSI Logic Fusion-MPT Message Passing Interface
>>>            mpii(4)       LSI Logic Fusion-MPT Message Passing Interface
>>>                          II
>>>            softraid(4)   Software RAID
>>>
>>> It is talking about controlling a HW RAID controller, in that 'patrol'
>>> paragraph, isn't it?
>>
>> So by this you mean that patrolling is really implemented for
>> softraid??
>
> No, I said the opposite.
>
> I'm sure my English language capabilities are not perfect. But what you
> make of it is really surprising! (And even funny in the cabaret way.)
>
> I'll keep trying. But sooner or later we'll have to take this off list.
> Or to newbies. There you get help from the same people but without
> having your misinterpretations in the 'official' archives for other poor
> souls to find ;-)
>
> http://mailman.theapt.org/listinfo/openbsd-newbies
>
>> (Karel and Constantine don't agree??)
>>
>> So I just do.. "bioctl -t start sdX" where sdX is the name of my softraid
>> device, and it'll do the "scrub" as in reading through all underlying
>
> bioctl(8) is clear, I think:
>      -t patrol-function
>              Control the RAID card's patrol functionality, if
>              supported.  patrol-function may be one of:
>
> Why do you think it will work for softraid(4) when it says it does for
> hardware RAID?
>
> I have a theory: you have some experience with other operating systems
> and their built-in help systems that have led you to not fully read but
> just search/skim for keywords. Do yourself (and me) a favour and read
> them fully. Top to bottom. Take every word as put there thoughtfully,
> not in a hurry. You can find manpage content discussions all over the
> archives. Manpages are taken seriously.
>
> Please repeat: bio(4)/bioctl(8) controls RAID devices. These can be in
> hardware or software. Some functions (-a, -b, -H, -t, -u) are only
> usable/useful when controlling a hardware RAID. The manpage even gives
> direct clues on whether hardware or software RAID is the topic. First
> synopsis, second synopsis. 'The options for RAID controllers are as
> follows:' (=hardware) 'In addition to the relevant options listed above,
> the options for softraid(4) devices are as follows:' (=software).
> Did you note the 'relevant' part? That word is there on purpose, I
> suppose. It is there to tell you that not all, but the relevant parts of
> the hardware RAID parameters also apply to software RAID (that comes
> below).
> I would consider '-v' relevant, '-a' ('Control the RAID card's
> alarm functionality, if supported') not.
>
> (Example: what '-a' does for hardware RAID can be done with sensorsd(8)
> for software RAID (=softraid(4)). Once a softraid volume is configured,
> you get 'hw.sensors.softraid0.drive0=online (sd1), OK'.
> Try 'sysctl hw.sensors.softraid0'.)
>
>> physical media to check its internal integrity so for RAID1C that will be
>> data readability and that checksums are correct, and "doas bioctl softraid0"
>> will show me the % status, and if I don't get any errors before it goes back
>> to normal it means the patrol was successful right?
>
> No idea, never had a hardware RAID controller.
>
>> (And as usual patrol is implemented to have the lowest priority, so it
>> should not interfere extremely much with ordinary SSD softraid operation.)
>
> I think the patrolling is done by the hardware RAID controller.
> bioctl(8) just commands it to do so.
>
>>>> * Rebuild - I think I saw some console dump of the status of a rebuild
>>>> process on the net, so MAYBE or NO..?
>>>
>>> That's what it looks like:
>>>
>>> $ doas bioctl softraid0
>>>     Volume      Status                Size Device
>>>  softraid0 0 Rebuild      12002360033280 sd6     RAID5 35% done
>>>           0 Rebuild       4000786726912 0:0.0   noencl <sd2a>
>>>           1 Online        4000786726912 0:1.0   noencl <sd3a>
>>>           2 Online        4000786726912 0:2.0   noencl <sd4a>
>>>           3 Online        4000786726912 0:3.0   noencl <sd5a>
>>
>> Yey!!
>>
>> Wait, can you explain to me what I would write instead of "device" and
>> "channel:target[.lun]" in "bioctl -R device" and "bioctl -R
>> channel:target[.lun]", AND what effect those would have?
>
> The above rebuild was started with:
> $ bioctl -R /dev/sd2a sd6
>                       ^^^=RAID volume
>             ^^^^^^^^^=replacement chunk
>
> Sidenote:
> In fact it was started as 'bioctl -R /dev/sd3a sd7'; I did a reboot in
> between, ordering of the disk devices changed but the rebuild continued
> flawlessly.
>
>> Say that my sd0 and sd1 SSDs run a RAID1C already, can I then make softraid
>
> On an 'OpenBSD 5.9 (GENERIC.MP) #1870: Mon Feb  8 17:34:23 MST 2016',
> from snapshots, bioctl(8) says:
>      Valid raidlevels are:
>            0       RAID 0: A striping discipline.
>            1       RAID 1: A mirroring discipline.
>            5       RAID 5: A striping discipline with floating parity
>                    chunk.
>            C       CRYPTO: An encrypting discipline.
>            c       CONCAT: A concatenating discipline.
>
> What is that 'RAID1C' thing you keep talking about?
>
>> extend my RAID1C with my sd2 SSD by "rebuilding" it, as a way to live-copy
>> in all my data to sd2, so this would work as a kind of live attach even if
>> expensive?
>
> If your sd0 or sd1 fails you can replace them in hardware with sd2 or
> have sd2 already plugged in and start a rebuild as shown above.
>
> There are no bioctl(8) parameters for modifying an existing volume.
> Just for rebuilding (-R) and failing (-O).
>
>> Does it work for a softraid that's live already?
>
> A softraid(4) disk is 'just another disk'(tm). Nothing special. You can
> growfs(8) and tunefs(8), I suppose. And you can restore from backups
> after you had to do bigger changes than can be done with these utilities.
>
> No parameters that indicate 'modify' or 'edit' or 'append'. Just that
> '-l' to list the chunks for creating a volume.
>
> I'd suggest just playing with it. If you have no real disks for that,
> take a look at vnconfig(8) and vnd(4); use two of these as chunks for
> a softraid volume. Warning: I have not tested this (vnd+softraid).
>
> Then try to extend, append, enlarge, shrink, whatever.
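>
> (A rough, equally untested sketch of that vnd(4) experiment; image
> files, sizes and device names below are just examples:)
>
> # dd if=/dev/zero of=/var/tmp/chunk0.img bs=1m count=512
> # dd if=/dev/zero of=/var/tmp/chunk1.img bs=1m count=512
> # vnconfig vnd0 /var/tmp/chunk0.img
> # vnconfig vnd1 /var/tmp/chunk1.img
> # fdisk -iy vnd0 && fdisk -iy vnd1
> # disklabel -E vnd0       (add an 'a' partition with FS type "RAID")
> # disklabel -E vnd1       (same here)
> # bioctl -c 1 -l /dev/vnd0a,/dev/vnd1a softraid0
>
> A new sdN volume should show up in dmesg; then you can practise
> 'bioctl -O', 'bioctl -R' and 'bioctl -d' on it without risking real
> disks.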
>
>>>> * Hotspare - MAYBE, "man softraid" says "Currently there is no automated
>>>> mechanism to recover from failed disks.", but that is not so specific
>>>> wording, and I think I read a hint somewhere that there is hotspare
>>>> functionality.
>>>
>>> bioctl(8)
>>>      -H channel:target[.lun]
>>>              If the device at channel:target[.lun] is currently marked
>>>              ``Unused'', promote it to being a ``Hot Spare''.
>>>
>>> That's the only mention of 'hot spare'. And again talking about
>>> controlling a hardware RAID controller, isn't it?
>>>
>>> What is 'not so specific' about 'no' (as in "Currently there is *no*
>>> automated mechanism to recover from failed disks")?
>>
>> Awesome.
>>
>> I guess "bioctl softraid0" will list which hotspares there are currently,
>> and that "-d" will drop a hotspare.
>
> There are no hot spares as seen from bioctl(8). You, the operator, know
> that disk sdXYZ is a hot spare, sitting there plugged in but idle. Then,
> when one of your chunks fails, you do bioctl -R sdXYZa sdABC.
> sdXYZa = your "hot spare": just a disk that is already connected to the
> system but not used
> sdABC = your softraid volume (your RAID1C, whatever that is)
>
>> The fact that there is hotspare functionality,
>
> How come you think so?
>
>> means that there are cases when softraid will take a disk out of use.
>
> I do not get the connection from a wrong assumption to the above
> statement, but:
>
> Yes, there are cases when softraid will take a disk out of use. It looks
> somewhat like this:
>
> $ doas bioctl softraid0
>     Volume      Status                Size Device
>  softraid0 0 Degraded     12002360033280 sd6     RAID5
>           0 Offline       4000786726912 0:0.0
>           1 Online        4000786726912 0:1.0   noencl <sd3a>
>           2 Online        4000786726912 0:2.0   noencl <sd4a>
>           3 Online        4000786726912 0:3.0   noencl <sd5a>
>
>> That will be when that disk reports itself as COMPLETELY out of use ALL BY
>> ITSELF, such as self-detaching itself on the level of the SATA controller or
>> reporting failure via some SMART command?
>
> The reasons why *my* softraid RAID5 went degraded are not clear to me but
> it is documented on bugs@. A block could not be read, kernel panic.
> Reboot, rebuild, ...
>
>> A disk just half-breaking with broken sectors and 99% IO slowdown will not
>> cause it to go offline though so I guess I should buy enterprise drives with
>> IO access time guarantees then.
>
> Listen to nick@ (Nick Holland). Search for his older posts, too!
>
> Are you, for instance, sure your motherboard and all other parts in the
> way can handle a disk that just spins down and disconnects from the
> SATA/whatever bus?
>
> Or are you going to have to deal with a kernel panic anyways?
>
>>>> * Hotswap - MAYBE, this would depend on if there's rebuild. Only disconnect
>>>> ("bioctl -O" I think; "bioctl -d" is to.. unmount or self-destruct a
>>>> softraid?)
>>>
>>> bioctl -O should fail the chunk specified, simulating hardware failure.
>>> After this command you have an 'Offline' chunk in the 'bioctl' output.
>>>
>>> bioctl -d 'detach', not 'destroy'; just as sdX appears when you assemble
>>> a softraid volume, this makes it go away. Better unmount before...
>>
>> So "-d" is to take down a whole softraid. "-O" could work to take out a
>> single physical disk but it's unclean.
>
> Please get used to - and use - the terms: -d takes the *volume* down. -O fails
> a *chunk*.
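>
> (To make that distinction concrete, a hedged example with placeholder
> names only: sd5 being the softraid volume, sd2a one of its chunks.)
>
> # bioctl -O sd2a sd5     <- fail one *chunk*; the volume goes 'Degraded'
> # bioctl -d sd5          <- detach the whole *volume* (unmount it first)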
>
>> So then, there is a very unrefined hotswapping functionality in that "-O"
>> can be used to take out a single physical drive, and "-R" (if I understood
>> it correctly above) can be used to plug in a drive.
>
> You are using unusual terms ('take out a single physical drive' vs. 'Set
> the state of device or channel:target[.lun] to offline') but basically
> '-O' => offline, '-R' => rebuild => online.
>
>> Preferable would be to "hotswap" the whole softraid by simply taking it
>> offline altogether ("bioctl -d [raid dev]") and then taking it online
>                                  ^^^^^^^^^^ = volume
>> altogether ("bioctl -c 1 -l [comma-separated devs] softraid0")
>                              ^^^^ = chunks
>
> What has happened after taking it offline and bringing it back up? Have
> you swapped chunks? Broken one taken out, good replacement shoved in?
> How is bio(4) supposed to know?
>
> What I think should happen:
> - you have a RAID1 volume sd3
> - assembled from the chunks sd1a and sd2a
> - you notice 'something is wrong', e.g. clicking sounds coming from sd2
> - you do 'bioctl -O sd2a sd3' or, if your hardware allows, just pull out
>   sd2.
> - you replace the failed chunk: either you have a 'hot spare' already
>   plugged in and waiting; or if your hardware allows, you just shove it
>   in; or, as in my case, you shut the system down, replace the disk and
>   restart.
> - In case of reboot your RAID1 volume comes up degraded. In all other
>   cases it just stays degraded. In any case you pray your remaining disk
>   keeps working.
> - the replacement disk shows up as sd4 (for whatever reason, maybe you
>   left the failed one connected)
> - you do all the setup for the new disk (see softraid(4) -> EXAMPLES)
> - you do 'bioctl -R sd4a sd3', rebuild starts.
>
>>>> The man pages are sometimes over-minimalistic with respect to an individual
>>>> user who's trying to learn, this is why I'm asking for your clarification.
>>>
>>> I am quite sure the man pages are kept as condensed as they are on
>>> purpose.
>>>
>>> You can always read mplayer(1) if you want something lengthy ;-)
>>>
>>>> So your clarifications would still be much appreciated.
>>>
>>> Nothing authoritative from me!
>>> I am just trying to flatten your learning curve.
>>
>> Awesome. Thank you so much!
>
> It's taken me over an hour on a rainy Sunday to answer; please use at
> least the same amount of time on investigating before answering.
>
> Bye, Marcus
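
For the 'setup for the new disk' step in that walkthrough, the softraid(4)
EXAMPLES section shows the usual pattern; roughly (untested here, device
names are simply the ones from the walkthrough above):

  # fdisk -iy sd4
  # disklabel -E sd4      (add an 'a' partition with FS type "RAID")
  # bioctl -R sd4a sd3

After that, 'bioctl softraid0' should show the rebuild progress ('% done')
until the volume is Online again.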