I built a 2-node cluster to test HAST. Each node is an older HP server with 6 SCSI disks. Each disk is configured as a single-disk RAID 0 volume in the RAID controller, since I wanted JBOD presented to FreeBSD 9.1 x86. I allocated one disk for the OS and the other 5 disks for HAST.
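The HAST layer is defined along these lines (a trimmed sketch of /etc/hast.conf, one resource per disk; node1/node2 match the rest of this mail, but the /dev/daN paths are placeholders, not my exact config):

    resource disk1 {
            on node1 {
                    local /dev/da1
                    remote node2
            }
            on node2 {
                    local /dev/da1
                    remote node1
            }
    }
    # disk2 through disk5 repeat the pattern with /dev/da2 - /dev/da5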
node2# zpool status
  pool: scsi-san
 state: ONLINE
  scan: scrub repaired 0 in 0h27m with 0 errors on Tue Feb 19 17:38:55 2013
config:

        NAME            STATE     READ WRITE CKSUM
        scsi-san        ONLINE       0     0     0
          raidz1-0      ONLINE       0     0     0
            hast/disk1  ONLINE       0     0     0
            hast/disk2  ONLINE       0     0     0
            hast/disk3  ONLINE       0     0     0
            hast/disk4  ONLINE       0     0     0
            hast/disk5  ONLINE       0     0     0

  pool: zroot
 state: ONLINE
  scan: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        zroot        ONLINE       0     0     0
          gpt/disk0  ONLINE       0     0     0

Yesterday I physically pulled disk2 (from node1) to simulate a failure.
ZFS didn't see anything wrong, as expected. hastd did see the problem,
also as expected. 'hastctl status' didn't show anything unusual or
indicate any problem that I could see on either node; I only saw hastd
reporting problems in the logs, and otherwise everything looked fine.
Is there a way to detect a failed disk from hastd besides the log?
camcontrol showed the disk had failed, and obviously I'll be monitoring
with it as well (rough cron check sketched in the P.S. below).

For recovery I installed a new disk in the same slot. To protect data
reliability, the safest way I can think of to recover is the following
(typed out in full in the P.S.):

1 - node1 - stop the apps
2 - node1 - export the pool
3 - node1 - hastctl create disk2
4 - node1 - for D in 1 2 3 4 5; do hastctl role secondary disk$D; done
5 - node2 - for D in 1 2 3 4 5; do hastctl role primary disk$D; done
6 - node2 - import the pool
7 - node2 - start the apps

At step 5 hastd will start to resynchronize node2:disk2 -> node1:disk2.

I've been trying to think of a way to re-establish the mirror without
having to restart/move the pool _and_ without posing additional risk of
data loss. To avoid an application outage, I suppose the following
would work (also sketched in the P.S.):

1 - insert the new disk in node1
2 - hastctl role init disk2
3 - hastctl create disk2
4 - hastctl role primary disk2

At that point ZFS would see the disk failure and start resilvering the
pool. No application outage, but for the duration only 4 disks contain
the data (assuming the bits on the pool are changing, not static
content). With the previous steps there is an application outage, but a
healthy pool is maintained throughout. Is there another scenario I'm
not thinking of where both data health and no application outage could
be achieved?

Regards,
Chad
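P.S. Here is the kind of cron check I had in mind for catching a dead
member disk, since hastctl didn't flag it. It's only a sketch: the
da1-da5 names and the log path are placeholder assumptions, and
grepping syslog for hastd errors is just a guess at the simplest thing
that would have caught yesterday's failure.

    #!/bin/sh
    # Periodic sanity check for the HAST member disks (sketch).
    # da1-da5 are placeholders for the five disks backing hast/disk1-5.
    for dev in da1 da2 da3 da4 da5; do
        # camcontrol devlist shows every device the SCSI layer can see;
        # a pulled or dead disk simply disappears from the output.
        if ! camcontrol devlist | grep -qw "$dev"; then
            logger -p user.crit "HAST member disk $dev missing from bus"
        fi
    done
    # hastd reports I/O errors through syslog, so surface recent ones.
    if tail -n 200 /var/log/messages | grep -qi 'hastd.*error'; then
        logger -p user.crit "hastd errors in /var/log/messages"
    fi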
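The seven failover steps from above, as I'd actually type them.
scsi-san is the real pool name; "myapp" stands in for whatever rc.d
services run against the pool.

    # --- on node1 ---
    service myapp stop                # 1 - stop the apps
    zpool export scsi-san             # 2 - export the pool
    hastctl create disk2              # 3 - fresh HAST metadata on the new disk
    for D in 1 2 3 4 5; do            # 4 - demote every resource
        hastctl role secondary disk$D
    done

    # --- then on node2 ---
    for D in 1 2 3 4 5; do            # 5 - promote every resource;
        hastctl role primary disk$D   #     disk2 resyncs node2 -> node1
    done
    zpool import scsi-san             # 6 - import the pool
    service myapp start               # 7 - start the apps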
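And the no-outage variant, on node1 only, covering steps 2-4 after the
new disk is seated. Whether ZFS re-adopts the returning hast/disk2 by
itself is exactly the part I'm unsure about, so the final zpool online
is an assumption rather than something I've verified.

    # --- on node1; pool stays imported, apps keep running ---
    hastctl role init disk2       # 2 - drop the resource to init
    hastctl create disk2          # 3 - write fresh HAST metadata
    hastctl role primary disk2    # 4 - /dev/hast/disk2 reappears
    # Assumption: ZFS may need a nudge before it starts resilvering.
    zpool online scsi-san hast/disk2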