Hello everybody, I just wanted to share my experience with a (partially) broken SSD that was in use in a ZIL mirror.
We experienced a dramatic performance problem with one of our zpools, which serves home directories. Mainly NFS clients were affected, and our SunRay infrastructure came to a complete halt.

Finally we were able to identify one SSD as the root cause. The SSD was still working, but it was quite slow. The issue didn't trigger ZFS to mark the disk as faulty, and FMA didn't detect it either. We identified the broken disk by issuing "iostat -en". After replacing the SSD, everything went back to normal.

To prevent outages like this in the future I hacked together a "quick and dirty" bash script that detects disks whose total error count exceeds a given limit. The script might be used in conjunction with nagios. Perhaps it's of use for others as well:

###################################################################
#!/bin/bash

# Check the disks in all pools for errors.
# Partially failing (or slow) disks
# may result in horribly degraded
# performance of zpools despite the fact
# that the pool is still healthy.

# exit codes
# 0 OK
# 1 WARNING
# 2 CRITICAL
# 3 UNKNOWN

OUTPUT=""
WARNING="0"
CRITICAL="0"
SOFTLIMIT="5"
HARDLIMIT="20"

# Device names as listed by zpool status (adjust the pattern to your naming)
LIST=$(zpool status | grep "c[1-9].*d0 " | awk '{print $1}')

if [[ -z "$LIST" ]]
then
    echo "UNKNOWN: No disks found in zpool status output"
    exit 3
fi

for DISK in $LIST
do
    # Field 4 of the comma-separated iostat output is the total error count
    ERROR=$(iostat -enr "$DISK" | cut -d "," -f 4 | grep "^[0-9]")
    # Test the hard limit first so that each disk is reported only once
    if [[ $ERROR -gt $HARDLIMIT ]]
    then
        OUTPUT="$OUTPUT $DISK:$ERROR"
        CRITICAL="1"
    elif [[ $ERROR -gt $SOFTLIMIT ]]
    then
        OUTPUT="$OUTPUT $DISK:$ERROR"
        WARNING="1"
    fi
done

if [[ $CRITICAL -gt 0 ]]
then
    echo "CRITICAL: Error count > $HARDLIMIT on at least one disk:$OUTPUT"
    exit 2
fi
if [[ $WARNING -gt 0 ]]
then
    echo "WARNING: Disks with error count > $SOFTLIMIT found:$OUTPUT"
    exit 1
fi

echo "OK: No significant disk errors found"
exit 0
###################################################################

cu

Carsten
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
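
P.S. A possible variation on the check, offered only as a sketch: rather than calling iostat once per disk, the complete "iostat -enr" output could be classified in a single pass. The function name classify_errors is hypothetical. The sketch assumes that each data row has the form "s/w,h/w,trn,tot,device", i.e. that field 4 is the total error count (as the script relies on) and field 5 is the device name. Please verify this against the output on your own system before using it.

```shell
#!/bin/bash
# Sketch: classify disks by their total error count.
# Assumption: "iostat -enr" data rows look like "s/w,h/w,trn,tot,device"
# (field 4 = total errors, field 5 = device name).

# classify_errors SOFTLIMIT HARDLIMIT
# Reads iostat -enr output on stdin and prints one line for every disk
# whose total error count exceeds SOFTLIMIT:
#   "WARNING disk:count"  or  "CRITICAL disk:count"
classify_errors() {
    local soft="$1" hard="$2"
    awk -F, -v soft="$soft" -v hard="$hard" '
        # Only rows whose first field is a number are data rows;
        # this skips the two header lines.
        NF >= 5 && $1 ~ /^[ \t]*[0-9]+$/ {
            dev = $5
            gsub(/[ \t]+/, "", dev)      # strip padding around the name
            if      ($4 + 0 > hard) print "CRITICAL " dev ":" $4
            else if ($4 + 0 > soft) print "WARNING "  dev ":" $4
        }'
}

# Possible usage on a host that provides iostat -enr:
#   iostat -enr | classify_errors 5 20
```

Note that this variant looks at every device iostat reports, not only at the disks that belong to a pool, so the result may have to be filtered against the device list from zpool status.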