Hi

I'm running an application that uses hot-plug SATA drives like removable USB 
keys, only bigger and with SATA performance.

I'm using "cfgadm connect", then "configure", then "zpool import" to bring a 
drive online, and export / unconfigure / disconnect before unplugging. All 
works well.
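
For reference, the sequence I'm running looks roughly like this, where 
sata1/3 and hotpool are just placeholder names for my attachment point and 
pool:

  # bring the drive online
  cfgadm -c connect sata1/3
  cfgadm -c configure sata1/3
  zpool import hotpool

  # take it offline before unplugging
  zpool export hotpool
  cfgadm -c unconfigure sata1/3
  cfgadm -c disconnect sata1/3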

What I can't guarantee is that one of my users won't one day just yank the 
drive without running the offline sequence. 

In testing for that case, I'm finding that the system runs fine until a 
command or subsystem tries to write to the drive, at which point that command 
and that subsystem lock up hard.

The big problem is that if I then try a zfs or zpool command to attempt 
recovery, I lose zfs/zpool access to all pools in the system, not just the 
damaged one. Specifically, in testing:

Just one single drive, with s0 in a pool and mounted, then yanked:

- zpool status: I have seen either the pool reported online with no errors, 
or zpool itself lock up.
- I can cd into and ls the missing drive's directory, but if I try to write 
anything my shell locks up hard.
- zfs unmount -f locks up hard, and I can no longer run any zfs command at 
all.
- zpool export -f locks up, and I can no longer run any zpool command at all.
- Even a simple zfs list can lock up the zfs commands.

The rest of the system continues ticking over, but I have now lost access to 
basic admin commands, and I can't find a recovery plan short of a reboot.
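
To make the failure sequence concrete, the test looks roughly like this 
(device and pool names are just placeholders for my setup):

  zpool create hotpool c2t1d0s0   # pool on the hot-plug drive's s0 slice
  cp bigfile /hotpool/            # writes work fine
  # ... drive physically yanked here, no export/unconfigure ...
  ls /hotpool                     # still works
  cp bigfile2 /hotpool/           # shell locks up hard
  zpool status                    # sometimes ONLINE/no errors, sometimes hangs
  zfs unmount -f hotpool          # hangs; zfs commands now unusable
  zpool export -f hotpool         # hangs; zpool commands now unusable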

I've tried "zpool set failmode=continue" on the pool with no luck. I tried 
adding a separate ZIL device; no luck either.
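
The exact commands were along these lines, with hotpool and the log device as 
placeholders:

  # failmode=continue should return EIO to new writes instead of blocking
  zpool set failmode=continue hotpool
  zpool get failmode hotpool

  # separate intent log on an internal disk, in case it made a difference
  zpool add hotpool log c1t0d0s7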

I can't kill the locked processes.

I'm guessing ZFS is waiting for the drive to come back online so it can 
safely store the in-flight writes. Reconnecting the drive makes some of the 
locked processes killable, but not all, and running any zpool/zfs command 
locks everything up again.
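
For what it's worth, this is the kind of thing I've been trying on the stuck 
processes (the PID is just an example, and the ps/pstack checks are only 
illustrative):

  kill -9 1234                      # has no effect on the hung command
  ps -o pid,s,wchan,args -p 1234    # process state and kernel wait channel
  pstack 1234                       # user-level stack of the hung command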

To be clear: the rest of the system, working on different data pools, keeps 
running fine.

I don't mind data loss on the yanked disk (that would be the user's own stupid 
fault), but I can't accept the risk of losing zpool/zfs control of the rest 
of the system.

Trying the same tests with a UFS removable disk, the stuck processes are 
interruptible, so I could live with ZFS on internal disks and UFS on the 
removables, but UFS seems to be significantly slower, and I was hoping for 
the integrity benefits of ZFS.

Any thoughts on how to stabilise the OS without a reboot?

Thanks

Chris