>>>>> "hj" == Henrik Johansson <henr...@henkis.net> writes:
hj> I have been operating quite large deployments of SVM/UFS
hj> VxFS/VxVM for some years and while you sometimes are forced to
hj> do a filesystem check and some files might end up in
hj> lost+found I have never lost a whole filesystem.

I think in the world we want, even with the other filesystems, the SAN
fabric or array controller or disk shelf should be able to reboot
without causing any files to show up in lost+found, or requiring
anything other than the normal log roll-forward. I bet there are
rampant misimplementations.

Maybe the whole SAN situation is ubiquitously misthought, because
filesystem designers build things assuming that whenever anything
``crashes,'' the kernel and their own code will go down too. They
invent a clever way to handle a non-SAN cord-yanking, test it, and
yup, you can yank the cord and it works fine. But this isn't the only
way things can fail.

In the diagram below the disk loses power, but the host, SAN, and
controller don't. I doubt this is too common. Probably I should redo
diagrams like this after better understanding the disk commandset and
iSCSI tagged commands and stuff, for other parts of the stack
rebooting, like the SAN or the controller.

    filesystem   initiator    SAN    controller   diskbuffer   platter

                 [...earlier writes not shown...]

 t  SYNC ------..
 i            ---------..
 m                   -----------..
 e                          -------------
 |  write(A)                      |      .
 |  write(B)                      v      .
 v  write(C)                ..-------------
                     ..-----------
              ..---------
    success ------   good.  A-C are on the platter.
                     commit ueberblock(D).
    write(D) -----..
              ---------..
    write(E) -----..
                  -----------..
                         --------..
                                ------------ [D]
    write(F) -----..
                  -----------..
                         -------..
                                ------------- [E]
    write(G) -----..
                  -----------..
              =======POWER FAILURE========
                         -------..
                                -------------- poof...[F] gone
                  -----------..
                                      XXXX no
                                ..XXXX disk
              ..-----------
         ..-------
    ERROR(G) <---- ohno!  couldn't write G.
                   increment error counter
              =======POWER RESTORED========
    retry
    write(G) -----..
                  -------..
    SYNC -----..
                  -----------..
                         -------..
                                -------------- [G]
                  -----------..
                                --------------  write(G)
                                               .
                  ..--------------
             ..------------
        ..-------
    success ----- good.  that means D-G are on the platter.
                  commit ueberblock(H)
    write(H)   <-- DANGER, Will Robinson.

Writes D - F were lost in this ``event,'' and the filesystem has no
idea. If ===POWER FAILURE=== applied to the filesystem and the disk at
the same time, then this problem would not exist---the way we are
using SYNC here would be enough to stop H from being written---so
power failures for non-SAN setups are safe from this.

Also, if we treat the disk as bad the moment it says ``write
failure''---if the array controller decides ``this disk is bad,
forever,'' and the instant it loses power and times out write F the
controller considers its entire contents lost and does not bother
reading ANYthing from it until it's been resilvered by the other
disks in the RAIDset---then we also do not have this problem. So
power failures on an SVM mirror, with no understanding of the
overlying filesystem, are okay.

Using naked UFS or ext3 or whatever over a SAN still has this problem,
I think. Those filesystems are just better at losing some data without
losing the whole filesystem, compared to ZFS. I think ZFS attempts to
be smarter than SVM, and also more broadly ambitious than the
one-power-supply, all-in-one-box case, but is probably not smart
enough to finish the job. Rather than just more UFS/VxFS-style
robustness, I'd like to see the job finished and this SAN write hole
closed up.

It's important to accept that nothing is broken in this event. It's
just a yanked power cord. I won't accept, ``a device failed, and you
didn't have enough redundancy, so all bets are off. You must feed ZFS
more redundancy. You expect the impossible.'' No, that argument is
bullshit. Losing power unexpectedly is not the same as a device
failure---unexpected power loss is part of the overall state diagram
of a normal, working storage system.

hj> We are currently evaluating if we should begin to implement
hj> ZFS in our SAN.
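The failure sequence in that diagram is mechanical enough to model in
a few lines. Here's a toy simulation of it (everything here---the
Disk class, the lazy tick() destage---is made up for illustration,
not any real commandset):

```python
# Toy model of the timeline above: a disk with a volatile write cache
# that destages lazily, behind a filesystem which believes a
# successful SYNC means everything issued since the last SYNC is on
# the platter.

class Disk:
    def __init__(self):
        self.platter = []        # durable
        self.cache = []          # volatile diskbuffer
        self.powered = True

    def write(self, block):
        if not self.powered:
            raise IOError("no disk")   # the initiator sees ERROR(...)
        self.cache.append(block)

    def tick(self):
        # the disk destages its cache to the platter in its own time
        if self.powered and self.cache:
            self.platter.append(self.cache.pop(0))

    def sync(self):                    # SYNCHRONIZE CACHE
        if not self.powered:
            raise IOError("no disk")
        self.platter += self.cache
        self.cache = []

disk = Disk()

for b in "ABC":                        # write(A)..write(C)
    disk.write(b)
disk.sync()                            # success -> commit ueberblock(D)

disk.write("D"); disk.tick()           # [D] reaches the platter
disk.write("E"); disk.tick()           # [E] reaches the platter
disk.write("F")                        # [F] sits in the diskbuffer

disk.powered = False                   # =======POWER FAILURE========
disk.cache = []                        # poof...[F] gone
try:
    disk.write("G")
except IOError:
    pass                               # ERROR(G) -- the ONLY failure seen

disk.powered = True                    # =======POWER RESTORED========
disk.write("G")                        # retry the one failed write
disk.sync()                            # success -> commit ueberblock(H)

# The filesystem now believes D-G are durable, but:
print(disk.platter)                    # ['A', 'B', 'C', 'D', 'E', 'G']
```

Like in the diagram, the only error anyone ever sees is on G, the
write that was in flight; D and E happened to make it, F evaporated
with the cache, and the successful SYNC afterward tells the
filesystem a lie.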
hj> I can see great opportunities with ZFS but if
hj> we have a higher risk of losing entire pools

Optimistically, the ueberblock rollback will make ZFS like the other
filesystems, though maybe faster to recover. If you are tied to
stable Solaris it'll probably take about a year before you get your
hands on it, but so far I think everyone agrees it's promising.

I think it's not enough, though. If the problem is that a batch of
writes was lost, then a trick to recover the pool still won't recover
those lost writes, and you promised applications those writes were on
the disk. Databases and filesystems inside zvols could still become
corrupt. What this really means is that using SANs makes corruption
in general more likely. I think we sysadmins should start using some
tiny 10-line programs to test the SANs and figure out what's wrong
with them. I think in the end we will need about two things to fix
it:

 * some kind of commit/replay feature in iSCSI and FC initiators, or
   else the same feature implemented in the filesystems right above
   them but cooperating with the initiators pretty intimately.
   Gigabytes of write data could be ``in flight''---we are talking
   about however much data sits between the return of one SYNCHRONIZE
   CACHE command and the next one---so it'd be good to arrange that
   it not be buffered two or three or four times, which may require
   layer-violating cooperation. I'm all but certain nobody's doing
   this now.

   - is it in the initiator? commit/replay in the initiator would
     mean the initiator issues SYNCHRONIZE CACHE commands for itself,
     ones not demanded by the filesystem above it, whenever its
     replay write cache gets too large. I've never heard of that, and
     I don't think anyone would put up with an iSCSI/FC initiator
     burning up gigabytes of RAM without an explanation---which means
     I'd have heard about it and been worried about tuning it.

   - is it in the filesystem?
     Any filesystem designed before SANs will expect to eventually
     get a successful return from any SYNCHRONIZE CACHE command it
     passes to storage. A failed SYNC will happen in the form of
     someone yanking the cord, so the filesystem code will never see
     the failure, because it won't be executing any longer. UFS and
     ext3 don't even bother to issue SYNCHRONIZE CACHE at all, much
     less pay attention to its return value and buffer writes so they
     can be replayed if it fails, so I doubt they have an exception
     path for a failed SYNC command.

     Putting replay in the filesystem also means that if the iSCSI
     initiator notices the target bounce, then it MUST warn the
     layers above that writes were lost, for example by waiting for
     the next SYNCHRONIZE CACHE command to come along and
     deliberately returning it failed without consulting the target,
     even though the LUN would say it succeeded if it were issued.
     I've never heard of anything like this.

 * pay some attention to what happens to ZFS when a SAN controller
   reboots, separately with each 'failmode' setting. To maintain
   correctness with NFS clients the zpool is serving, or with
   replicated/tiered database applications where the dbms app is
   keeping several nodes in sync, ZFS may need a failmode=umount that
   kills any app with outstanding writes on a failed pool and
   un-NFS-exports all the pool's filesystems. The existing
   failmode=panic could probably be verified (and would likely have
   to be fixed) to provide the same level of correctness, but that
   would not be as good as the umount-and-kill, because it'd make HA
   and zones more antagonistic to each other by putting many zones at
   the mercy of the weakest pool on the system, which could even be a
   USB stick or something. It's the wrong direction to move.

   I am not sure what failmode=continue and failmode=wait mean now,
   or what they should mean to fix this problem.
It'd be nice if they meant what they claim to be:

  ``wait: use commit/replay schemes so that no writes are lost even
  if the SAN controller reboots. Apps should be frozen until they can
  be allowed to continue as if nothing went wrong.

  continue: fsync() returns -1 immediately for the first data that
  never made it to disk, and continues returning -1 until all writes
  issued up to now are on the platter, including writes that had to
  be replayed because of the reboot. Once fsync() has been called and
  has returned -1, all write() to that file must also fail because of
  the barrier. And once your app calls fsync() a second, third,
  fourth time and finally gets a 0 return from fsync(), it can be
  sure no data was lost.''

Of course all that seems optimistic beyond ridiculous, even for UFS
and VxFS. But if implemented like that, panic and wait should both be
safe for SAN outages, and continue we already understand to be
unsafe---but implemented like this it becomes possible to write a
cooperating app, like a database or a user-mode iSCSI target app for
example, which is correct.

hj> So, what is the opinion, is this an existing problem even when
hj> using enterprise arrays? If I understand this correctly there
hj> should be no risk of losing an entire pool if
hj> DKIOCFLUSHWRITECACHE is honored by the array?

No, the timing diagram I showed explains how I think data might still
be lost during a SAN reboot, even for a SAN which respects cache
flushes. But all this is pretty speculative for now.
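If failmode=continue ever meant what I described above, a careful
app's commit path could look something like this (a sketch of the app
side only; the retry loop and the fsync_until_durable name are my
invention, and today's fsync() makes no such promise):

```python
import errno
import os
import tempfile
import time

def fsync_until_durable(fd, retries=10, delay=1.0):
    """Under the hoped-for failmode=continue contract: fsync() keeps
    failing until every write issued so far---including writes the
    stack had to replay after the controller reboot---is on the
    platter, so the first success means nothing was lost."""
    for _ in range(retries):
        try:
            os.fsync(fd)              # our SYNCHRONIZE CACHE barrier
            return True               # all data up to here is durable
        except OSError as e:
            if e.errno != errno.EIO:
                raise
            time.sleep(delay)         # replay still in progress; retry
    return False                      # give up: data may be lost

# usage, against an ordinary temp file standing in for a pool:
fd, path = tempfile.mkstemp()
os.write(fd, b"payload")
print(fsync_until_durable(fd))        # True on a healthy pool
os.close(fd)
os.unlink(path)
```

The point of the contract is that the app never has to guess: a 0
from fsync() after any number of -1s means the whole history of
writes is down, not just the recent ones.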
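And here's the sort of tiny test program I meant for poking at SANs
(a sketch under assumptions: TARGET would really be a file on the
filesystem backed by the LUN under test, and ACKLOG must live on a
local disk; run it, bounce the controller mid-run, then rerun with
'verify'):

```python
import os
import struct
import sys
import tempfile

# Assumed paths for illustration; point TARGET at the SAN-backed
# filesystem and keep ACKLOG on a LOCAL disk in real use.
TARGET = os.path.join(tempfile.gettempdir(), "lun-under-test.dat")
ACKLOG = os.path.join(tempfile.gettempdir(), "acked.seq")
BLOCK, COUNT = 512, 256

def run():
    """Write sequence-numbered blocks, fsync after each, and record
    (locally) the highest sequence number acknowledged as durable."""
    fd = os.open(TARGET, os.O_WRONLY | os.O_CREAT, 0o600)
    for seq in range(COUNT):
        os.pwrite(fd, struct.pack("<Q", seq).ljust(BLOCK, b"\0"),
                  seq * BLOCK)
        os.fsync(fd)                  # our SYNCHRONIZE CACHE
        with open(ACKLOG, "w") as f:
            f.write(str(seq))         # this write was promised durable
    os.close(fd)

def verify():
    """After the bounce: every block up to the acknowledged
    high-water mark must read back intact, or the stack dropped a
    write it had promised was on the platter."""
    acked = int(open(ACKLOG).read())
    fd = os.open(TARGET, os.O_RDONLY)
    lost = [seq for seq in range(acked + 1)
            if struct.unpack_from("<Q",
                   os.pread(fd, BLOCK, seq * BLOCK))[0] != seq]
    os.close(fd)
    print("lost acknowledged blocks:", lost)  # nonempty = write hole
    return lost

if __name__ == "__main__":
    verify() if sys.argv[1:] == ["verify"] else run()
```

On a correct stack 'verify' should print an empty list no matter when
you yank things; any sequence number it reports is exactly the D-F
situation from the diagram, caught red-handed.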
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss