>>>>> "gc" == Gray Carper <[EMAIL PROTECTED]> writes:
gc> 5. The NAS head node has wrangled up all six of the iSCSI
gc> targets

Are you using raidz on the head node? It sounds like simple striping, which is probably dangerous with the current code. This kind of sucks, because with simple striping you get the performance of the six mega-spindles, while with raidz you don't just get less storage, you get ~1/6th of the seek bandwidth (a rough sketch of the two layouts is below). But that's better than losing a whole pool. It's not even fully effective redundancy if resilvering/scrubbing takes 3 days per 1TB, but if it just keeps the pool from becoming corrupt and unimportable then it's done its job.

How are you backing up that much storage, or is it all ephemeral? Losing a whole pool is common enough that I'd have thought you'd want to, for example, keep home directories on a main pool and a backup pool, but keep only one copy of the backup dumps, since in theory those have corresponding originals somewhere else.

If you did split your x4500 * 6 into two pools, I wonder how you'd lay out a ``main pool'' and a ``backup pool'' such that they'd be unlikely to get corrupted together. Make them on disjoint sets of iSCSI target nodes? Keep the backup pool exported? You could keep the backup pool imported so other groups can write their backups there, and spread it across all six targets, but declare a recurring noon - 3pm maintenance window for the backup pool, in which you: export, test-import, export, take snapshots of the zvols on the target nodes, and import again (rough commands sketched below). Normally you would need II (Instant Image) to use device-level snapshots for corruption protection, but since you have two layers of ZFS you can use this remedial trick without learning how to use AVS. It only works while the pool is exported, though, because otherwise there's no way to take all six snapshots at the same instant. Or you could just hope.

I'm most interested in failure testing. What happens when you reboot nodes or break network connections while there's write activity to the pool? It's nice that the ``service address'' fails over and fails back, and that you've somehow extended the heartbeat all the way from target to head node, but does this actually work well with iSCSI? Does the iSCSI TCP connection get broken and remade when the address moves, and does this cause errors, or even corruption, if it happens while writing to the pool? How about something more drastic, like rebooting the x4500's---does the head node patiently wait and then continue where it left off, the way clients are supposed to when their NFS server reboots? Does it panic? Or does it freeze for a couple of minutes, mark the target down and continue, and then throw a bunch of CKSUM errors when the target comes back? The last one is what happens for me, but I have a mirror vdev on the head node, so my setup's different.

If you can get this setup to work sanely in error scenarios, I think it can potentially have an availability advantage, because some of the driver problems causing freezes and kernel panics and hung ports we've seen won't hurt you---you can just reboot the whole target node, so shitty drivers become merely irritating to the sysadmin instead of an availability problem. But my expectation is, you can't. It sounds really scary to me, to be honest, like: 200 eggs, one basket. And the basket is made of duct tape.

I'm less interested in performance. I can think of a bunch of silly performance-test questions, but I found most interesting Archie's experience about how performance can influence effective reliability.
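To be concrete about the raidz question above, the two layouts I mean look roughly like this. These are not your commands, just a sketch; the device names are made-up stand-ins for however the six iSCSI LUNs show up on the head node:

  # plain stripe: full bandwidth of all six targets, but one corrupt or
  # lost target can take out the whole pool
  zpool create tank c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0

  # raidz across the same six LUNs: survives one bad target, at the cost
  # of one LUN of capacity and roughly single-LUN seek bandwidth
  zpool create tank raidz c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0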
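And the maintenance-window trick, roughly. Pool and zvol names here are invented; the point is only that the head-node pool stays exported while all six target-side snapshots get taken:

  # on the head node: release the pool, and prove it will actually import
  zpool export backup
  zpool import backup
  zpool export backup

  # on each of the six x4500 targets: snapshot the zvol backing its LUN
  zfs snapshot data/backup-lun@noon-$(date +%Y%m%d)

  # back on the head node: bring the pool in for the rest of the day
  zpool import backup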
Here are the silly questions. Have you tried any other layouts, like exporting individual disks with iSCSI (a sketch of what I mean is at the end of this mail)? My intuition is that this would not work well because of TCP congestion, and I also worry the iSCSI target would freeze the whole box when one drive failed, a behavior which could be statistically significant to the overall system's reliability when there are so many drives involved. But I wonder. Also, a simple SVM stripe, or maybe two or three stripes per box, might be faster by avoiding zvol COW. (Also, know that Linux has an iSCSI target, too. Actually it has three right now: IET, scst, and stgt.)

Any end-to-end testing yet? How is the performance of NFS or CIFS or...what are you hoping to use over the InfiniBand again, COMSTAR/iSER, or is it just IP+NFS? I don't know much about IB.

Are there fast disks in the head node that you could use to experiment with a slog or L2ARC (also sketched at the end)? Since slogs can't be removed without destroying the pool, you might want to test NFS with and without a slog before the pool has real data on it.

Can you try with and without RED on the switches? I've always wondered whether that makes a difference, but I've never bothered to check because my targets are too slow.
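Here is what I mean by ``individual disks'' versus the one-big-zvol-per-target layout I assume you have now. Names are invented, and I'm going from memory of the old iscsitadm/shareiscsi interface rather than COMSTAR, so treat it as a sketch:

  # zvol per x4500, exported as a single LUN (what I assume you do now)
  zfs create -V 5t data/lun0
  zfs set shareiscsi=on data/lun0

  # versus exporting each raw disk as its own target, skipping the zvol COW layer
  iscsitadm create target -b /dev/rdsk/c1t0d0s2 c1t0d0
  iscsitadm create target -b /dev/rdsk/c1t1d0s2 c1t1d0
  # ...and so on for the rest of the disks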
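And the slog/L2ARC experiment would be something like this, again with made-up device names for the fast local disks:

  # dedicate a fast disk as a separate intent log; as things stand you
  # cannot take it back out without destroying the pool
  zpool add tank log c0t2d0

  # an L2ARC cache device is safer to play with, since it can be removed
  zpool add tank cache c0t3d0
  zpool remove tank c0t3d0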