>>>>> "gc" == Gray Carper <[EMAIL PROTECTED]> writes:

    gc> 5. The NAS head node has wrangled up all six of the iSCSI
    gc> targets 

are you using raidz on the head node?  It sounds like simple striping,
which is probably dangerous with the current code.  This kind of sucks
because with simple striping you will get the performance of the 6
mega-spindles, while in a raidz you don't just get less storage, you
get ~1/6th the seek bandwidth.  but that's better than losing a whole
pool.  It's not even fully effective redundancy if
resilvering/scrubbing takes 3 days per 1TB, but if it just stops the
pool from becoming corrupt and unimportable then it's done its job.
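for concreteness, the two layouts I mean look something like this.  a dry-run sketch only -- the LUN names (c2t0d0 through c2t5d0) are made-up stand-ins for the six iSCSI targets, and DRY_RUN=1 just prints the commands instead of running them:

```shell
#!/bin/sh
# Hypothetical device names standing in for the six iSCSI LUNs.
# DRY_RUN=1 prints each command rather than executing it.
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

LUNS="c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0"

# simple stripe: full capacity and the seek bandwidth of all six
# mega-spindles, but one corrupt LUN can take out the whole pool
run zpool create tank $LUNS

# raidz across the same LUNs: you give up one LUN of capacity and most
# of the seek bandwidth, but a single LUN loss leaves the pool importable
run zpool create tank raidz $LUNS
```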

how are you backing up that much storage?  or is it all ephemeral?
It's common to lose a whole pool, so I'd have thought you'd want to,
for example, keep home directories on a main pool and a backup pool,
but keep only one copy of the backup dumps since in theory they have
corresponding originals somewhere else.  

If you did split your x4500 * 6 into two pools, I wonder how you'd lay
out a ``main pool'' and ``backup pool'' such that they'd be unlikely
to get corrupt together.  make them on disjoint sets of iscsi target
nodes?  keep the backup pool exported?  

you could keep the backup pool imported so other groups can write their
backups there, and spread it across all 6 targets, but declare a
recurring noon - 3pm maintenance window for the backup pool, in which
you: export, test-import, export, take snapshots of the zvols on the
target nodes, import.  Normally you would need II (Instant Image) to use
device-snapshots for corruption protection, but since you have two
layers of ZFS you can use this remedial trick without learning how to
use AVS.  but only while the pool is exported because otherwise
there's no way to take all six snapshots at the same instant.  
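the maintenance window would go roughly like this.  again a dry-run sketch, assuming a pool named ``backup'' and six target nodes t1..t6 each exporting a zvol named backup/lun -- all of those names are hypothetical:

```shell
#!/bin/sh
# Sketch of the noon-3pm window.  Pool name "backup", target hostnames
# t1..t6, and zvol name backup/lun are all hypothetical.
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

run zpool export backup           # quiesce: no writes in flight
run zpool import backup           # test-import: prove it is importable
run zpool export backup
for t in t1 t2 t3 t4 t5 t6; do    # snapshot each zvol on its target node;
  run ssh "$t" zfs snapshot "backup/lun@maint-$(date +%Y%m%d)"
done                              # safe only while the pool is exported
run zpool import backup           # back in service
```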

or you could just hope.

I'm most interested in failure testing.  What happens when you reboot
nodes or break network connections while there's write activity to the
pool?  It's nice that the ``service address'' fails over and fails
back, and that you've somehow extended the heartbeat all the way from
target to head node, but does this actually work well with iSCSI?
Does the iSCSI TCP circuit get broken and remade when the address
moves, and does this cause errors or even cause corruption if it
happens while writing to the pool?  How about something more drastic
like rebooting the x4500's---does the head node patiently wait and
then continue where it left off like clients are supposed to when
their NFS server reboots, or does it panic, or does it freeze for a
couple minutes, mark the target down and continue, and then throw a
bunch of CKSUM errors when the target comes back?  The last one is
what happens for me, but I have a mirrorz vdev on the head node so my
setup's different.
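one way to watch what the head node does during such a test: note the error counters before rebooting a target, then poll pool health while it's down and after it comes back.  sketch only -- the pool name ``tank'' is hypothetical, and DRY_RUN=1 just prints the commands:

```shell
#!/bin/sh
# Watching the head node's reaction while an x4500 target reboots.
# Pool name "tank" is hypothetical; DRY_RUN=1 prints instead of runs.
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

# before rebooting the target: record starting READ/WRITE/CKSUM counts
run zpool status -v tank

# ...reboot the target, then watch for the device being marked down,
# and for CKSUM errors accumulating once it returns (bounded loop here;
# in practice you'd watch until the resilver/clear settles):
for i in 1 2 3; do
  run zpool status -x tank
  run sleep 10
done
```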

If you can get this setup to work sanely in error scenarios, I think
it can potentially have an availability advantage because some of the
driver problems causing freezes and kernel panics and hung ports we've
seen won't hurt you---you can just reboot the whole target node, so
shitty drivers become merely irritating to the sysadmin instead of an
availability problem.  but my expectation is, you can't.

It sounds really scary to me, to be honest, like: 200 eggs, one basket.
and the basket is made of duct tape.

i'm less interested in performance.  I can think of a bunch of silly
performance-test questions but I found most interesting Archie's
experience about how performance can influence effective reliability.
Here are the silly questions:

  have you tried any other layouts?  like exporting individual disks
  with iSCSI?  My intuition is that this would not work well because
  of TCP congestion, and I also worry the iscsi target would freeze
  the whole box when one drive failed, a behavior which could be
  statistically significant to the overall system's reliability when
  there are so many drives involved.  but I wonder.  also a simple SVM
  stripe, or maybe two or three stripes per box, might be faster by
  avoiding zvol COW.

    (also, know that Linux has an iSCSI target, too.  actually it has
     three right now: IET, scst, and stgt)

  any end-to-end testing yet?  how is the performance of NFS or CIFS
  or...what are you hoping to use over the infiniband again,
  comstar/iSER or is it just IP+NFS?  i don't know much about IB.

  are there fast disks in the head node that you could use to
  experiment with slog or l2arc?  since slogs can't be removed without
  destroying the pool, you might want to test NFS with and without a
  slog before the pool has real data on it.

  can you try with and without RED on the switches?  i've always
  wondered if this makes a difference but never bothered to check,
  because my targets are too slow.
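on the slog/l2arc question, the experiment would look roughly like the commands below.  dry-run sketch, assuming the head node has two fast local disks c1t2d0 and c1t3d0 (made-up names).  note the asymmetry that motivates testing early: a cache device can be removed later, a log device cannot (with the current code):

```shell
#!/bin/sh
# Hypothetical fast local disks c1t2d0 / c1t3d0 on the head node;
# pool name "tank" is also hypothetical.  DRY_RUN=1 prints commands.
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

run zpool add tank log c1t2d0     # slog: sync-write (NFS commit) latency
run zpool add tank cache c1t3d0   # l2arc: read-side working set

# backing out the l2arc half is easy; the slog half is forever:
run zpool remove tank c1t3d0      # works for cache devices only
```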


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss