On Sat, Oct 15, 2011 at 6:57 PM, Jim Klimov <jimkli...@cos.ru> wrote:
> Thanks to all who replied. I hope we may continue the discussion, but I'm afraid the overall verdict so far is disapproval of the idea. It is my understanding that those active in the discussion considered it either too limited (in application - for VMs, or for hardware cfg) or too difficult to implement, so that we should rather use some alternative solutions, or at least research them better (thanks Nico).
>
> I guess I am happy not to have seen replies like "won't work at all, period" or "useless, period". I get "difficult" and "limited" and hope these can be worked around sometime, and hopefully this discussion will spark some interest among other software authors or customers to suggest more solutions and applications - to make some shared ZFS a possibility ;)
>
> Still, I would like to clear up some misunderstandings in the replies - because at times we seemed to be speaking about different architectures. Thanks to Richard, I stated what exact hardware I had in mind (and wanted to use most efficiently) while thinking about this problem, and how it is different from "general" extensible computers or server+NAS networks.
>
> Namely, with the shared storage architecture built into the Intel MFSYS25 blade chassis and the lack of expandability of the servers beyond that, some suggested solutions are not applicable (10GbE, FC, InfiniBand), but some networking problems are already solved in hardware (full and equal connectivity between all servers and all shared storage LUNs).
>
> So some combined replies follow below:
>
> 2011-10-15, Richard Elling, Edward Ned Harvey and Nico Williams wrote:
>
>>> #1 - You seem to be assuming storage is slower when it's on a remote storage server as opposed to a local disk. While this is typically true over ethernet, it's not necessarily true over infiniband or fibre channel.
>>
>> Many people today are deploying 10GbE and it is relatively easy to get wire speed for bandwidth and < 0.1 ms average access for storage.
>
> Well, I am afraid I have to reiterate: for a number of reasons, including price, our customers are choosing some specific and relatively fixed hardware solutions. So, time and again, I am afraid I'll have to remind you of the sandbox I'm tucked into - I have to make do with these boxes, and I want to do the best with them.
>
> I understand that Richard comes from a background where HW is the flexible part of the equation and software is designed to be used for years. But for many people (especially those oriented toward fast-evolving free software) the hardware is something they have to BUY, and it then works unchanged for as long as possible. This covers not only enthusiasts like the proverbial "red-eyed linuxoids", but also many small businesses. I still maintain several decade-old computers running infrastructure tasks (luckily, floorspace and electricity are near-free there) which were never virtualized because "if it ain't broken - don't touch it!" ;)
>
> In particular, the blade chassis in my example, which I hoped to utilize to its best using shared ZFS pools, has no extension slots. There is no 10GbE on either the external RJ45 ports or the internal ones (technically there is a 10GbE interlink between the two switch modules), so each server blade is limited to either 2 or 4 1Gbps ports. There is no FC. No InfiniBand. There may be one extSAS link on each storage controller module, and that's it.
>
>> I think the biggest problem lies in requiring full connectivity from every server to every LUN.
>
> This is exactly the sort of connectivity (and the only sort) available to server blades in this chassis.
>
> I think this is just as applicable to networked storage where there is a mesh of reliable connections between disk controllers and disks (or at least LUNs), be it switched FC or dual-link SAS or whatnot.
>
>> Doing something like VMotion would be largely pointless if the VM storage still remains on the node that was previously the compute head.
>
> True. However, in these Intel MFSYS25 boxes no server blade has any local disks (unlike most other blades I know). Any disk space is fed to them - and is equally accessible over an HA link - from the storage controller modules (which are in turn connected to the built-in array of hard disks) that are part of the chassis shared by all servers, like the networking switches are.
>
>> If you do the same thing over ethernet, then the performance will be degraded to ethernet speeds. So take it for granted, no matter what you do, you either need a bus that performs just as well remotely versus locally... Or else performance will be degraded... Or else it's kind of pointless because the VM storage lives only on the system that you want to VMotion away from.
>
> Well, while this is no Infiniband, in terms of disk access this paragraph is applicable to the MFSYS chassis: disk access via the storage controller modules can be considered a fast common bus - if this comforts readers into understanding my idea better. And yes, I do also think that channeling disk access over ethernet via one of the servers is a bad thing, bound to degrade performance compared to what can be had anyway with direct disk access.
>
>> Ethernet has *always* been faster than a HDD. Even back when we had 3/180s, 10Mbps Ethernet was faster than the 30ms average access time of the disks of the day. I tested a simple server the other day and the round trip for 4KB of data on a busy 1GbE switch was 0.2ms. Can you show a HDD as fast? Indeed many SSDs have trouble reaching that rate under load.
>
> As noted by other posters, access times are not bandwidth, so these are two different "faster"s ;) Besides, (1Gbps) Ethernet is faster than a single HDD stream, but it is not quite faster than an array of 14 HDDs...
>
> And if the Ethernet is utilized for its direct tasks - whatever they may be, say video streaming off this server to 5000 viewers, or whatever is needed to saturate the network - disk access over the same ethernet link would have to compete. And whatever the QoS settings, the viewers would lose: either the real-time multimedia signal would lag, or the disk data feeding it would.
>
> Moreover, using an external NAS (a dedicated server with an Ethernet connection to the blade chassis) would give us an external box dedicated and perhaps optimized for storage tasks (i.e. with ZIL/L2ARC), and would free up a blade for VM farming needs, but it would consume much of the LAN bandwidth of the blades using its storage services.
>
>> Today, HDDs aren't fast, and are not getting faster.
>>  -- richard
>
> Well, typical consumer disks did get about 2-3 times faster in linear RW speeds over the past decade; but for random access they still lag a lot. So, "agreed" ;)
>
> //Jim
>

Quite frankly, your choice in blade chassis was a horrible design decision.
From your description of its limitations, it should never have been the building block for a VMware cluster in the first place. I would start by rethinking that decision instead of trying to pound a round ZFS peg into a square hole. --Tim
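
For readers skimming the bandwidth-versus-latency exchange quoted above, here is a rough back-of-envelope sketch in Python. It assumes round per-disk figures (roughly 100 MB/s sequential and roughly 8 ms average access per HDD) that are illustrative guesses, not numbers from this thread; the 0.2 ms 1GbE round trip and the 14-disk array are the figures the thread does mention. It shows how both sides can be right: a 1 Gbps link beats a single HDD on latency and on streaming rate, but not the aggregate sequential bandwidth of a 14-disk array.

# Back-of-envelope comparison of the "access time vs. bandwidth" point above.
# Per-disk figures below are assumed, not measured in this thread.

GBE_LINE_RATE_MB_S = 1000 / 8   # 1 Gbps link ~ 125 MB/s before protocol overhead
HDD_SEQ_MB_S = 100              # assumed sustained sequential rate of one HDD
HDD_AVG_ACCESS_MS = 8.0         # assumed average seek + rotational latency
GBE_RTT_MS = 0.2                # 4KB round trip on a busy 1GbE switch (from the thread)
DISKS_IN_ARRAY = 14             # the 14-HDD array mentioned in the thread

array_seq_mb_s = DISKS_IN_ARRAY * HDD_SEQ_MB_S

print(f"1GbE line rate:            ~{GBE_LINE_RATE_MB_S:.0f} MB/s")
print(f"Single HDD (sequential):   ~{HDD_SEQ_MB_S} MB/s  -> the link is faster")
print(f"14-HDD array (sequential): ~{array_seq_mb_s} MB/s -> ~{array_seq_mb_s / GBE_LINE_RATE_MB_S:.0f}x more than the link")
print(f"Access time: 1GbE ~{GBE_RTT_MS} ms vs HDD ~{HDD_AVG_ACCESS_MS} ms -> the network wins on latency")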
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss