Re: [zfs-discuss] Proposed 2540 and ZFS configuration

2008-09-02 Thread Mertol Ozyoney
That's exactly what I said in a private email. The J4200 or J4400 can offer
better price/performance. However, the price difference is not as much as you
might think. Besides, the 2540 has a few functions that cannot be found on the
J series, like SAN connectivity, internal redundant RAID controllers
(redundancy is good, and you can make use of the controllers when connected
to other hosts like Windows servers), the ability to change stripe size/RAID
level and other parameters on the fly, and so on.





Mertol Ozyoney 
Storage Practice - Sales Manager

Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +90212335
Email [EMAIL PROTECTED]


-Original Message-
From: Al Hopper [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 02, 2008 3:53 AM
To: [EMAIL PROTECTED]
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Proposed 2540 and ZFS configuration

On Mon, Sep 1, 2008 at 5:18 PM, Mertol Ozyoney <[EMAIL PROTECTED]>
wrote:
> A few quick notes.
>
> The 2540's first 12 drives are extremely fast due to the fact that they have
> direct unshared connections. I do not mean that the additional disks are
> slow; I want to say that the first 12 are extremely fast compared to any
> other disk system.
>
> So although it's a little bit more expensive, it could be a lot faster to
> add a second 2540 than to add a second drive expansion tray.
>
> We generally use a few 2540's with 12 drives running in parallel for
> extreme performance.
>
> Again, with the additional disk tray the 2540 will still perform quite well;
> for extreme performance, though, go with the parallel 2540's.

For this application the 2540 is overkill and a poor fit.  I'd
recommend a J4xxx series JBOD array and matching SAS
controller(s).  With enough memory in the ZFS host, you don't need
hardware RAID with buffer RAM.   Spend your dollars where you'll get
the best payback - buy more drives and max out the RAM on the ZFS
host!!

In fact, if it's not too late, I'd return the 2540.

Regards,

-- 
Al Hopper Logical Approach Inc,Plano,TX [EMAIL PROTECTED]
 Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/



Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-09-02 Thread Ross Smith

Thinking about it, we could make use of this too.  The ability to add a
remote iSCSI mirror to any pool without sacrificing local performance
could be a huge benefit.
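
(For illustration: attaching an iSCSI-backed LUN as an extra side of an
existing mirror is roughly the following - a sketch with placeholder pool
and device names, assuming the remote target has already been discovered:

  # c2t0d0 is one of the local mirror halves; the long cXt...d0 name is the
  # LUN backed by the remote iSCSI target
  zpool attach tank c2t0d0 c3t600144F04B23AA7100080027D098D100d0

which leaves a three-way mirror whose third side lives on the remote box.)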


> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org
> Subject: Re: Availability: ZFS needs to handle disk removal / driver failure 
> better
> Date: Fri, 29 Aug 2008 09:15:41 +1200
> 
> Eric Schrock writes:
> > 
> > A better option would be to not use this to perform FMA diagnosis, but
> > instead work into the mirror child selection code.  This has already
> > been alluded to before, but it would be cool to keep track of latency
> > over time, and use this to both a) prefer one drive over another when
> > selecting the child and b) proactively timeout/ignore results from one
> > child and select the other if it's taking longer than some historical
> > standard deviation.  This keeps away from diagnosing drives as faulty,
> > but does allow ZFS to make better choices and maintain response times.
> > It shouldn't be hard to keep track of the average and/or standard
> > deviation and use it for selection; proactively timing out the slow I/Os
> > is much trickier. 
> > 
> This would be a good solution to the remote iSCSI mirror configuration.  
> I've been working through this situation with a client (we have been 
> comparing ZFS with Cleversafe) and we'd love to be able to get the read 
> performance of the local drives from such a pool. 
> 
> > As others have mentioned, things get more difficult with writes.  If I
> > issue a write to both halves of a mirror, should I return when the first
> > one completes, or when both complete?  One possibility is to expose this
> > as a tunable, but any such "best effort RAS" is a little dicey because
> > you have very little visibility into the state of the pool in this
> > scenario - "is my data protected?" becomes a very difficult question to
> > answer. 
> > 
> One solution (again, to be used with a remote mirror) is the three way 
> mirror.  If two devices are local and one remote, data is safe once the two 
> local writes return.  I guess the issue then changes from "is my data safe" 
> to "how safe is my data".  I would be reluctant to deploy a remote mirror 
> device without local redundancy, so this probably won't be an uncommon 
> setup.  There would have to be an acceptable window of risk when local data 
> isn't replicated. 
> 
> Ian

_
Make a mini you and download it into Windows Live Messenger
http://clk.atdmt.com/UKM/go/111354029/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] virtualbox & opensolaris b95 - zfs issue

2008-09-02 Thread Robert Milkowski
Hello zfs-discuss,

  I installed OpenSolaris 2008.05 on my notebook and then
  upgraded it to b95 (following the required procedure). Everything
  worked fine.

  I then booted into Windows, installed VirtualBox, and wanted
  it to boot OpenSolaris from the physical partition.

  So I created a vmdk representing the entire disk and another one
  representing just that one partition.
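
  (For reference: raw-disk vmdks of this sort are usually created on the
  Windows host with VBoxManage along these lines - the disk number and
  partition list below are just placeholders:

    VBoxManage internalcommands createrawvmdk -filename osol-disk.vmdk -rawdisk \\.\PhysicalDrive0
    VBoxManage internalcommands createrawvmdk -filename osol-part.vmdk -rawdisk \\.\PhysicalDrive0 -partitions 2
  )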

  I can boot it into GRUB; then I try to boot OpenSolaris and it
  loads the kernel fine, but then ZFS complains that it can't mount
  the root fs. So I mounted the OS LiveCD image, booted from it with
  the vmdk representing the partition also presented, and I was able
  to import rpool without any issues; I then ran devfsadm -Cv inside it.

  So now, again, I'm trying to boot from disk and I'm getting the
  following on the console (I booted it with kmdb in order to be
  able to intercept the message):

  [...]
  NOTICE: zfs_parse_bootfs: error 19

  panic[cpu0]/thread=fec1cfe0: cannot mount root path /[EMAIL 
PROTECTED],0/[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a

  fec351ac genunix:rootconf +10b (c0f040, 1, fec1c750)
  fec351d0 genunix:vfs_mountroot+54 (fe800010, fec30fd8)
  fec351e4 genunix:main+b9 ()

  [...]

  I guess error 19 is ENODEV, coming from spa_vdev_attach.
  The rootpath it is trying to use seems fine (compared with the
  one I get if I boot from the LiveCD with the disk still presented).
  
  From kmdb:

  ::spa
  d3ccb200 IOFAILURE rpool

  ::spa -c
  [phys_path is wrong, but I'm not sure if it matters - what does -c
  actually do?]


  
  If I boot from the LiveCD in the above config I'm able to import rpool,
  so it looks like VirtualBox does properly present the disk to the system.

  Any ideas?
  


-- 
Best regards,
 Robert Milkowski  mailto:[EMAIL PROTECTED]
 http://milek.blogspot.com



Re: [zfs-discuss] RFE: allow zfs to interpret '.' as a dataset?

2008-09-02 Thread Mark J Musante
On Mon, 1 Sep 2008, Gavin Maltby wrote:

> I'd like to be able to utter cmdlines such as
>
> $ zfs set readonly=on .
> $ zfs snapshot [EMAIL PROTECTED]
>
> with '.' interpreted to mean the dataset corresponding to the current 
> working directory.

Sounds like it would be a useful RFE.

> This would shorten what I find to be a very common operation - that of 
> discovering your current (working directory) dataset and performing some 
> operation on it.  I usually do this with df and some cut and paste:

There's an easier way: the zfs list command can take a pathname argument 
and, by using some options, you can filter the output:

cur_dataset="$(zfs list -Ho name $(pwd))"
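
With that, the operations Gavin asked about become one-liners (a sketch;
"now" is just an example snapshot name):

    zfs set readonly=on "$(zfs list -Ho name "$(pwd)")"
    zfs snapshot "$(zfs list -Ho name "$(pwd)")@now"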


Regards,
markm


Re: [zfs-discuss] Proposed 2540 and ZFS configuration

2008-09-02 Thread Bob Friesenhahn
On Tue, 2 Sep 2008, Mertol Ozyoney wrote:

> That's exactly what I said in a private email. The J4200 or J4400 can offer
> better price/performance. However, the price difference is not as much as you
> might think. Besides, the 2540 has a few functions that cannot be found on the
> J series, like SAN connectivity, internal redundant RAID controllers
> (redundancy is good, and you can make use of the controllers when connected
> to other hosts like Windows servers), the ability to change stripe size/RAID
> level and other parameters on the fly, and so on.

It seems that the cost is the base chassis cost and then the price 
that Sun charges per disk drive.  Unless you choose cheap SATA drives, 
or do not fully populate the chassis, the chassis cost will not be a 
big factor.  Compared with other Sun products and other vendors (e.g. 
IBM), Sun is fairly competitive with its disk drive pricing for the 
2540.  The fiber channel can be quite a benefit since it does not care 
about distance and offers a bit more bandwidth than SAS.  With SAS, 
the server and the drive array pretty much need to be in the same rack 
and close together as well.  A drawback of fiber channel is that the 
host adaptor is much more expensive (3X to 4X).

A big difference between the J series and the 2540 is how Sun sells 
it.  The J series is sold in a minimal configuration with the user 
adding drives as needed whereas the 2540 is sold in certain 
pre-configured maximal configurations.  This means that the starting 
cost for the J series is much lower.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] Proposed 2540 and ZFS configuration

2008-09-02 Thread Kenny
Bob,

I used your script (thanks), but I fail to see which controller controls which 
disk.  Your white paper shows six LUNs with the active state first and then 
six with the active state second; however, mine all show the active state first.

Yes, I've verified that both controllers are up and CAM sees them both.  
mpathadm reports 4 paths to each LUN.


Active state output.

bash-3.00# ./mpath.sh
=== /dev/rdsk/c6t600A0B800049E81C03BF48BD0510d0s2 ===
Current Load Balance:  round-robin
Access State:  active
Access State:  standby
=== /dev/rdsk/c6t600A0B800049E81C03BC48BD04D2d0s2 ===
Current Load Balance:  round-robin
Access State:  active
Access State:  standby
=== /dev/rdsk/c6t600A0B800049E81C03B948BD0494d0s2 ===
Current Load Balance:  round-robin
Access State:  active
Access State:  standby
=== /dev/rdsk/c6t600A0B800049E81C03B648BD044Ed0s2 ===
Current Load Balance:  round-robin
Access State:  active
Access State:  standby
=== /dev/rdsk/c6t600A0B800049E81C03B348BD03FAd0s2 ===
Current Load Balance:  round-robin
Access State:  active
Access State:  standby
=== /dev/rdsk/c6t600A0B800049E81C03B048BD03BAd0s2 ===
Current Load Balance:  round-robin
Access State:  active
Access State:  standby
=== /dev/rdsk/c6t600A0B800049E81C03AD48BD0376d0s2 ===
Current Load Balance:  round-robin
Access State:  active
Access State:  standby
=== /dev/rdsk/c6t600A0B800049E81C03AA48BD0338d0s2 ===
Current Load Balance:  round-robin
Access State:  active
Access State:  standby
=== /dev/rdsk/c6t600A0B800049E81C03A748BD02FAd0s2 ===
Current Load Balance:  round-robin
Access State:  active
Access State:  standby
=== /dev/rdsk/c6t600A0B800049E81C03A448BD02BAd0s2 ===
Current Load Balance:  round-robin
Access State:  active
Access State:  standby
=== /dev/rdsk/c6t600A0B800049E81C03A148BD0276d0s2 ===
Current Load Balance:  round-robin
Access State:  active
Access State:  standby


Suggestions on where I might have messed up??

Thanks!!   --Kenny


Re: [zfs-discuss] Proposed 2540 and ZFS configuration

2008-09-02 Thread Will Murnane
On Tue, Sep 2, 2008 at 11:44, Bob Friesenhahn
<[EMAIL PROTECTED]> wrote:
> The fiber channel ... offers a bit more bandwidth than SAS.
The bandwidth part of this statement is not accurate.  SAS uses wide
ports composed of four 3 Gbit links (usually; other widths are
possible).  Each of these has a data rate of up to 300 MB/s (not 375,
due to 8b/10b coding).  Thus, a "single" SAS cable carries 1.2 GB/s,
while a single FC link carries 400 MB/s.  SAS links can be up to 8
meters long, although of course this does not compete with the km-long
links FC can achieve.
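
To make the SAS arithmetic explicit (a quick sanity check in shell):

    echo $(( 3000 * 8 / 10 / 8 ))      # per 3 Gbit lane: 300 MB/s after 8b/10b
    echo $(( 4 * 3000 * 8 / 10 / 8 ))  # x4 wide port: 1200 MB/s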

Will


Re: [zfs-discuss] Proposed 2540 and ZFS configuration

2008-09-02 Thread Bob Friesenhahn
On Tue, 2 Sep 2008, Kenny wrote:
>
> I used your script (thanks), but I fail to see which controller 
> controls which disk.  Your white paper shows six LUNs with the 
> active state first and then six with the active state second; 
> however, mine all show the active state first.
>
> Yes, I've verified that both controllers are up and CAM sees them 
> both.  mpathadm reports 4 paths to each LUN.

This is very interesting.  You are saying that 'mpathadm list lu' 
produces 'Total Path Count: 4' and 'Operational Path Count: 4'?  Do 
you have four FC connections between the server and the array?

What text is output from 'mpathadm show lu' for one of the LUNs?
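
If it helps, something like this dumps the relevant lines for every MPxIO
LUN at once (a rough sketch, assuming mpathadm is in the path; adjust the
grep pattern to taste):

    for lu in $(mpathadm list lu | awk '/rdsk/ {print $1}'); do
        echo "=== $lu ==="
        mpathadm show lu "$lu" | egrep 'Current Load Balance|Access State'
    done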

> Suggestions on where I might have messed up??

Perhaps not at all.  The approach used in my white paper is based on 
my own suppositions, since Sun has not offered any architectural 
documentation on the 2540 or the internal workings of MPxIO.

If you do have four FC connections it may be that MPxIO works 
differently and the trick I exploited is not valid.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



[zfs-discuss] raidz2 group size

2008-09-02 Thread Barton Fisk
Hi,
Forgive my ignorance of ZFS, but I have a customer that would like to set up 
three 14+2 raidz2 groups on a new thor with 48 1TB drives (updated thumper) so 
that 42TB for data could be achieved. What performance or other technical 
issues with a stripe 14 disks wide would he likely see? He does not want a hot 
spare.
Any advice appreciated in advance.


Re: [zfs-discuss] raidz2 group size

2008-09-02 Thread Will Murnane
On Tue, Sep 2, 2008 at 15:39, Barton Fisk <[EMAIL PROTECTED]> wrote:
> Hi,
> Forgive my ignorance of ZFS, but I have a customer that would like to set up 
> three 14+2 raidz2 groups on a new thor with 48 1TB drives (updated thumper) 
> so that 42TB for data could be achieved. What performance or other technical 
> issues with a stripe 14 disks wide would he likely see? He does not want a 
> hot spare.
Well, first, you need a boot disk of some sort.  Thor does have a CF
slot, so that's an option, but if the customer plans to put, say, home
directories on the root pool, using disks for boot might be a better
place to start.

Second, random read/write performance is going to suck a lot compared
to narrower stripes.  Raidz{,2} groups need to read from N-{1,2} disks
so that they can verify on reads that the checksum matches.  Thus,
with a 16-disk raidz2 group, you must wait for 14 IOs to complete
before you can decide whether you've got junk back from any disk.
Smaller groups would help this quite a bit, both because there are
more groups and because they're smaller.  Even four 10+2 groups (for a
still-respectable 40 TB) would be better, and six 6+2 groups (for 36
TB) would be my recommendation.  If space is really that much of a
concern, or you're doing sequential transfers to large files, 14+2
groups will probably be survivable... but I wouldn't use them.
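
For comparison, the six 6+2 layout is a single zpool create with six raidz2
vdevs (a sketch with placeholder device names c1t0d0 ... c1t47d0; substitute
the real controller/target numbering):

    zpool create tank \
      raidz2 c1t0d0  c1t1d0  c1t2d0  c1t3d0  c1t4d0  c1t5d0  c1t6d0  c1t7d0  \
      raidz2 c1t8d0  c1t9d0  c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 c1t15d0 \
      raidz2 c1t16d0 c1t17d0 c1t18d0 c1t19d0 c1t20d0 c1t21d0 c1t22d0 c1t23d0 \
      raidz2 c1t24d0 c1t25d0 c1t26d0 c1t27d0 c1t28d0 c1t29d0 c1t30d0 c1t31d0 \
      raidz2 c1t32d0 c1t33d0 c1t34d0 c1t35d0 c1t36d0 c1t37d0 c1t38d0 c1t39d0 \
      raidz2 c1t40d0 c1t41d0 c1t42d0 c1t43d0 c1t44d0 c1t45d0 c1t46d0 c1t47d0

The 14+2 variant is the same command with three raidz2 groups of 16 disks
each.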

Will


Re: [zfs-discuss] raidz2 group size

2008-09-02 Thread Barton Fisk
Sorry I omitted that CF will be the boot device. Thanks again.


Re: [zfs-discuss] raidz2 group size

2008-09-02 Thread Ian Collins
Barton Fisk wrote:
> Sorry I omitted that CF will be the boot device. Thanks again.
>   

What are you using for redundancy of the boot device?

Ian



Re: [zfs-discuss] raidz2 group size

2008-09-02 Thread Richard Elling
Barton Fisk wrote:
> Hi,
> Forgive my ignorance of ZFS, but I have a customer that would like to set up 
> three 14+2 raidz2 groups on a new thor with 48 1TB drives (updated thumper) 
> so that 42TB for data could be achieved. What performance or other technical 
> issues with a stripe 14 disks wide would he likely see? He does not want a 
> hot spare.
> Any advice appreciated in advance.
>   

RAIDoptimizer was designed to help you work through the possibilities
and make trade-offs.  Note that the man page does not encourage such
wide sets; you can see why when you run RAIDoptimizer.
http://ras4sun.sfbay.sun.com/HTMLS/tools.html
 -- richard



Re: [zfs-discuss] raidz2 group size

2008-09-02 Thread Richard Elling
Richard Elling wrote:
> Barton Fisk wrote:
>   
>> Hi,
>> Forgive my ignorance of ZFS, but I have a customer that would like to set up 
>> three 14+2 raidz2 groups on a new thor with 48 1TB drives (updated thumper) 
>> so that 42TB for data could be achieved. What performance or other technical 
>> issues with a stripe 14 disks wide would he likely see? He does not want a 
>> hot spare.
>> Any advice appreciated in advance.
>>   
>> 
>
> RAIDoptimizer was designed to help you work through the possibilities
> and make trade-offs.  Note that the man page does not encourage such
> wide sets, you can see why when you run RAIDoptimizer.
> http://ras4sun.sfbay.sun.com/HTMLS/tools.html
>   

Silly me.  It is still Monday, and I am coffee challenged.  RAIDoptimizer
is still an internal tool.  However, for those who are interested in the
results of a RAIDoptimizer run for 48 disks, see:
http://blogs.sun.com/relling/entry/sample_raidoptimizer_output
 -- richard



Re: [zfs-discuss] Formatting Problem of ZFS Adm Guide (pdf)

2008-09-02 Thread W. Wayne Liauh
> ZFS Administration Guide (in PDF format) does not
> look very professional (at least on
> Evince/OS2008.05).  Please see attached screenshot.

I have cleaned up the original pdf file.  Please see:

http://tinyurl.com/zfs-pdf

The invisible parts (original) are now visible (corrected).

It is not too difficult a job.

I just wish that we could be more thoughtful.

The only people affected are us hard-core OpenSolaris users.


Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-09-02 Thread Bill Sommerfeld
On Sun, 2008-08-31 at 12:00 -0700, Richard Elling wrote:
> 2. The algorithm *must* be computationally efficient.
>We are looking down the tunnel at I/O systems that can
>deliver on the order of 5 Million iops.  We really won't
>have many (any?) spare cycles to play with.

If you pick the constants carefully (powers of two) you can do the TCP
RTT + variance estimation using only a handful of shifts, adds, and
subtracts.
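
To make that concrete, the classic Van Jacobson estimator keeps srtt scaled
by 8 and rttvar by 4, so each update really is just shifts, adds, and
subtracts (an illustrative sketch in shell arithmetic; an in-kernel version
would be the same integer math in C):

    # srtt and rttvar carry state between samples; all values share a time unit
    err=$(( sample - (srtt >> 3) ))
    srtt=$(( srtt + err ))
    [ "$err" -lt 0 ] && err=$(( -err ))
    rttvar=$(( rttvar + err - (rttvar >> 2) ))
    rto=$(( (srtt >> 3) + rttvar ))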

> In both of these cases, the solutions imply multi-minute timeouts are
> required to maintain a stable system.  

Again, there are different uses for timeouts:
 1) how long should we wait on an ordinary request before deciding to
try "plan B" and go elsewhere (a la B_FAILFAST)
 2) how long should we wait (while trying all alternatives) before
declaring an overall failure and giving up.

The RTT estimation approach is really only suitable for the former,
where you have some alternatives available (retransmission in the case
of TCP; trying another disk in the case of mirrors, etc.). 

When you've tried all the alternatives and nobody's responding, there's
no substitute for just retrying for a long time.

- Bill




Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-09-02 Thread Bill Sommerfeld
On Sun, 2008-08-31 at 15:03 -0400, Miles Nordin wrote:

> It's sort of like network QoS, but not quite, because: 
> 
>   (a) you don't know exactly how big the ``pipe'' is, only
>   approximately, 

In an IP network, end nodes generally know no more than the pipe size of
the first hop -- and in some cases (such as true CSMA networks like
classical ethernet or wireless) only have an upper bound on the pipe
size.  

Beyond that, they can only estimate the characteristics of the rest of
the network by observing its behavior - all they get is end-to-end
latency, and *maybe* a 'congestion observed' mark set by an intermediate
system.

>   (c) all the fabrics are lossless, so while there are queues which
>   undesirably fill up during congestion, these queues never drop
>   ``packets'' but instead exert back-pressure all the way up to
>   the top of the stack.

Hmm.  I don't think the back pressure makes it all the way up to zfs
(the top of the block storage stack) except as added latency.  

(On the other hand, if it did, zfs could schedule around it both for
reads and writes, avoiding pouring more work onto already-congested
paths.)

> I'm surprised we survive as well as we do without disk QoS.  Are the
> storage vendors already doing it somehow?

I bet that (as with networking) in many/most cases overprovisioning the
hardware and running at lower average utilization is often cheaper in
practice than running close to the edge and spending a lot of expensive
expert time monitoring performance and tweaking QoS parameters.



Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-09-02 Thread Miles Nordin
> "bs" == Bill Sommerfeld <[EMAIL PROTECTED]> writes:

bs> In an ip network, end nodes generally know no more than the
bs> pipe size of the first hop -- and in some cases (such as true
bs> CSMA networks like classical ethernet or wireless) only have
bs> an upper bound on the pipe size.

Yeah, but the most complicated and well-studied queueing disciplines
(like everything implemented in ALTQ, and I think everything
implemented by the two different Cisco queueing frameworks (the CBQ
process-switched one, and the diffserv-like cat6500 ASIC-switched
one)) are (a) hop-by-hop, so the algorithm one discusses only applies
to a single hop, a single transmit queue, never to a whole path, and
(b) assume a unidirectional link of known fixed size, not a broadcast
link or token ring or anything like that.

For wireless they are not using the fancy algorithms.  They're doing
really primitive things like ``unsolicited grants''---basically just
TDMA channels.

I wouldn't think of ECN as part of QoS exactly, because it separates
so cleanly from your choice of queue discipline.

bs> hmm.  I don't think the back pressure makes it all the way up
bs> to zfs

I guess I was thinking of the lossless fabrics, which might change
some of the assumptions behind the schedulers designed for IP
QoS.  For example, most of the IP QoS systems divide the usual
one-big-queue into many smaller queues.  A ``classifier'' picks some
packets as pink ones and some as blue, and assigns them thusly to
queues, and they always get classified to the end of the queue.  The
``scheduler'' then decides from which queue to take the next packet.
The primitive QoS in Ethernet chips might give you 4 queues that are
either strict-priority or weighted-round-robin.  Link-sharing
schedulers like CBQ or HFSC make a hierarchy of queues where, to the
extent that they're work-conserving, child queues borrow unused
transmission slots from their ancestors.  Or there's a flat set of 256
hash-bucket queues for WFQ, which just tries to separate one job from another.

but no matter which of those you choose, within each of the smaller
queues you get an orthogonal choice of RED or FIFO.  There's no such
thing as RED or FIFO with queues in storage networks because there is
no packet dropping.

This confuses the implementation of the upper queueing discipline
because what happens when one of the small queues fills up?  How can
you push up the stack, ``I will not accept another CDB if I would
classify it as a Pink CDB, because the Pink queue is full.  I will
still accept Blue CDB's though.''  Needing to express this destroys
the modularity of the IP QoS model.  We can only say ``block---no more
CDB's accepted,'' but that defeats the whole purpose of the QoS!  So
how do you say no more CDB's of the pink kind?  With normal hop-by-hop
QoS, I don't think we can.

This inexpressibility of ``no more pink CDB's'' is the same reason
enterprise Ethernet switches never actually use the gigabit ethernet
``flow control'' mechanism.  Yeah, they negotiate flow control and
obey received flow control signals, but they never _assert_ a flow
control signal, at least not for normal output-queue congestion,
because this would block reception of packets that would get switched
to uncongested output ports, too.  Proper enterprise switches would
assert flow control only for rare pathological cases like backplane
saturation or cheap oversubscribed line cards.  No matter what
overzealous powerpoint monkeys claim, CEE/FCoE is _not_ going to use
``pause frames.''

I guess you're right that some of the ``queues'' in storage are sort
of arbitrarily sized, like the write queue which could take up the
whole buffer cache, so back pressure might not be the right way to
imagine it.




Re: [zfs-discuss] raidz2 group size

2008-09-02 Thread Brandon High
On Tue, Sep 2, 2008 at 2:15 PM, Richard Elling <[EMAIL PROTECTED]> wrote:
> Silly me.  It is still Monday, and I am coffee challenged.  RAIDoptimizer
> is still an internal tool.  However, for those who are interested in the
> results
> of a RAIDoptimizer run for 48 disks, see:
> http://blogs.sun.com/relling/entry/sample_raidoptimizer_output


Richard --

Is there a chance that RAIDoptimizer will be made available to the
unwashed masses?

Could you post the results for a few runs with other numbers of disks,
such as 8 (which is the number of drives I plan to use) or 12 (the
number of drives in the 2510), etc.?

-B

-- 
Brandon High [EMAIL PROTECTED]
"You can't blow things up with schools and hospitals." -Stephen Dailey