Re: [zfs-discuss] What causes slow performance under load?

Gary Mills Sun, 19 Apr 2009 08:59:53 -0700

On Sat, Apr 18, 2009 at 11:45:54PM -0500, Mike Gerdts wrote:
> [perf-discuss cc'd]
> 
> On Sat, Apr 18, 2009 at 4:27 PM, Gary Mills <mi...@cc.umanitoba.ca> wrote:
> > Many other layers are involved in this server.  We use scsi_vhci for
> > redundant I/O paths and Sun's iSCSI initiator to connect to the
> > storage on our NetApp filer.  The kernel plays a part as well.  How
> > do we determine which layer is responsible for the slow performance?
> 
> Have you disabled the Nagle algorithm for the iSCSI initiator?
> 
> http://bugs.opensolaris.org/view_bug.do?bug_id=6772828


I tried that on our test IMAP backend the other day.  It made no
significant difference to read or write times or to ZFS I/O bandwidth.
I conclude that the iSCSI initiator has already sized its TCP packets
to avoid Nagle delays.

> Also, you may want to consider doing backups from the NetApp rather
> than from the Solaris box.

I've certainly recommended finding a different way to perform backups.

> Assuming all of your LUNs are in the same
> volume on the filer, a snapshot should be a crash-consistent image of
> the zpool.  You could verify this by making the snapshot rw and trying
> to import the snapshotted LUNs on another host.

That part sounds scary!  The filer exports four LUNs that are combined
into one ZFS pool on the IMAP server.  These LUNs are volumes on the
filer.  How can we safely import them on another host?

> Anyway, this would
> remove the backup-related stress on the T2000.  You can still do
> snapshots at the ZFS layer to give you file level restores.  If the
> NetApp caught on fire, you would simply need to restore the volume
> containing the LUNs (presumably a small collection of large files)
> which would go a lot quicker than a large collection of small files.

Yes, a disaster recovery would be much quicker in this case.

> Since iSCSI is in the mix, you should also be sure that your network
> is appropriately tuned.  Assuming that you are using the onboard
> e1000g NICs, be sure that none of the "bad" counters are incrementing:
> 
> $ kstat -p e1000g | nawk '$0 ~ /err|drop|fail|no/ && $NF != 0'
> 
> If this gives any output, there is likely something amiss with your network.

Only this:
    e1000g:0:e1000g0:unknowns       1764449

I don't know what those are, but it's e1000g1 and e1000g2 that are
used for the iSCSI network.
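
To narrow the check to the iSCSI interfaces only, something like the
following might do.  This is a minimal sketch, not a tested Solaris tool:
the function name nic_errors is made up for illustration, it assumes the
usual "module:instance:name:statistic value" layout of kstat -p output,
and it uses plain awk where nawk would be used on Solaris.  Unlike the
one-liner above, it matches the pattern against the statistic name only.

```shell
# Hypothetical helper: reads `kstat -p e1000g` output on stdin and prints
# non-zero error-like counters for the listed instance numbers only.
# Usage (assumed): kstat -p e1000g | nic_errors 1 2
nic_errors() {
    awk -v want="$*" '
        BEGIN {
            # Build a lookup table of the instance numbers to keep.
            n = split(want, list, " ")
            for (i = 1; i <= n; i++) keep[list[i]] = 1
        }
        {
            # Field 1 is module:instance:name:statistic; the last field is the value.
            split($1, part, ":")
            if ((part[2] in keep) && part[4] ~ /err|drop|fail|no/ && $NF != 0)
                print
        }
    '
}
```

With instances 1 and 2 given, the "unknowns" counter on e1000g0 would be
filtered out, since that interface does not carry iSCSI traffic.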

> The output from "iostat -xCn 10" could be interesting as well.  If
> asvc_t is high (>30?), it means the filer is being slow to respond.
> If wsvc_t is frequently non-zero, there is some sort of a bottleneck
> that prevents the server from sending requests to the filer.  Perhaps
> you have tuned ssd_max_throttle or Solaris has backed off because the
> filer said to slow down.  (Assuming that ssd is used with iSCSI LUNs).

Here's an example, taken from one of the busy periods:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    5.0    0.0    7.7  0.0  0.1    4.1   24.8   1   1 c1t2d0
   27.0   13.8 1523.4  172.9  0.0  0.5    0.0   11.8   0  38 c4t60A98000433469764E4A2D456A644A74d0
   42.0   21.4 2027.3  350.0  0.0  0.9    0.0   13.9   0  60 c4t60A98000433469764E4A2D456A696579d0
   40.8   25.0 1993.5  339.1  0.0  0.8    0.0   11.8   0  52 c4t60A98000433469764E4A476D2F664E4Fd0
   42.0   26.6 1968.4  319.1  0.0  0.8    0.0   11.8   0  56 c4t60A98000433469764E4A476D2F6B385Ad0

The service times seem okay to me.  There's no `throttle' setting in
any of the relevant driver conf files.
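
The two rules of thumb quoted above (asvc_t above roughly 30 ms, or a
non-zero wsvc_t) could be applied mechanically to iostat output.  The
following is only a sketch under stated assumptions: the function name
flag_slow_devices is hypothetical, the 30 ms default is the tentative
figure from the quoted text rather than a documented limit, and each
device row is assumed to sit on a single line with the 11 columns that
"iostat -xn" prints.

```shell
# Hypothetical helper: reads `iostat -xn` style output on stdin and reports
# devices whose asvc_t exceeds a limit (default 30 ms) or whose wsvc_t is non-zero.
# Usage (assumed): iostat -xCn 10 | flag_slow_devices 30
flag_slow_devices() {
    awk -v limit="${1:-30}" '
        # Data rows have 11 fields and begin with a number (the r/s column);
        # this skips the banner and the column-header line.
        NF == 11 && $1 ~ /^[0-9.]+$/ {
            # Column 7 is wsvc_t, column 8 is asvc_t, column 11 is the device.
            if ($8 + 0 > limit + 0) print $11 ": asvc_t " $8 " ms exceeds " limit
            if ($7 + 0 > 0)         print $11 ": wsvc_t " $7 " ms is non-zero"
        }
    '
}
```

Run against the sample above, this would flag only the wsvc_t of 4.1 on
c1t2d0, which is not one of the four filer LUNs.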

> What else is happening on the filer when mail gets slow?  That is, are
> you experiencing slowness due to a mail peak or due to some research
> project that happens to be on the same spindles?  What does the
> network look like from the NetApp side?

Our NetApp guy tells me that the filer is operating normally when the
problem occurs.  The iSCSI network is less than 10% utilized.

> Are the mail server and the NetApp attached to the same switch, or are
> they at opposite ends of the campus?  Is there something between them
> that is misbehaving?

I don't think so.  We have dedicated Ethernet ports on both the IMAP
server and the filer for iSCSI, along with a pair of dedicated switches.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
