Re: [zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris

Anton Rang Wed, 31 May 2006 07:48:56 -0700

On May 31, 2006, at 8:56 AM, Roch Bourbonnais - PerformanceEngineering wrote:

I'm not taking a stance on this, but if I keep a controller
full of 128K I/Os, and assuming they are targeting
contiguous physical blocks, how different is that from issuing
a very large I/O?


There are differences at the host, the HBA, the disk or RAID
controller, and on the wire.

At the host:

  The SCSI/FC/ATA stack is run once for each I/O.  This takes
  a bit of CPU.  We generally take one interrupt for each I/O
  (if the CPU is fast enough), so instead of taking one
  interrupt for 8 MB (for instance), we take 64.

  We run through the IOMMU or page translation code once per
  page, but the overhead of initially setting up the IOMMU or
  starting the translation loop happens once per I/O.
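
  To make the arithmetic concrete, here is a rough sketch in Python.
  The 8 MB total and the 128 KB request size come from the example
  above; the 8 KB page size and the function name are assumptions
  made up for illustration, not anything from a real driver.

```python
def host_costs(total_bytes, io_bytes, page_bytes=8 * 1024):
    """Count how often each kind of host-side work runs.

    page_bytes is an assumed page size, not a measured one.
    """
    ios = -(-total_bytes // io_bytes)      # ceiling division: number of I/O requests
    pages = -(-total_bytes // page_bytes)  # number of pages to translate
    return {
        "stack_runs": ios,           # the SCSI/FC/ATA stack runs once per I/O
        "interrupts": ios,           # roughly one interrupt per I/O
        "iommu_setups": ios,         # setup cost is paid once per I/O
        "page_translations": pages,  # same total whatever the I/O size
    }

KB = 1024
MB = 1024 * KB
small = host_costs(8 * MB, 128 * KB)
large = host_costs(8 * MB, 8 * MB)
print(small["interrupts"], large["interrupts"])
```

  The per-page work is identical in both cases; only the per-I/O
  terms change with the request size.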

At the HBA:

  There is some overhead each time that the controller switches
  processing from one I/O to another.  This isn't too large on a
  modern system, but it does add up.

  There is overhead on the PCI (or other) bus for the small
  transfers that make up the command block and scatter/gather
  list for each I/O.  Again, it adds up (faster than you might
  expect, since PCI Express can move 128 KB very quickly).

  There is a limit on the maximum number of outstanding I/O
  requests, but we're unlikely to hit this in normal use; it
  is typically at least 256 and more often 1024 or more on
  newer hardware.  (This is shared for the whole channel
  in the FC and SCSI case, and may be shared between multiple
  channels for SAS or multi-port FC cards.)

  There is often a small cache of commands which can be handled
  quickly; commands outside of this cache (which may hold 4 to
  16 or so) are much slower to "context-switch" in when their
  data is needed; in particular, the scatter/gather list may
  need to be read again.

At the disk or RAID:

  There is a fixed overhead for processing each command.  This
  can be measured fairly readily: it roughly reflects the gap
  between the delivered 512-byte IOPS and the rate that
  large-I/O bandwidth would imply.  Some of it is related to
  parsing the CDB and starting command execution; some of it is
  related to cache management.
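
  One way to turn that measurement into a number, sketched in
  Python.  This assumes a simple serialized model in which each
  command costs a fixed overhead plus size/bandwidth, with no
  overlap between commands; a real device with queuing may hide
  part of the overhead.  The 10,000 IOPS and 60 MB/s figures are
  invented inputs, not measurements.

```python
def per_command_overhead(small_iops, bandwidth_bytes_per_s, small_bytes=512):
    """Fixed time per command, inferred from small-I/O IOPS and streaming bandwidth."""
    time_per_small_io = 1.0 / small_iops
    pure_transfer_time = small_bytes / bandwidth_bytes_per_s
    return time_per_small_io - pure_transfer_time

def throughput(io_bytes, overhead_s, bandwidth_bytes_per_s):
    """Delivered bytes/s when every command pays overhead_s before transferring."""
    return io_bytes / (overhead_s + io_bytes / bandwidth_bytes_per_s)

# Hypothetical inputs, for illustration only.
bandwidth = 60e6      # bytes/s for very large sequential I/O
small_iops = 10_000   # delivered sequential 512-byte IOPS

overhead = per_command_overhead(small_iops, bandwidth)
rate_128k = throughput(128 * 1024, overhead, bandwidth)
rate_8m = throughput(8 * 1024 * 1024, overhead, bandwidth)
print(f"overhead per command: {overhead * 1e6:.1f} us")
print(f"128 KB I/Os: {rate_128k / 1e6:.1f} MB/s, 8 MB I/Os: {rate_8m / 1e6:.1f} MB/s")
```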

  There is some overhead for switching between data transfers
  for each command.  A typical track on a disk may hold 400K
  or so of data, and a full-track transfer is optimal (runs at
  platter speed).  A partial-track transfer immediately followed
  by another may take enough time to switch that we sometimes
  lose one revolution (particularly on disks which do not have
  sector headers).  Write caching should nearly eliminate this
  as a concern, however.
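
  A back-of-the-envelope sketch of what a lost revolution costs,
  in Python.  The 400 KB track is the figure above; the 10,000 RPM
  spindle speed and the miss fractions are assumptions chosen only
  to show the shape of the effect, and the model ignores seeks and
  head switches.

```python
def effective_rate(io_bytes, track_bytes, rpm, miss_fraction):
    """Bytes/s when a fraction of commands each lose one full revolution."""
    rev_s = 60.0 / rpm                          # time for one revolution
    transfer_s = rev_s * io_bytes / track_bytes # time the data spends under the head
    return io_bytes / (transfer_s + miss_fraction * rev_s)

KB = 1024
track = 400 * KB   # from the text above
rpm = 10_000       # assumed spindle speed

platter = effective_rate(128 * KB, track, rpm, 0.0)  # no lost revolutions
worst = effective_rate(128 * KB, track, rpm, 1.0)    # every command loses one
print(f"platter speed: {platter / 1e6:.1f} MB/s")
print(f"128 KB I/Os, always missing: {worst / platter:.0%} of platter speed")
```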

  There is a fixed-size window of commands that can be
  reordered on the device.  Data transfer within a command can
  be reordered arbitrarily (for parallel SCSI and FC, though
  not for ATA or SAS).  It's good to have lots of outstanding
  commands, but if they are all sequential, there's not much
  point (no reason to reorder them, except perhaps if you're
  going backwards, and FC/SCSI can handle this anyway).

On the wire:

  Sending a command and its completion takes time that could
  be spent moving data instead; but for most protocols this
  probably isn't significant.

You can actually see most of this with a PCI and protocol
analyzer.

-- Anton

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
