On May 31, 2006, at 8:56 AM, Roch Bourbonnais - Performance
Engineering wrote:
> I'm not taking a stance on this, but if I keep a controller
> full of 128K I/Os and assuming they are targeting
> contiguous physical blocks, how different is that to issuing
> a very large I/O ?
There are differences at the host, the HBA, the disk or RAID
controller, and on the wire.
At the host:
The SCSI/FC/ATA stack is run once for each I/O. This takes
a bit of CPU. We generally take one interrupt for each I/O
(if the CPU is fast enough), so instead of taking one
interrupt for 8 MB (for instance), we take 64.
We run through the IOMMU or page translation code once per
page, but the overhead of initially setting up the IOMMU or
starting the translation loop happens once per I/O.
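To put rough numbers on the host side, here is a small sketch that counts the per-I/O and per-page events for the 8 MB example. The function name and the 4 KB page size are my own assumptions for illustration (the real page size depends on the platform), and it assumes exactly one interrupt per I/O.

```python
KB = 1024
MB = 1024 * KB

def host_event_counts(total_bytes, io_bytes, page_bytes=4 * KB):
    """Count host-side events for moving total_bytes in io_bytes chunks.

    Hypothetical model: one stack traversal, one interrupt and one
    IOMMU setup per I/O; one page translation per page.
    page_bytes is an assumed value, not taken from any real system.
    """
    ios = -(-total_bytes // io_bytes)        # ceiling division
    pages = -(-total_bytes // page_bytes)    # ceiling division
    return {
        "ios": ios,
        "interrupts": ios,
        "iommu_setups": ios,
        "page_translations": pages,
    }

small = host_event_counts(8 * MB, 128 * KB)
large = host_event_counts(8 * MB, 8 * MB)
print(small)
print(large)
```

The per-page work comes out the same either way; only the per-I/O terms (stack traversal, interrupt, IOMMU setup) scale with the number of commands.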
At the HBA:
There is some overhead each time that the controller switches
processing from one I/O to another. This isn't too large on a
modern system, but it does add up.
There is overhead on the PCI (or other) bus for the small
transfers that make up the command block and scatter/gather
list for each I/O. Again, it adds up (faster than you might
expect, since PCI Express can move 128 KB very quickly).
There is a limit on the maximum number of outstanding I/O
requests, but we're unlikely to hit this in normal use; it
is typically at least 256 and more often 1024 or more on
newer hardware. (This is shared for the whole channel
in the FC and SCSI case, and may be shared between multiple
channels for SAS or multi-port FC cards.)
There is often a small cache of commands which can be handled
quickly; commands outside of this cache (which may hold 4 to
16 or so) are much slower to "context-switch" in when their
data is needed; in particular, the scatter/gather list may
need to be read again.
At the disk or RAID:
There is a fixed overhead for processing each command. This
can be fairly readily measured, and roughly reflects the
difference between delivered 512-byte IOPS and bandwidth for
a large I/O. Some of it is related to parsing the CDB and
starting command execution; some of it is related to cache
management.
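One rough way to read that measurement is to model each command as a fixed overhead plus size divided by streaming bandwidth, with commands serviced one at a time. This is a simplification, and the figures below (10,000 IOPS at 512 bytes, 100 MB/s streaming) are made up for illustration, not measured on any device.

```python
def per_command_overhead_s(small_iops, small_bytes, stream_bytes_per_s):
    """Estimate the fixed per-command time from small-I/O IOPS.

    Assumes time per command = overhead + bytes / streaming bandwidth,
    with no overlap between commands (a simplification).
    """
    time_per_small_io = 1.0 / small_iops
    transfer_time = small_bytes / stream_bytes_per_s
    return time_per_small_io - transfer_time

def effective_bandwidth(io_bytes, overhead_s, stream_bytes_per_s):
    """Delivered bytes/s when every command pays overhead_s."""
    return io_bytes / (overhead_s + io_bytes / stream_bytes_per_s)

# Hypothetical device figures, chosen only to show the shape of the curve.
STREAM = 100e6       # bytes/s
SMALL_IOPS = 10_000  # delivered 512-byte IOPS

overhead = per_command_overhead_s(SMALL_IOPS, 512, STREAM)
bw_128k = effective_bandwidth(128 * 1024, overhead, STREAM)
bw_8m = effective_bandwidth(8 * 1024 * 1024, overhead, STREAM)
print(f"overhead per command: {overhead * 1e6:.1f} us")
print(f"128K I/Os: {bw_128k / 1e6:.1f} MB/s, 8M I/Os: {bw_8m / 1e6:.1f} MB/s")
```

Under this model the large I/O gets closer to the streaming rate because the fixed cost is paid once rather than 64 times.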
There is some overhead for switching between data transfers
for each command. A typical track on a disk may hold 400K
or so of data, and a full-track transfer is optimal (runs at
platter speed). A partial-track transfer immediately followed
by another may take enough time to switch that we sometimes
lose one revolution (particularly on disks which do not have
sector headers). Write caching should nearly eliminate this
as a concern, however.
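As a back-of-the-envelope illustration of the lost-revolution case, the sketch below assumes a 10,000 RPM spindle (my assumption, not a figure for any particular disk), a 400K track, a full-track transfer taking exactly one revolution, and each missed switch costing one extra revolution. The function name is hypothetical.

```python
import math

def track_transfer_time_s(track_bytes, io_bytes, rpm, missed_switches=0):
    """Time to move one track's worth of data split into io_bytes commands.

    Simplified model: the data itself takes one revolution, and each
    switch between commands that is missed costs one more revolution.
    """
    revolution_s = 60.0 / rpm
    commands = math.ceil(track_bytes / io_bytes)
    switches = commands - 1
    if not 0 <= missed_switches <= switches:
        raise ValueError("missed_switches must be between 0 and commands - 1")
    return revolution_s * (1 + missed_switches)

TRACK = 400 * 1024   # "400K or so", per the text above
RPM = 10_000         # assumed spindle speed

full_track = track_transfer_time_s(TRACK, TRACK, RPM)
one_miss = track_transfer_time_s(TRACK, 128 * 1024, RPM, missed_switches=1)
worst = track_transfer_time_s(TRACK, 128 * 1024, RPM, missed_switches=3)
print(f"full track: {full_track * 1e3:.1f} ms")
print(f"128K, one missed switch: {one_miss * 1e3:.1f} ms")
print(f"128K, every switch missed: {worst * 1e3:.1f} ms")
```

With 128K commands a 400K track needs four commands and three switches, so even a single missed switch doubles the time for that track in this model.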
There is a fixed-size window of commands that can be
reordered on the device. Data transfer within a command can
be reordered arbitrarily (for parallel SCSI and FC, though
not for ATA or SAS). It's good to have lots of outstanding
commands, but if they are all sequential, there's not much
point (no reason to reorder them, except perhaps if you're
going backwards, and FC/SCSI can handle this anyway).
On the wire:
Sending a command and its completion takes time that could
be spent moving data instead; but for most protocols this
probably isn't significant.
You can actually see most of this with a PCI and protocol
analyzer.
-- Anton
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss