On Mon, May 10, 2010 at 3:53 PM, Geoff Nordli <geo...@gnaa.net> wrote:
> Doesn't this alignment have more to do with aligning writes to the
> stripe/segment size of a traditional storage array?  The articles I am

It is a lot like a stripe / segment size. If you want to think of it
in those terms, you've got a segment of 512b (the iSCSI block size)
and a width of 16, giving you an 8k stripe size. Any write that is
less than 8k will require a read-modify-write (RMW) cycle, and any
write in multiples of 8k will do "full stripe" writes. If the write
doesn't start on an 8k boundary, you risk having writes span multiple
underlying zvol blocks.
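
To make the stripe analogy concrete, here is a rough Python sketch of
which zvol blocks a write touches and whether it needs an RMW cycle. It
assumes an 8k volblocksize and 512b LUN sectors as above; the function
names are made up for illustration and are not part of any ZFS tool.

```python
VOLBLOCKSIZE = 8192  # zvol block size in bytes (the "stripe")
SECTOR = 512         # block size the LUN reports to the initiator


def blocks_touched(offset, length, blocksize=VOLBLOCKSIZE):
    """Return (first, last) zvol block index covered by a write.

    offset and length are in bytes.
    """
    first = offset // blocksize
    last = (offset + length - 1) // blocksize
    return first, last


def needs_rmw(offset, length, blocksize=VOLBLOCKSIZE):
    """True if the write only partially covers some zvol block."""
    return offset % blocksize != 0 or (offset + length) % blocksize != 0


# Aligned 8k write: one full block, no RMW.
print(blocks_touched(0, 8192), needs_rmw(0, 8192))
# 4k write: smaller than a block, so RMW.
print(blocks_touched(0, 4096), needs_rmw(0, 4096))
# 8k write starting at sector 63: straddles two blocks, RMW on both.
print(blocks_touched(63 * SECTOR, 8192), needs_rmw(63 * SECTOR, 8192))
```

The last case is the one that hurts: a single 8k write from the
initiator turns into partial updates of two zvol blocks.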

There's an explanation of WD's "Advanced Format" at Anandtech that
describes the problem with 4k physical sectors, here
http://www.anandtech.com/show/2888. Instead of sector, think zvol
block though.

When using a zvol, you've essentially got $volblocksize sized physical
sectors, but the initiator sees the 512b block size that the LUN is
reporting. If you don't block align, you risk having a write straddle
two zfs blocks. There may be some benefit to using a 4k volblocksize,
but you'll use more time and space on block checksums, etc. in your
zpool. I think 8k is a reasonable trade off.

> reading suggests creating a small unused partition to take up the space up
> to 127bytes (assuming 128byte segment), then create the real partition from
> the 128th sector going forward.  I am not sure how this would happen with
> zfs.

If you're using the whole disk with zfs, you don't need to worry about
it. If you're using fdisk partitions or slices, you need to be a little
more careful.

I made an attempt to 4k block align the SSD that I'm using for a slog
/ L2ARC, which in theory should line up better with the device's erase
boundary. While not really pertinent to this discussion, it gives some
idea of how to do it.

You want the filesystem to start at a point where ( $offset *
$sector_size * $sectors_per_cylinder ) % 4096 = 0.

For most LBA drives, you've got 16065 sectors/cylinder and 512b
sectors, giving 8 as the smallest offset that will align.
( 8 * 512 * 16065 ) % 4096 = 0
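
The same arithmetic as a small Python sketch, in case you want to try
other geometries. The function name and defaults are just for
illustration; the offset is counted in cylinders from the start of the
disk.

```python
def smallest_aligned_offset(sector_size=512,
                            sectors_per_cylinder=16065,
                            align=4096):
    """Smallest cylinder offset (>= 1) whose byte position is a
    multiple of `align`."""
    offset = 1
    # offset == align always satisfies the test, so this terminates.
    while (offset * sector_size * sectors_per_cylinder) % align != 0:
        offset += 1
    return offset


print(smallest_aligned_offset())            # 255 heads, 63 sectors/track
print(smallest_aligned_offset(align=8192))  # same geometry, 8k target
```

Since 16065 is odd, the cylinder size contributes nothing towards the
alignment, so the offset itself has to supply the whole factor of 8
(or 16 for an 8k boundary).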

First you have to look at fdisk (on an SMI labeled disk) and realize
that you're going to lose the first cylinder to the MBR. When you then
create slices in format, it'll report one cylinder less than fdisk
did, so remember to account for that in your offset.

For an iSCSI LUN used by a VM, you should align its filesystem on a
zvol block boundary. Windows Vista and Server 2008 use 240 heads & 63
sectors/track, so they are already 8k block aligned. Linux, Solaris,
and BSD also let you specify the geometry used by fdisk, but I wasn't
comfortable doing it with Solaris since you have to create a geometry
file first.
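
A quick way to see why the geometry matters is to check whether one
cylinder is a whole number of 8k blocks. This sketch assumes 512b
sectors and partitions that start on a cylinder boundary; the helper
names are hypothetical.

```python
def cylinder_bytes(heads, sectors_per_track, sector_size=512):
    """Size of one cylinder in bytes for the given geometry."""
    return heads * sectors_per_track * sector_size


def cylinders_are_aligned(heads, sectors_per_track, blocksize=8192):
    """True if every cylinder boundary falls on a `blocksize` boundary."""
    return cylinder_bytes(heads, sectors_per_track) % blocksize == 0


# 240 heads x 63 sectors/track: 15120 sectors = 945 * 16
print(cylinders_are_aligned(240, 63))
# 255 heads x 63 sectors/track: 16065 sectors, an odd number
print(cylinders_are_aligned(255, 63))
```

With 240 heads any cylinder-aligned partition lands on an 8k boundary,
while with 255 heads only every 16th cylinder does.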

For my 30GB OCZ Vertex:

bh...@basestar:~$ pfexec fdisk -W - /dev/rdsk/c1t0d0p0
* /dev/rdsk/c1t0d0p0 default fdisk table
* Dimensions:
*    512 bytes/sector
*     63 sectors/track
*    255 tracks/cylinder
*   3892 cylinders
[..]
* Id    Act  Bhead  Bsect  Bcyl    Ehead  Esect  Ecyl    Rsect      Numsect
  191   128  0      1      1       254    63     1023    16065      62508915


bh...@basestar:~$ pfexec prtvtoc  /dev/rdsk/c1t0d0p0
* /dev/rdsk/c1t0d0p0 partition map
*
* Dimensions:
*     512 bytes/sector
*      63 sectors/track
*     255 tracks/cylinder
*   16065 sectors/cylinder
*    3891 cylinders
*    3889 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*           0    112455    112454
*    62428590     48195  62476784
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      4    00     112455   2056320   2168774
       1      4    01    2168775  60243750  62412524
       2      5    01          0  62508915  62508914
       8      1    01          0     16065     16064


-B

-- 
Brandon High : bh...@freaks.com