On 08/20/2011 12:53 AM, Stan Hoeppner wrote:
> On 8/19/2011 4:38 PM, Dion Kant wrote:
>
>> I now think I understand the "strange" behaviour for block sizes not an
>> integral multiple of 4096 bytes. (Of course you guys already knew the
>> answer but just didn't want to make it easy for me to find the answer.)
>>
>> The newer disks today have a sector size of 4096 bytes. They may still
>> be reporting 512 bytes, but this is to keep some ancient OSes working.
>>
>> When a block write is not an integral multiple of 4096 bytes, for example
>> 512, 4095 or 8191 bytes, the driver must first read the sector, modify
>> it and finally write it back to the disk. This explains the bi and the
>> increased number of interrupts.
>>
>> I did some Google searches but did not find much. Can someone confirm
>> this hypothesis?
>
> The read-modify-write performance penalty of unaligned partitions on the
> "Advanced Format" drives (4KB native sectors) is a separate, unrelated
> issue.
>
> As I demonstrated earlier in this thread, the performance drop seen when
> using dd with block sizes less than 4KB affects traditional 512B/sector
> drives as well. If one has a misaligned partition on an Advanced Format
> drive, one takes a double performance hit when dd bs is less than 4KB.
>
> Again, everything in (x86) Linux is optimized around the 'magic' 4KB
> size, including page size, filesystem block size, and LVM block size.

Ok, I have done some browsing through the kernel sources and I understand
the VFS a bit better now. When a read/write is issued on a block device
file, the block size is 4096 bytes, i.e. reads/writes to the disk are done
in blocks equal to the page cache size: the magic 4KB.
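
(As a quick check of the 4096-byte-sector hypothesis: on recent kernels,
2.6.32 and later, the logical and physical sector sizes a drive reports can
be read straight from sysfs; sdc below is just a placeholder for the drive
under test.)

  # sector size the drive presents to the OS (often still 512)
  cat /sys/block/sdc/queue/logical_block_size
  # native sector size of the medium (4096 on Advanced Format drives)
  cat /sys/block/sdc/queue/physical_block_size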
Submitting a request with a block size which is not an integral multiple of
4096 bytes results in a call to ll_rw_block(READ, 1, &bh), which reads
4096-byte blocks, one by one, into the page cache. This must be done before
the user data can be used to partially update the buffer page concerned in
the cache. After being updated, the buffer is flagged dirty and finally
written to disk (8 sectors of 512 bytes).

I found a nice debugging switch which helps with monitoring the process:

  echo 1 > /proc/sys/vm/block_dump

makes all bio requests get logged as kernel output. Example:

  dd of=/dev/vg/d1 if=/dev/zero bs=4095 count=2 conv=sync

  [ 239.977384] dd(6110): READ block 0 on dm-3
  [ 240.026952] dd(6110): READ block 8 on dm-3
  [ 240.027735] dd(6110): WRITE block 0 on dm-3
  [ 240.027754] dd(6110): WRITE block 8 on dm-3

The ll_rw_block(READ, 1, &bh) call is what causes the reads which can be
seen when monitoring with vmstat.

The tests given below (as you requested) were carried out before I gained a
better understanding of the VFS. The questions I still have are:

1. Why are the partial block updates (through ll_rw_block(READ, 1, &bh)) so
   dramatically slow compared to other reads from the disk?

2. Why was the performance so much better, as I reported earlier, when a
   file system was first mounted on the block device before the disk was
   accessed through the block device file?

If I find some more spare time I will do some more digging in the kernel.
Maybe I will find that the Virtual Filesystem Switch then uses a different
set of f_ops for accessing the raw block device.

>
> BTW, did you run your test with each of the elevators, as I recommended?
> Do the following, testing dd after each change.
> $ echo 128 > /sys/block/sdc/queue/read_ahead_kb

dom0-2:~ # echo deadline > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq
dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     54.0373    19.8704   1024
    1024     54.2937    19.7765   1024
    2048     52.1781    20.5784   1024
    4096     13.751     78.0846   1024
    8192     13.8519    77.5159   1024

dom0-2:~ # echo noop > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq
dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     53.9634    19.8976   1024
    1024     52.0421    20.6322   1024
    2048     54.0437    19.868    1024
    4096     13.9612    76.9088   1024
    8192     13.8183    77.7043   1024

dom0-2:~ # echo cfq > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop deadline [cfq]
dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     56.0087    19.171    1024
    1024     56.345     19.0565   1024
    2048     56.0436    19.159    1024
    4096     15.1232    70.9999   1024
    8192     15.4236    69.6168   1024
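
(The source of the ./bw test program is not included in this mail; roughly
the same sweep can be reproduced with plain dd, e.g. as below. The target
/dev/vg/d1 is the same LV as in the block_dump example above and gets
overwritten, so adjust it to whatever scratch device is available.)

  for bs in 512 1024 2048 4096 8192; do
      echo "bs=$bs"
      # write 1 GiB per run; conv=fdatasync makes dd include the final
      # flush to disk in its own timing and rate report
      dd if=/dev/zero of=/dev/vg/d1 bs=$bs \
         count=$((1073741824 / bs)) conv=fdatasync 2>&1 | tail -n1
  done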
>
> Also, just for fun, and interesting results, increase your read_ahead_kb
> from the default 128 to 512.
>
> $ echo 512 > /sys/block/sdX/queue/read_ahead_kb
> $ echo deadline > /sys/block/sdX/queue/scheduler

dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     54.1023    19.8465   1024
    1024     52.1824    20.5767   1024
    2048     54.3797    19.7453   1024
    4096     13.7252    78.2315   1024
    8192     13.727     78.2211   1024

> $ echo noop > /sys/block/sdX/queue/scheduler

dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     54.0853    19.8527   1024
    1024     54.525     19.6927   1024
    2048     50.6829    21.1855   1024
    4096     14.1272    76.0051   1024
    8192     13.914     77.1701   1024

> $ echo cfq > /sys/block/sdX/queue/scheduler

dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     56.0274    19.1646   1024
    1024     55.7614    19.256    1024
    2048     56.5394    18.991    1024
    4096     16.0562    66.8739   1024
    8192     17.3842    61.7654   1024

The differences between deadline and noop are on the order of 2 to 3% in
favour of deadline. Remarkable is the run with the cfq elevator: it clearly
performs worse, about 20% less (compared to the highest result) in the
512 read_ahead_kb case. Another try with the same settings:

dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     56.8122    18.8999   1024
    1024     56.5486    18.9879   1024
    2048     56.2555    19.0869   1024
    4096     14.886     72.1311   1024
    8192     15.461     69.4486   1024

so it looks like the previous cfq result was at the low end of the
statistical variation.

>
> These changes are volatile so a reboot clears them in the event you're
> unable to change them back to the defaults for any reason. This is
> easily avoidable if you simply cat the files and write down the values
> before changing them. After testing, echo the default values back in.
>

I did some testing on a newer Supermicro system with an AOC-USAS-S4i
Adaptec AACRAID controller, which uses the aacraid driver. This controller
supports RAID0, 1 and 10, but it cannot do JBOD. By configuring it to
present the disks to Linux as four single-disk RAID0 volumes and doing the
striping in software (Linux software RAID0, striped LVM, or LVM on top of
RAID0), we obtained much better performance than with RAID0 managed by the
controller: 300 to 350 MB/s sustained write performance versus about
150 MB/s through the controller. A sketch of this kind of setup is given at
the end of this mail. We use 4 ST32000644NS drives.

Repeating the tests on this system gives similar results, except that the
2 TB drives have about 50% better write performance:

capture4:~ # cat /sys/block/sdc/queue/read_ahead_kb
128
capture4:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq
capture4:~ # ./bw /dev/sdc1
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
    8192      8.5879    125.03    1024
    4096      8.54407   125.671   1024
    2048     65.0727     16.5007  1024

Note the performance drop by a factor of about 8 when halving the bs from
4096 to 2048.

Reading the drive is about 8.8% faster than writing and runs at full speed
for all block sizes:

capture4:~ # ./br /dev/sdc1
Reading 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512      7.86782   136.473   1024
    1024      7.85202   136.747   1024
    2048      7.85979   136.612   1024
    4096      7.86932   136.447   1024
    8192      7.8509    136.767   1024

dd gives similar results:

capture4:~ # dd if=/dev/sdc1 of=/dev/null bs=512 count=2097152
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 7.85281 s, 137 MB/s
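
(The software striping mentioned above was along the following lines; this
is only a sketch, and the device names, chunk/stripe sizes and LV size are
placeholders rather than the exact values we used.)

  # RAID0 across the four single-disk volumes exported by the controller
  mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=64 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde

  # or, alternatively, striping directly in LVM across the four disks
  pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
  vgcreate vgdata /dev/sdb /dev/sdc /dev/sdd /dev/sde
  lvcreate -i 4 -I 64 -L 100G -n d1 vgdata

Dion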