On 08/20/2011 12:53 AM, Stan Hoeppner wrote:
> On 8/19/2011 4:38 PM, Dion Kant wrote:
>
>> I now think I understand the "strange" behaviour for block sizes not an
>> integral multiple of 4096 bytes. (Of course you guys already knew the
>> answer but just didn't want to make it easy for me to find the answer.)
>>
>> The newer disks today have a sector size of 4096 bytes. They may still
>> be reporting 512 bytes, but this is to keep some ancient OSes working.
>>
>> When a block write is not an integral multiple of 4096 bytes, for example
>> 512, 4095 or 8191 bytes, the driver must first read the sector, modify
>> it and finally write it back to the disk. This explains the bi and the
>> increased number of interrupts.
>>
>> I did some Google searches but did not find much. Can someone confirm
>> this hypothesis?
>
> The read-modify-write performance penalty of unaligned partitions on the
> "Advanced Format" drives (4KB native sectors) is a separate, unrelated
> issue.
>
> As I demonstrated earlier in this thread, the performance drop seen when
> using dd with block sizes less than 4KB affects traditional 512B/sector
> drives as well. If one has a misaligned partition on an Advanced Format
> drive, one takes a double performance hit when dd bs is less than 4KB.
>
> Again, everything in (x86) Linux is optimized around the 'magic' 4KB
> size, including page size, filesystem block size, and LVM block size.

Ok, I have done some browsing through the kernel sources and I understand
the VFS a bit better now. When a read/write is issued on a block device
file, the block size is 4096 bytes, i.e. reads/writes to the disk are done
in blocks equal to the page cache size: the magic 4KB.
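
(As a quick check of the 4096-byte-sector hypothesis: on recent kernels,
2.6.32 and later, the logical and physical sector sizes a drive reports can
be read straight from sysfs; sdc below is just a placeholder for the drive
under test.)

  # sector size the drive presents to the OS (often still 512)
  cat /sys/block/sdc/queue/logical_block_size
  # native sector size of the medium (4096 on Advanced Format drives)
  cat /sys/block/sdc/queue/physical_block_size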
Submitting a request with a block size which is not an integral multiple of
4096 bytes results in a call to ll_rw_block(READ, 1, &bh), which reads
4096-byte blocks, one by one, into the page cache. This must be done before
the user data can be used to partially update the buffer page concerned in
the cache. After being updated, the buffer is flagged dirty and finally
written to disk (8 sectors of 512 bytes).

I found a nice debugging switch which helps with monitoring the process:

  echo 1 > /proc/sys/vm/block_dump

makes all bio requests get logged as kernel output. Example:

  dd of=/dev/vg/d1 if=/dev/zero bs=4095 count=2 conv=sync

  [ 239.977384] dd(6110): READ block 0 on dm-3
  [ 240.026952] dd(6110): READ block 8 on dm-3
  [ 240.027735] dd(6110): WRITE block 0 on dm-3
  [ 240.027754] dd(6110): WRITE block 8 on dm-3

The ll_rw_block(READ, 1, &bh) call is what causes the reads which can be
seen when monitoring with vmstat.

The tests given below (as you requested) were carried out before I gained a
better understanding of the VFS. The questions I still have are:

1. Why are the partial block updates (through ll_rw_block(READ, 1, &bh)) so
   dramatically slow compared to other reads from the disk?

2. Why was the performance so much better, as I reported earlier, when a
   file system was first mounted on the block device before the disk was
   accessed through the block device file?

If I find some more spare time I will do some more digging in the kernel.
Maybe I will find that the Virtual Filesystem Switch then uses a different
set of f_ops for accessing the raw block device.

>
> BTW, did you run your test with each of the elevators, as I recommended?
> Do the following, testing dd after each change.
> $ echo 128 > /sys/block/sdc/queue/read_ahead_kb

dom0-2:~ # echo deadline > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq
dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     54.0373    19.8704   1024
    1024     54.2937    19.7765   1024
    2048     52.1781    20.5784   1024
    4096     13.751     78.0846   1024
    8192     13.8519    77.5159   1024

dom0-2:~ # echo noop > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq
dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     53.9634    19.8976   1024
    1024     52.0421    20.6322   1024
    2048     54.0437    19.868    1024
    4096     13.9612    76.9088   1024
    8192     13.8183    77.7043   1024

dom0-2:~ # echo cfq > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop deadline [cfq]
dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     56.0087    19.171    1024
    1024     56.345     19.0565   1024
    2048     56.0436    19.159    1024
    4096     15.1232    70.9999   1024
    8192     15.4236    69.6168   1024
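
(The source of the ./bw test program is not included in this mail; roughly
the same sweep can be reproduced with plain dd, e.g. as below. The target
/dev/vg/d1 is the same LV as in the block_dump example above and gets
overwritten, so adjust it to whatever scratch device is available.)

  for bs in 512 1024 2048 4096 8192; do
      echo "bs=$bs"
      # write 1 GiB per run; conv=fdatasync makes dd include the final
      # flush to disk in its own timing and rate report
      dd if=/dev/zero of=/dev/vg/d1 bs=$bs \
         count=$((1073741824 / bs)) conv=fdatasync 2>&1 | tail -n1
  done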
>
> Also, just for fun, and interesting results, increase your read_ahead_kb
> from the default 128 to 512.
>
> $ echo 512 > /sys/block/sdX/queue/read_ahead_kb
> $ echo deadline > /sys/block/sdX/queue/scheduler

dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     54.1023    19.8465   1024
    1024     52.1824    20.5767   1024
    2048     54.3797    19.7453   1024
    4096     13.7252    78.2315   1024
    8192     13.727     78.2211   1024

> $ echo noop > /sys/block/sdX/queue/scheduler

dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     54.0853    19.8527   1024
    1024     54.525     19.6927   1024
    2048     50.6829    21.1855   1024
    4096     14.1272    76.0051   1024
    8192     13.914     77.1701   1024

> $ echo cfq > /sys/block/sdX/queue/scheduler

dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     56.0274    19.1646   1024
    1024     55.7614    19.256    1024
    2048     56.5394    18.991    1024
    4096     16.0562    66.8739   1024
    8192     17.3842    61.7654   1024

The differences between deadline and noop are on the order of 2 to 3% in
favour of deadline. Remarkable is the run with the cfq elevator: it clearly
performs worse, about 20% less (compared to the highest result) in the
512 read_ahead_kb case. Another try with the same settings:

dom0-2:~ # ./bw
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512     56.8122    18.8999   1024
    1024     56.5486    18.9879   1024
    2048     56.2555    19.0869   1024
    4096     14.886     72.1311   1024
    8192     15.461     69.4486   1024

so it looks like the previous cfq result was at the low end of the
statistical variation.

>
> These changes are volatile so a reboot clears them in the event you're
> unable to change them back to the defaults for any reason. This is
> easily avoidable if you simply cat the files and write down the values
> before changing them. After testing, echo the default values back in.
>

I did some testing on a newer Supermicro system with an AOC-USAS-S4i
Adaptec AACRAID controller, which uses the aacraid driver. This controller
supports RAID0, 1 and 10, but it cannot do JBOD. By configuring it to
present the disks to Linux as four single-disk RAID0 volumes and doing the
striping in software (Linux software RAID0, striped LVM, or LVM on top of
RAID0), we obtained much better performance than with RAID0 managed by the
controller: 300 to 350 MB/s sustained write performance versus about
150 MB/s through the controller. A sketch of this kind of setup is given at
the end of this mail. We use 4 ST32000644NS drives.

Repeating the tests on this system gives similar results, except that the
2 TB drives have about 50% better write performance:

capture4:~ # cat /sys/block/sdc/queue/read_ahead_kb
128
capture4:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq
capture4:~ # ./bw /dev/sdc1
Writing 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
    8192      8.5879    125.03    1024
    4096      8.54407   125.671   1024
    2048     65.0727     16.5007  1024

Note the performance drop by a factor of about 8 when halving the bs from
4096 to 2048.

Reading the drive is about 8.8% faster than writing and runs at full speed
for all block sizes:

capture4:~ # ./br /dev/sdc1
Reading 1 GB
     bs       time       rate
  (bytes)      (s)      (MiB/s)
     512      7.86782   136.473   1024
    1024      7.85202   136.747   1024
    2048      7.85979   136.612   1024
    4096      7.86932   136.447   1024
    8192      7.8509    136.767   1024

dd gives similar results:

capture4:~ # dd if=/dev/sdc1 of=/dev/null bs=512 count=2097152
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 7.85281 s, 137 MB/s
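
(The software striping mentioned above was along the following lines; this
is only a sketch, and the device names, chunk/stripe sizes and LV size are
placeholders rather than the exact values we used.)

  # RAID0 across the four single-disk volumes exported by the controller
  mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=64 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde

  # or, alternatively, striping directly in LVM across the four disks
  pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
  vgcreate vgdata /dev/sdb /dev/sdc /dev/sdd /dev/sde
  lvcreate -i 4 -I 64 -L 100G -n d1 vgdata

Dion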