On Sat, 23 Dec 2006, I wrote:
The problem becomes smaller as the read block size approaches the file
system block size and vanishes when the sizes are identical. Then
there is apparently a different (smaller) problem:
Read size 16K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 1.15 seconds
  %  cumulative     self                 self    total
 time    seconds   seconds     calls  ns/call  ns/call  name
 49.1      0.565     0.565     25643    22037    22037  copyout [11]
 12.6      0.710     0.145         0  100.00%           mcount [14]
  8.8      0.811     0.101     87831     1153     1153  vm_page_splay [17]
  7.0      0.892     0.081    112906      715      715  buf_splay [19]
  6.1      0.962     0.070         0  100.00%           mexitcount [20]
  3.4      1.000     0.039         0  100.00%           cputime [22]
  1.2      1.013     0.013     86883      153      181  vm_page_unwire [28]
  1.1      1.027     0.013         0  100.00%           user [29]
  1.1      1.040     0.013     21852      595     3725  getnewbuf [18]
%%%
Read size 16K, sequential:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 0.96 seconds
  %  cumulative     self                 self    total
 time    seconds   seconds     calls  ns/call  ns/call  name
 57.1      0.550     0.550     25643    21464    21464  copyout [11]
 14.2      0.687     0.137         0  100.00%           mcount [12]
  6.9      0.754     0.066         0  100.00%           mexitcount [15]
  4.2      0.794     0.040    102830      391      391  vm_page_splay [19]
  3.8      0.830     0.037         0  100.00%           cputime [20]
  1.4      0.844     0.013    102588      130      130  buf_splay [22]
  1.3      0.856     0.012     25603      488     1920  getnewbuf [17]
  1.0      0.866     0.009     25606      368      368  pmap_qremove [24]
%%%
Now the splay routines are called almost the same number of times, but
take much longer in the random case.  buf_splay() seems to be unrelated
to vm -- it is called from gbincore() even if the buffer is already in
the buffer cache.  It seems quite slow for that -- almost 1 us just to
look up a buffer, compared with 21 us to copyout a 16K buffer.
Linux-sized buffers would take only 1.5 us to copy out, so 1 us to look
them up is clearly too much.  Another benchmark shows gbincore() taking
501 ns each to look up 64 in-buffer-cache buffers for a 1MB file -- this
must be the best case for it (all these times are for -current on an
Athlon XP2700 overclocked to 2025MHz).  The generic hash function used
in my compiler takes 40 ns to hash a 16-byte string on this machine.
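For reference, here is a minimal sketch of a top-down splay lookup of
the general kind that buf_splay() performs (simplified userland types;
an illustration of the technique, not the kernel source).  Note that
even a successful lookup restructures the tree: nodes on the search
path are rotated and relinked, so a random stream of lookups in a large
tree touches and dirties many cache lines per call.
%%%
#include <sys/types.h>
#include <stddef.h>

struct bnode {
	daddr_t		key;		/* logical block number */
	struct bnode	*left, *right;
};

/*
 * Splay the node with the given key (or the last node on the search
 * path) to the root and return the new root; the caller checks whether
 * the returned root actually has the wanted key.
 */
static struct bnode *
splay_lookup(daddr_t key, struct bnode *root)
{
	struct bnode dummy, *l, *r, *y;

	if (root == NULL)
		return (NULL);
	dummy.left = dummy.right = NULL;
	l = r = &dummy;
	for (;;) {
		if (key < root->key) {
			if (root->left == NULL)
				break;
			if (key < root->left->key) {
				y = root->left;		/* rotate right */
				root->left = y->right;
				y->right = root;
				root = y;
				if (root->left == NULL)
					break;
			}
			r->left = root;			/* link right */
			r = root;
			root = root->left;
		} else if (key > root->key) {
			if (root->right == NULL)
				break;
			if (key > root->right->key) {
				y = root->right;	/* rotate left */
				root->right = y->left;
				y->left = root;
				root = y;
				if (root->right == NULL)
					break;
			}
			l->right = root;		/* link left */
			l = root;
			root = root->right;
		} else
			break;
	}
	l->right = root->left;				/* reassemble */
	r->left = root->right;
	root->left = dummy.right;
	root->right = dummy.left;
	return (root);
}
%%%
Sequential lookups keep hitting nodes that are near the root and already
in cache, which fits buf_splay() costing 130 ns/call in the sequential
case but 715 ns/call in the random case above.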
FreeBSD-~4.10 is faster. The difference is especially noticeable when
the read size is the same as the fs block size (16K, as above). Then
I get the following speeds:
~4.10, random:      580 MB/s
~4.10, sequential:  580 MB/s
~5.2,  random:      575 MB/s
~5.2,  sequential:  466 MB/s
All with kernel profiling not configured, and no INVARIANTS etc.
~5.2 is quite different from -current, but it has buf_splay() and
vm_page_splay(), and behaves similarly in this benchmark.
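A minimal sketch of this kind of benchmark (illustrative only -- the
file name, argument handling and sizes here are arbitrary): read a file
that fits in the buffer cache with a fixed read size, either
sequentially or at random block-aligned offsets, and report the
throughput.
%%%
#include <sys/time.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	struct timeval t0, t1;
	const char *path;
	char *buf;
	off_t fsize, off;
	double dt;
	size_t rsize, total;
	int fd, i, nreads, rnd;

	path = argc > 1 ? argv[1] : "testfile";
	rsize = argc > 2 ? (size_t)atoi(argv[2]) : 16384;
	rnd = argc > 3 && strcmp(argv[3], "random") == 0;

	if ((buf = malloc(rsize)) == NULL || (fd = open(path, O_RDONLY)) == -1)
		err(1, "setup");
	fsize = lseek(fd, 0, SEEK_END);
	nreads = fsize / rsize;

	total = 0;
	gettimeofday(&t0, NULL);
	for (i = 0; i < nreads; i++) {
		/* block-aligned offset: random or sequential */
		off = (rnd ? random() % nreads : i) * (off_t)rsize;
		total += pread(fd, buf, rsize, off);
	}
	gettimeofday(&t1, NULL);
	dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%.0f MB/s\n", total / dt / (1024 * 1024));
	return (0);
}
%%%
Run it once to warm the cache and time a second pass, so that only
copyout and the buffer-cache lookups are being measured.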
With profiling ~4.10, read size 16K, sequential (values in parentheses
are for the random case):
%%%
  %  cumulative     self                 self    total
 time    seconds   seconds     calls  ns/call  ns/call  name
 51.1      0.547     0.547     25643    21323    21323  generic_copyout [9]
 17.3      0.732     0.185         0  100.00%           mcount [10]
  7.9      0.817     0.085         0  100.00%           mexitcount [13]
  5.0      0.870     0.053         0  100.00%           cputime [16]
  1.9      0.891     0.020     51207      395      395  gbincore [20]
                                             (424 for random)
  1.4      0.906     0.015    102418      150      253  vm_page_wire [18]
                                             (322)
  1.3      0.920     0.014    231218       62       62  splvm [23]
  1.3      0.934     0.014     25603      541     2092  allocbuf [15]
                                             (2642)
  1.0      0.945     0.010    566947       18       18  splx <cycle 1> [25]
  1.0      0.955     0.010    102122      100      181  vm_page_unwire [21]
  0.9      0.964     0.009     25606      370      370  pmap_qremove [27]
  0.9      0.973     0.009     25603      359     2127  getnewbuf [14]
                                             (2261)
%%%
There is little difference for the sequential case, but the old gbincore()
and buffer allocation routines are much faster for the random case.
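The old gbincore() did a hash-table lookup rather than a splay; roughly
the following shape (the names and the hash mix here are only
illustrative, not the 4.x source).  A lookup is one bucket probe plus a
short chain walk and rewrites no pointers, so random and sequential
access patterns cost about the same.
%%%
#include <sys/types.h>
#include <sys/queue.h>

#include <stdint.h>

struct vnode;				/* only used as part of the key here */

struct bufent {
	LIST_ENTRY(bufent)	b_hash;		/* hash chain linkage */
	struct vnode		*b_vp;		/* vnode owning the buffer */
	daddr_t			b_lblkno;	/* logical block number */
};
LIST_HEAD(bufhashhdr, bufent);

static struct bufhashhdr *bufhashtbl;		/* power-of-2 sized table */
static u_long bufhashmask;			/* table size - 1 */

static struct bufhashhdr *
bufhash(struct vnode *vp, daddr_t bn)
{
	/* Illustrative mix: fold the vnode pointer into the block number. */
	return (&bufhashtbl[(((uintptr_t)vp >> 7) + (u_long)bn) & bufhashmask]);
}

static struct bufent *
incore_lookup(struct vnode *vp, daddr_t blkno)
{
	struct bufent *bp;

	LIST_FOREACH(bp, bufhash(vp, blkno), b_hash)
		if (bp->b_vp == vp && bp->b_lblkno == blkno)
			return (bp);
	return (NULL);
}
%%%
With a reasonably sized table the chains stay short, which is consistent
with gbincore() costing about 400 ns/call for both the sequential and
the random cases in ~4.10.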
With profiling ~4.10, read size 4K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.63 seconds
  %  cumulative     self                 self    total
 time    seconds   seconds     calls  ns/call  ns/call  name
 27.3      0.720     0.720         0  100.00%           mcount [8]
 22.5      1.312     0.592    102436     5784     5784  generic_copyout [10]
 12.6      1.643     0.331         0  100.00%           mexitcount [13]
  7.9      1.850     0.207         0  100.00%           cputime [15]
  2.9      1.926     0.076    189410      402      402  gbincore [20]
  2.3      1.988     0.061    348029      176      292  vm_page_wire [18]
  2.2      2.045     0.058     87010      662     2500  allocbuf [14]
  2.0      2.099     0.053    783280       68       68  splvm [22]
  1.6      2.142     0.043         0   99.33%           user [24]
  1.6      2.184     0.042   2041759       20       20  splx <cycle 3> [26]
  1.3      2.217     0.034    347298       97      186  vm_page_unwire [21]
  1.2      2.249     0.032     86895      370      370  pmap_qremove [28]
  1.1      2.279     0.029     87006      337     2144  getnewbuf [16]
  0.9      2.303     0.024     86891      280     1617  vfs_vmio_release [17]
%%%
Now the result is little different from -current -- the random case is
almost as slow as in -current according to the total time, although this
may be an artifact of profiling (allocbuf takes 2500 ns total in ~4.10
vs 4025 ns in -current).
Bruce