On Mon, 30 May 2011 m...@freebsd.org wrote:
> On Mon, May 30, 2011 at 8:25 AM, Bruce Evans <b...@optusnet.com.au> wrote:
>> On Sat, 28 May 2011 m...@freebsd.org wrote:
>>> ...
>>> Meanwhile you could try setting ZERO_REGION_SIZE to PAGE_SIZE and I
>>> think that will restore things to the original performance.
>> Using /dev/zero always thrashes caches by the amount <source buffer
>> size> + <target buffer size> (unless the arch uses nontemporal memory
>> accesses for uiomove, which none do AFAIK).  So a large source buffer
>> is always just a pessimization.  A large target buffer size is also a
>> pessimization, but for the target buffer a fairly large size is needed
>> to amortize the large syscall costs.  In this PR, the target buffer
>> size is 64K.  ZERO_REGION_SIZE is 64K on i386 and 2M on amd64.  64K+64K
>> on i386 is good for thrashing the L1 cache.
> That depends -- is the cache virtually or physically addressed? The
> zero_region only has 4k (PAGE_SIZE) of unique physical addresses. So
> most of the cache thrashing is due to the user-space buffer, if the
> cache is physically addressed.
Oops.  I now remember thinking that the much larger source buffer would
be OK since it only uses 1 physical page.  But the cache is apparently
virtually addressed.
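
For reference, the way only 1 physical page gets used is by wiring a
single zeroed page and entering it at every virtual page of the region.
An untested sketch of the idea (function names and flags are from
memory, so treat them as assumptions rather than a quote of the actual
code in the tree):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <vm/vm.h>
    #include <vm/pmap.h>
    #include <vm/vm_extern.h>
    #include <vm/vm_kern.h>
    #include <vm/vm_page.h>

    /*
     * Sketch: back ZERO_REGION_SIZE bytes of KVA with one physical
     * page.  Every virtual page maps the same zeroed page, so the
     * region costs ZERO_REGION_SIZE of virtual space but only
     * PAGE_SIZE of physical (and physically-indexed cache) footprint.
     */
    static void
    init_zero_region_sketch(void)
    {
    	vm_offset_t addr, i;
    	vm_page_t m;

    	/* One wired, pre-zeroed page with no backing object. */
    	m = vm_page_alloc(NULL, 0,
    	    VM_ALLOC_NOOBJ | VM_ALLOC_WIRED | VM_ALLOC_ZERO);
    	if ((m->flags & PG_ZERO) == 0)
    		pmap_zero_page(m);

    	/* KVA for the region; enter the same page at every offset. */
    	addr = kmem_alloc_nofault(kernel_map, ZERO_REGION_SIZE);
    	for (i = 0; i < ZERO_REGION_SIZE; i += PAGE_SIZE)
    		pmap_qenter(addr + i, &m, 1);

    	zero_region = (const void *)addr;
    }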
>> It will only have a
>> noticeable impact on a current L2 cache in competition with other
>> threads.  It is hard to fit everything in the L1 cache even with
>> non-bloated buffer sizes and 1 thread (16 for the source (I)cache, 0
>> for the source (D)cache and 4K for the target cache might work).  On
>> amd64, 2M+2M is good for thrashing most L2 caches.  In this PR, the
>> thrashing is limited by the target buffer size to about 64K+64K, up
>> from 4K+64K, and it is marginal whether the extra thrashing from the
>> larger source buffer makes much difference.
>>
>> The old zbuf source buffer size of PAGE_SIZE was already too large.
> Wouldn't this depend on how far down from the use of the buffer the
> actual copy happens? Another advantage to a large virtual buffer is
> that it reduces the number of times the copy loop in uiomove has to
> return up to the device layer that initiated the copy. This is all
> pretty fast, but again assuming a physical cache, fewer trips are
> better.
Yes, I had forgotten that I have to keep going back to the uiomove()
level for each iteration.  That's a lot of overhead, although not nearly
as much as going back to the user level.  If this is actually important
to optimize, then I might add a repeat count to uiomove() and copyout()
(actually a different function for the latter).
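
Something like this hypothetical wrapper is the interface I mean (no
such function exists, and as written it only saves the driver-level
loop -- the interesting version would push the loop below the per-call
setup inside uiomove() itself):

    /*
     * Hypothetical uiomove_repeat(): feed the same kernel buffer into
     * the uio up to 'count' times per call, so e.g. a /dev/zero read
     * routine makes 1 call per read(2) instead of 1 call per
     * ZERO_REGION_SIZE chunk.
     */
    int
    uiomove_repeat(void *cp, int n, int count, struct uio *uio)
    {
    	int error;

    	while (count-- > 0 && uio->uio_resid > 0) {
    		error = uiomove(cp, n, uio);
    		if (error != 0)
    			return (error);
    	}
    	return (0);
    }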
linux-2.6.10 uses an mmapped /dev/zero and has had this since Y2K
according to its comment.  Sigh.  You will never beat that by copying,
but I think mmapping /dev/zero is only a big win for silly
benchmarks.
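
For comparison, the mmap version of the silly benchmark is just this
(userland; assumes a Linux-style mmappable /dev/zero -- MAP_ANON is the
portable spelling of the same thing):

    #include <sys/mman.h>

    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * "Read" 64K of zeros by mapping /dev/zero.  The kernel hands out
     * demand-zero pages, so the data is never copied; a benchmark that
     * only reads sees page-fault and page-table work instead of
     * copyout work.
     */
    int
    main(void)
    {
    	size_t len = 64 * 1024;		/* the PR's target buffer size */
    	volatile char *p;
    	char sum = 0;
    	size_t i;
    	int fd;

    	fd = open("/dev/zero", O_RDONLY);
    	if (fd == -1)
    		err(1, "open");
    	p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    	if (p == MAP_FAILED)
    		err(1, "mmap");
    	for (i = 0; i < len; i++)	/* touch every byte */
    		sum += p[i];
    	printf("%d\n", sum);
    	munmap((void *)p, len);
    	close(fd);
    	return (0);
    }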
linux-2.6.10 also has a seekable /dev/zero.  Seeks don't really work,
but some of them "succeed" and keep the offset at 0.  ISTR
a FreeBSD PR about the file offset for /dev/zero not "working" because
it is garbage instead of 0.  It is clearly a Linuxism to depend on it
being 0.  IIRC, the file offset for device files is at best
implementation-defined in POSIX.
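
A trivial probe shows what any given implementation does with the
offset (implementation-defined per POSIX, so the output proves nothing
portable):

    #include <err.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Read from /dev/zero, then ask where the file offset ended up. */
    int
    main(void)
    {
    	char buf[16];
    	off_t off;
    	int fd;

    	fd = open("/dev/zero", O_RDONLY);
    	if (fd == -1)
    		err(1, "open");
    	if (read(fd, buf, sizeof(buf)) == -1)
    		err(1, "read");
    	off = lseek(fd, 0, SEEK_CUR);
    	printf("offset after a 16-byte read: %jd\n", (intmax_t)off);
    	close(fd);
    	return (0);
    }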
Bruce