gzip's i386 assembly code, enabled by default in the FreeBSD source tree,
performs poorly on i686 cores (PPro/P2/P3). The cause is the 'partial
register stall' problem, explained on a page recently brought up on the
list: http://www.emulators.com/pentium4.htm

In the course of learning more about partial register stalls I came across
the following i686 and i586 assembly optimizations for gzip:
http://www.muppetlabs.com/~breadbox/software/assembly.html.

This optimized i686 asm avoids partial register stalls and is between 20%
and 40% faster, with higher compression levels benefiting more from the
patch. The i586 patch is usually only 5% faster, but in some cases achieves
a 25% speedup.

For completeness, I also ran some tests on a non-asm gcc 2.95.2 compile,
with and without -march=pentiumpro. Here are the results of compressing
linux-2.4.2.tar with --best on a Pentium II 400 (three runs averaged, with
caches warmed by a few throwaway runs first).

                       [type]  [user secs]     [time (as % of slowest)]
                     i386 asm:    175               100%
                   no asm, -O:    142              81.1%
                  no asm, -O2:    139              79.4%
 no asm, -O -march=pentiumpro:    136              77.7%
no asm, -O2 -march=pentiumpro:    140              80.0%
                     i686 asm:    124              70.8%
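
For anyone who wants to reproduce measurements like the ones above, here is
a rough sketch of how the timing and the percentage column could be
scripted. The helper names (avg, pct_of_slowest, user_secs) are made up for
this example, and user_secs assumes a `time -p` that prints a "user N" line
to stderr. It is not the exact procedure used for the table.

```sh
#!/bin/sh
# Sketch only: avg, pct_of_slowest and user_secs are hypothetical helpers.

# Average the numbers given as arguments (e.g. user seconds from three runs).
avg() {
    printf '%s\n' "$@" |
        awk '{ s += $1 } END { if (NR) printf "%.1f\n", s / NR }'
}

# Express a time as a percentage of the slowest time, to one decimal place.
pct_of_slowest() {
    awk -v t="$1" -v slow="$2" 'BEGIN { printf "%.1f\n", 100 * t / slow }'
}

# Run one compression with the given gzip binary and print its user seconds.
# Assumption: `time -p` is available and writes "user N" to stderr.
user_secs() {
    { time -p "$1" --best -c "$2" > /dev/null; } 2>&1 |
        awk '$1 == "user" { print $2 }'
}
```

For example, `pct_of_slowest 142 175` prints 81.1, matching the "no asm, -O"
row. A run against a particular build would look something like
`user_secs ./gzip linux-2.4.2.tar`, repeated three times after a throwaway
run, with the three results passed to avg.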

I'm interested in other people's results/tests. In particular, I still need
to do some runs with -mcpu=pentiumpro.

An important part of the equation is to make sure it doesn't hurt i586
machines. I did several tests on a Pentium 200MMX; the i386 asm and the
gcc-emitted asm are not measurably different on that CPU.

Brian Raiter ([EMAIL PROTECTED], author of the i586/i686 asm patches)
has contacted the gzip maintainers, but it's been years since a release and
there may not be another gzip release. I have seen a 1.2.4a release which
had his files in a contrib/ directory, but they were not active in any way.

Since I would imagine a large percentage of FreeBSD users run on i686
cores, it'd be great to get this pretty significant speed increase into our
tree.

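The i686 patch is neat (roughly 30% faster than the i386 asm!) but its
improvement over gcc's emitted assembly is small: about 9% less user time
than the best non-asm build in the table above. Most of the gain comes from
simply dropping the old i386 assembly, so disabling it seems a good first
step. Attached is a patch that disables the custom asm.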

I'm interested in hearing everyone's comments.

Aaron
Index: Makefile
===================================================================
RCS file: /usr/cvs/src/gnu/usr.bin/gzip/Makefile,v
retrieving revision 1.21
diff -u -r1.21 Makefile
--- Makefile    1999/08/27 23:35:48     1.21
+++ Makefile    2001/03/20 23:59:48
@@ -8,11 +8,6 @@
 CFLAGS+=-DSTDC_HEADERS=1 -DHAVE_UNISTD_H=1 -DDIRENT=1
 GREP_LIBZ?=    YES
 
-.if ${MACHINE_ARCH} == "i386"
-SRCS+= match.S
-CFLAGS+=-DASMV
-.endif
-
 MLINKS= gzip.1 gunzip.1  gzip.1 zcat.1  gzip.1 gzcat.1
 MLINKS+= zdiff.1 zcmp.1
 
