Modern "multimedia" vectorized hardware instructions can speed deflate().
For higher-end x86* CPUs the speedup might be 2% to 3% of total CPU time.
On a slower CPU, or with a compiler plus instruction decoder that suffer
longer latency after a branch (such as gcc for some PowerPC chips)
then the improvement might be 5% to 8%.

The attached patch introduces a new subroutine slide_Pos() in deflate.c
which identifies the operation that is subject to optimization.
The opportunity arises when sliding the window.  The vectors head[]
and prev[] of substring indices are adjusted using saturating subtraction.
A very good compiler should be able to recognize and vectorize the operation
from the patched source.  If not, then any compiler which can inline a local
subroutine should give code which is no worse than the unmodified version.
A compiler which does not inline slide_Pos might introduce a penalty
approximately equal to the cost of two internal subroutine calls.

If there is interest, then I will follow with assembly-language versions
of slide_Pos for i686/x86_64 (with runtime selection among several variants
according to actual hardware capabilities), PowerPC altivec (compile-time
selection) and ARM neon (compile-time selection.)

-- 
John Reiser, jrei...@bitwagon.com
>From 4939c190a79ea001c87aac736934268c077997ce Mon Sep 17 00:00:00 2001
From: John Reiser <jrei...@bitwagon.com>
Date: Mon, 23 Jul 2012 10:25:52 -0700
Subject: [PATCH 2/2] slide_Pos(): identify for future optimization

---
 deflate.c |   26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/deflate.c b/deflate.c
index 5405f10..d553f83 100644
--- a/deflate.c
+++ b/deflate.c
@@ -519,6 +519,22 @@ local void check_match(start, match, length)
 #  define check_match(start, match, length)
 #endif
 
+/* Adjust an array of Pos when the window slides.
+ * The operation corresponds to saturating subtract (subtract, but
+ * do not allow the answer to go below zero.)  This is supported
+ * by vectorized "multimedia" instructions of various machines
+ * such as x86 MMX and SSE2, PowerPC altivec, ARM neon.
+ */
+local void slide_Pos(ptr, n, slide)
+    Pos *const ptr; unsigned const n; Pos const slide;
+{
+    unsigned j;
+    for (j = 0; j < n; j++) {
+        Pos const m = ptr[j];
+        ptr[j] = (Pos)(m >= slide ? m-slide : NIL);
+    }
+}
+
 /* ===========================================================================
  * Fill the window when the lookahead becomes insufficient.
  * Updates strstart and lookahead, and sets eofile if end of input file.
@@ -553,17 +569,11 @@ local void fill_window()
 
         block_start -= (long) WSIZE;
 
-        for (n = 0; n < HASH_SIZE; n++) {
-            m = head[n];
-            head[n] = (Pos)(m >= WSIZE ? m-WSIZE : NIL);
-        }
-        for (n = 0; n < WSIZE; n++) {
-            m = prev[n];
-            prev[n] = (Pos)(m >= WSIZE ? m-WSIZE : NIL);
+        slide_Pos(head, HASH_SIZE, WSIZE);
+        slide_Pos(prev,     WSIZE, WSIZE);
             /* If n is not on any hash chain, prev[n] is garbage but
              * its value will never be used.
              */
-        }
         more += WSIZE;
     }
     /* At this point, more >= 2 */
-- 
1.7.10.4

Reply via email to