Modern "multimedia" vectorized hardware instructions can speed deflate(). For higher-end x86* CPUs the speedup might be 2% to 3% of total CPU time. On a slower CPU, or with a compiler plus instruction decoder that suffer longer latency after a branch (such as gcc for some PowerPC chips) then the improvement might be 5% to 8%.
The attached patch introduces a new subroutine slide_Pos() in deflate.c which identifies the operation that is subject to optimization. The opportunity arises when sliding the window. The vectors head[] and prev[] of substring indices are adjusted using saturating subtraction. A very good compiler should be able to recognize and vectorize the operation from the patched source. If not, then any compiler which can inline a local subroutine should give code which is no worse than the unmodified version. A compiler which does not inline slide_Pos might introduce a penalty approximately equal to the cost of two internal subroutine calls. If there is interest, then I will follow with assembly-language versions of slide_Pos for i686/x86_64 (with runtime selection among several variants according to actual hardware capabilities), PowerPC altivec (compile-time selection) and ARM neon (compile-time selection.) -- John Reiser, jrei...@bitwagon.com
>From 4939c190a79ea001c87aac736934268c077997ce Mon Sep 17 00:00:00 2001 From: John Reiser <jrei...@bitwagon.com> Date: Mon, 23 Jul 2012 10:25:52 -0700 Subject: [PATCH 2/2] slide_Pos(): identify for future optimization --- deflate.c | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/deflate.c b/deflate.c index 5405f10..d553f83 100644 --- a/deflate.c +++ b/deflate.c @@ -519,6 +519,22 @@ local void check_match(start, match, length) # define check_match(start, match, length) #endif +/* Adjust an array of Pos when the window slides. + * The operation corresponds to saturating subtract (subtract, but + * do not allow the answer to go below zero.) This is supported + * by vectorized "multimedia" instructions of various machines + * such as x86 MMX and SSE2, PowerPC altivec, ARM neon. + */ +local void slide_Pos(ptr, n, slide) + Pos *const ptr; unsigned const n; Pos const slide; +{ + unsigned j; + for (j = 0; j < n; j++) { + Pos const m = ptr[j]; + ptr[j] = (Pos)(m >= slide ? m-slide : NIL); + } +} + /* =========================================================================== * Fill the window when the lookahead becomes insufficient. * Updates strstart and lookahead, and sets eofile if end of input file. @@ -553,17 +569,11 @@ local void fill_window() block_start -= (long) WSIZE; - for (n = 0; n < HASH_SIZE; n++) { - m = head[n]; - head[n] = (Pos)(m >= WSIZE ? m-WSIZE : NIL); - } - for (n = 0; n < WSIZE; n++) { - m = prev[n]; - prev[n] = (Pos)(m >= WSIZE ? m-WSIZE : NIL); + slide_Pos(head, HASH_SIZE, WSIZE); + slide_Pos(prev, WSIZE, WSIZE); /* If n is not on any hash chain, prev[n] is garbage but * its value will never be used. */ - } more += WSIZE; } /* At this point, more >= 2 */ -- 1.7.10.4