The following mail arrived at debbugs.gnu.org via bcc, so could not be associated with a package. I guess it was meant for grep, so I have reassigned it and am sending this mail which will appear on the grep mailing list.
sur-behoffski wrote: > G'day, > > I've been working on an "exploratory fork" of GNU Grep, in order > to provide a framework where I can set up a pared-down program > that provides support for some simple search cases, but does not > have the daunting complexity and (to some extent) extensive > legacy support baggage associated with GNU Grep. > > This has resulted in a new project, "hstbm", which is an acronym > for Hybrid Self-Tuning Boyer-Moore. > > The Savannah folks have kindly agreed to host hstbm as a non-GNU > project at: > > https://savannah.nongnu.org/projects/hstbm > > The release can be downloaded as a separate tarball from: > > http://git.savannah.gnu.org/cgit/hstbm.git/snapshot/hstbm-0.13.tar.gz > > Git users probably want to clone the repository instead; the > latest sources (which may include changes after the 0.13 release) > can be retrieved via "git clone": > > git clone git://git.savannah.nongnu.org/hstbm.git > > ------------- > > hstbm is a partner to the "untangle" script that I've been posting > to the GNU Grep over time; whereas "untangle" tries to lay the > foundations for a "bottom-up" refactoring of GNU Grep code, hstbm > is a "top-down" approach, reducing the layers of code between the > caller, pattern interpretation (lexing and/or parsing), and the > underlying search algorithm -- which is also called hstbm. > > Only a small fraction of regular expression syntax -- classes -- > is recognised, and the class support is competent but not > bullet-proof. It's best to think of the current level of pattern > support as "fgrep (or grep -F) plus some class support". This > level of support is enough to fulfil the needs of the string > search algorithm included in this release; more support, such as > anchors, optional elements, repetition, backreferences etc. are > not included for now (although there's a nod in one or two places > that these could included in the future). > > ------------- > > The hstbm algorithm itself is competitive with the equivalent > algorithm in GNU Grep (culminating in bmexec () in src/kwset.c). > In my limited testing, choosing data and patterns that are > somewhat biased towards showing hstbm's strong points, it can > outperform GNU Grep. I'm hopeful that these improvements can > hold up over a very wide range of data and patterns, with very > few or no performance regressions, such that it might be > worthwhile incorporating hstbm into GNU Grep: > > ["repl" is a Lua script to wrap (roughly) the following shell > script, run it multiple times (throwing the first iteration > away), and log the results, as well as outputting them: > > Cmd=$* > time (let j=1 \ > while [ $((j++)) -lt 1000 ] ; do \ > $Cmd >/dev/null \ > done \ > ) 2>&1 > ] > > -------------- > > $ grep --version | head 1 > grep (GNU grep) 2.21 > $ ./h --version | head -1 > hstbm (Hybrid Self-Tuning Boyer-Moore) 0.13 > > > $ cat /usr/src/linux-3.14.32-gentoo/*/*.[ch] >csrc > $ wc csrc > 420778 1348027 11521841 csrc > > > $ ./repl grep -c "123[abc]456" csrc > Cmd: "grep" "-c" "123[abc]456" "csrc" > 0 > real 0m7.268s user 0m3.140s sys 0m3.767s > real 0m7.299s user 0m3.317s sys 0m3.590s > real 0m7.298s user 0m3.253s sys 0m3.657s > real 0m7.245s user 0m3.233s sys 0m3.673s > > $ ./repl ./h -c "123[abc]456" csrc > Cmd: "./h" "-c" "123[abc]456" "csrc" > 0 > real 0m6.570s user 0m2.237s sys 0m1.327s > real 0m6.586s user 0m2.200s sys 0m1.383s > real 0m6.647s user 0m2.317s sys 0m1.367s > real 0m6.598s user 0m2.247s sys 0m1.367s > > $ ./repl grep -ci "123[abc]456" csrc > Cmd: "grep" "-ci" "123[abc]456" "csrc" > 0 > real 0m7.307s user 0m3.163s sys 0m3.743s > real 0m7.269s user 0m3.373s sys 0m3.533s > real 0m7.266s user 0m3.273s sys 0m3.633s > real 0m7.266s user 0m3.357s sys 0m3.553s > > $ ./repl ./h -ci "123[abc]456" csrc > Cmd: "./h" "-ci" "123[abc]456" "csrc" > 0 > real 0m6.582s user 0m2.250s sys 0m1.313s > real 0m6.589s user 0m2.287s sys 0m1.293s > real 0m6.629s user 0m2.197s sys 0m1.370s > real 0m6.637s user 0m2.217s sys 0m1.357s > > $ ./repl grep -c "123456[abc]" csrc > Cmd: "grep" "-c" "123456[abc]" "csrc" > 0 > real 0m6.652s user 0m2.170s sys 0m1.407s > real 0m6.659s user 0m2.257s sys 0m1.330s > real 0m6.662s user 0m2.130s sys 0m1.450s > real 0m6.658s user 0m2.157s sys 0m1.430s > > $ ./repl ./h -c "123456[abc]" csrc > Cmd: "./h" "-c" "123456[abc]" "csrc" > 0 > real 0m6.584s user 0m2.227s sys 0m1.337s > real 0m6.546s user 0m2.210s sys 0m1.360s > real 0m6.546s user 0m2.180s sys 0m1.393s > real 0m6.544s user 0m2.210s sys 0m1.363s > > $ ./repl grep -ci "123456[abc]" csrc > Cmd: "grep" "-ci" "123456[abc]" "csrc" > 0 > real 0m6.676s user 0m2.177s sys 0m1.407s > real 0m6.681s user 0m2.060s sys 0m1.537s > real 0m6.671s user 0m2.250s sys 0m1.337s > real 0m6.683s user 0m2.090s sys 0m1.497s > > $ ./repl ./h -ci "123456[abc]" csrc > Cmd: "./h" "-ci" "123456[abc]" "csrc" > 0 > real 0m6.644s user 0m2.333s sys 0m1.233s > real 0m6.689s user 0m2.113s sys 0m1.457s > real 0m6.602s user 0m2.187s sys 0m1.383s > real 0m6.604s user 0m2.220s sys 0m1.353s > > $ ./repl grep -c "123456a" csrc > Cmd: "grep" "-c" "123456a" "csrc" > 0 > real 0m12.035s user 0m7.743s sys 0m2.490s > real 0m12.024s user 0m7.763s sys 0m2.473s > real 0m12.019s user 0m7.820s sys 0m2.413s > real 0m12.017s user 0m7.717s sys 0m2.517s > > $ ./repl ./h -c "123456a" csrc > Cmd: "./h" "-c" "123456a" "csrc" > 0 > real 0m6.266s user 0m1.923s sys 0m1.640s > real 0m6.273s user 0m1.980s sys 0m1.590s > real 0m6.252s user 0m2.217s sys 0m1.357s > real 0m6.254s user 0m2.033s sys 0m1.540s > > $ ./repl grep -ci "123456a" csrc > Cmd: "grep" "-ci" "123456a" "csrc" > 0 > real 0m15.220s user 0m11.060s sys 0m2.507s > real 0m15.327s user 0m10.980s sys 0m2.583s > real 0m15.215s user 0m11.090s sys 0m2.473s > real 0m15.218s user 0m10.900s sys 0m2.663s > > $ ./repl ./h -ci "123456a" csrc > Cmd: "./h" "-ci" "123456a" "csrc" > 0 > real 0m6.651s user 0m2.307s sys 0m1.260s > real 0m6.665s user 0m2.253s sys 0m1.327s > real 0m6.602s user 0m2.320s sys 0m1.243s > real 0m6.611s user 0m2.243s sys 0m1.327s > > > > $ wc candide-8859-1.txt > 4352 35591 221126 candide-8859-1.txt > > > $ LC_ALL=fr_FR.iso88591 ./repl grep -c "tr[[=e=]]s" candide-8859-1.txt > Cmd: "grep" "-c" "tr[[=e=]]s" "candide-8859-1.txt" > 154 > real 0m4.330s user 0m3.240s sys 0m0.290s > real 0m4.340s user 0m3.273s sys 0m0.263s > real 0m4.345s user 0m3.213s sys 0m0.323s > real 0m4.346s user 0m3.213s sys 0m0.327s > > $ LC_ALL=fr_FR.iso88591 ./repl ./h -c "tr[[=e=]]s" candide-8859-1.txt > Cmd: "./h" "-c" "tr[[=e=]]s" "candide-8859-1.txt" > 154 > real 0m1.376s user 0m0.043s sys 0m0.133s > real 0m1.382s user 0m0.073s sys 0m0.103s > real 0m1.383s user 0m0.043s sys 0m0.133s > real 0m1.386s user 0m0.050s sys 0m0.127s > > $ LC_ALL=fr_FR.iso88591 ./repl grep -ci "tr[[=e=]]s" candide-8859-1.txt > Cmd: "grep" "-ci" "tr[[=e=]]s" "candide-8859-1.txt" > 155 > real 0m5.704s user 0m3.357s sys 0m0.177s > real 0m5.683s user 0m3.383s sys 0m0.150s > real 0m5.686s user 0m3.343s sys 0m0.190s > real 0m5.690s user 0m3.363s sys 0m0.173s > > $ LC_ALL=fr_FR.iso88591 ./repl ./h -ci "tr[[=e=]]s" candide-8859-1.txt > Cmd: "./h" "-ci" "tr[[=e=]]s" "candide-8859-1.txt" > 155 > real 0m1.598s user 0m0.043s sys 0m0.137s > real 0m1.600s user 0m0.060s sys 0m0.117s > real 0m1.608s user 0m0.040s sys 0m0.140s > real 0m1.609s user 0m0.070s sys 0m0.110s > > > -------------------- > > > While many, many features of GNU Grep are not supported, two switches, > -J and -K=SKIP_LOOP_EFFICIENCY_THRESHOLD, are included. Here's an > example of the debug output created when -J is specified: > > > > > $ ./h -J -c "tr[[=e=]]s" candide-8859-1.txt > * show_internals: Compiled pattern: Number of tokens: 4 > OP_CHAR_DIRECT('t') -- 0x74 > OP_CHAR_DIRECT('r') -- 0x72 > OP_CHAR_DIRECT('e') -- 0x65 > OP_CHAR_DIRECT('s') -- 0x73 > > * debug: hstbm context at 0x20ee000: > - fold_table (not used): > ................................ ................................ > ................................ ................................ > ................................ ................................ > ................................ ................................ > - delta1: > 44444444444444444444444444444444 44444444444444444444444444444444 > 44444444444444444444444444444444 44444144444444444420344444444444 > 44444444444444444444444444444444 44444444444444444444444444444444 > 44444444444444444444444444444444 44444444444444444444444444444444 > - Pattern: t r e s > > * Hybrid (memchr/memchr2) efficiency variables: > - skip_loop_efficiency_threshold: 80 > - nr_aliases: 0 0 0 0 > - first_alias: . . . . > - memchr_offset_list [negated]: 3 2 1 0 > - current_memchr_offset_index: 0 > > * Guard (delta2) and match (delta 3/4) skip variables: > - delta2 candidate(s): 4 > - delta2: 4 > - delta3: 4 4 4 > - delta4: 4 4 > > * context check... [OK] > 93 > $ ./h -J -ci "tr[[=e=]]s" candide-8859-1.txt > * show_internals: Compiled pattern: Number of tokens: 4 > OP_CHAR_CLASS(ref=1) -- members:2 > -- 54 74 Tt > OP_CHAR_CLASS(ref=2) -- members:2 > -- 52 72 Rr > OP_CHAR_CLASS(ref=3) -- members:2 > -- 45 65 Ee > OP_CHAR_CLASS(ref=4) -- members:2 > -- 53 73 Ss > > * debug: hstbm context at 0x1a4e000: > - fold_table: > ................................ !"#$%&'()*+,-./0123456789:;<=>? > @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ `abcdEfghijklmnopqRsTuvwxyz{|}~. > ................................ ................................ > ................................ ................................ > - delta1: > 44444444444444444444444444444444 44444444444444444444444444444444 > 44444144444444444420344444444444 44444144444444444420344444444444 > 44444444444444444444444444444444 44444444444444444444444444444444 > 44444444444444444444444444444444 44444444444444444444444444444444 > - Pattern: T R E S > > * Hybrid (memchr/memchr2) efficiency variables: > - skip_loop_efficiency_threshold: 80 > - nr_aliases: 1 1 1 1 > - first_alias: t r e s > - memchr_offset_list [negated]: 3 2 1 0 > - current_memchr_offset_index: 0 > > * Guard (delta2) and match (delta 3/4) skip variables: > - delta2 candidate(s): 4 > - delta2: 4 > - delta3: 4 4 4 > - delta4: 4 4 > > * context check... [OK] > 93 > $ LC_ALL=fr_FR.iso88591 ./h -J -ci "tr[[=e=]]s" candide-8859-1.txt > * show_internals: Compiled pattern: Number of tokens: 4 > OP_CHAR_CLASS(ref=1) -- members:2 > -- 54 74 Tt > OP_CHAR_CLASS(ref=2) -- members:2 > -- 52 72 Rr > OP_CHAR_CLASS(ref=3) -- members:10 > -- 45 65 c8 c9 ca cb e8 e9 ea eb Eeà à à à èé êë > OP_CHAR_CLASS(ref=4) -- members:2 > -- 53 73 Ss > > * debug: hstbm context at 0x1ae7000: > - fold_table: > ................................ !"#$%&'()*+,-./0123456789:;<=>? > @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ `abcdEfghijklmnopqRsTuvwxyz{|}~. > ................................ > ¡¢£¤¥¦§¨©ª«¬Â®¯°±²³´µ¶·¸¹º»¼½¾¿ > à Ãà à à à à à EEEEà Ãà ÃÃÃ'Ã'Ã"Ã"à à à à à à à à Ãà à à > áâãäåæçEEEEìÃîïðñòóôõö÷øùúûüýþÿ > - delta1: > 44444444444444444444444444444444 44444444444444444444444444444444 > 44444144444444444420344444444444 44444144444444444420344444444444 > 44444444444444444444444444444444 44444444444444444444444444444444 > 44444444111144444444444444444444 44444444111144444444444444444444 > - Pattern: T R E S > > * Hybrid (memchr/memchr2) efficiency variables: > - skip_loop_efficiency_threshold: 80 > - nr_aliases: 1 1 9 1 > - first_alias: t r e s > - memchr_offset_list [negated]: 3 2 0 > - current_memchr_offset_index: 0 > > * Guard (delta2) and match (delta 3/4) skip variables: > - delta2 candidate(s): 4 > - delta2: 4 > - delta3: 4 4 4 > - delta4: 4 4 > > * context check... [OK] > 155 > $ > > > > -------------------- > > > > WARNING: Because hstbm supports far fewer features than GNU Grep, > it takes less time to initialise. This will give hstbm an advantage > when the total input is small (?? perhaps less than 4k bytes?), and > should be factored out when comparing performance. (Note that both > csrc and candide-8859-1.txt above are greater than 100k bytes, so > this effect should not be a large factor in the above results.) > > > $ ./repl grep -c "123[abc]456" /dev/null > Cmd: "grep" "-c" "123[abc]456" "/dev/null" > 0 > real 0m0.990s user 0m0.073s sys 0m0.097s > real 0m1.001s user 0m0.050s sys 0m0.123s > real 0m0.993s user 0m0.020s sys 0m0.147s > real 0m0.989s user 0m0.040s sys 0m0.127s > > $ ./repl ./h -c "123[abc]456" /dev/null > Cmd: "./h" "-c" "123[abc]456" "/dev/null" > 0 > real 0m0.748s user 0m0.047s sys 0m0.103s > real 0m0.758s user 0m0.057s sys 0m0.093s > real 0m0.762s user 0m0.037s sys 0m0.113s > real 0m0.754s user 0m0.050s sys 0m0.100s > > $ ./repl grep -ci "123[abc]456" /dev/null > Cmd: "grep" "-ci" "123[abc]456" "/dev/null" > 0 > real 0m0.993s user 0m0.043s sys 0m0.127s > real 0m0.998s user 0m0.057s sys 0m0.110s > real 0m0.995s user 0m0.037s sys 0m0.130s > real 0m0.993s user 0m0.053s sys 0m0.110s > > $ ./repl ./h -ci "123[abc]456" /dev/null > Cmd: "./h" "-ci" "123[abc]456" "/dev/null" > 0 > real 0m0.768s user 0m0.037s sys 0m0.113s > real 0m0.787s user 0m0.033s sys 0m0.123s > real 0m0.770s user 0m0.037s sys 0m0.110s > real 0m0.779s user 0m0.027s sys 0m0.127s > > > > -------------------- > > > > Have fun hacking, and reading the code... all feedback, comments, > patches, etc. welcome. > > cheers, > > sur-behoffski (Brenton Hoff) > Programmer, Grouse Software