Hi Ralf, > > It is all committed now. > > Thanks. Some change between then and now caused differences in the > coreutils po files, though.
Yes, this is intentional. In the draft patch, I had special code to guarantee backward compatibility, namely in case of ambiguity, to choose the string with comes first in the PO file. Now, without this special code, it chooses the string which is a closer match in the 4-grams metric. This is not the first change in the behaviour of msgmerge: another such change was between versions 0.14.6 and 0.15, and no-one complained about it. > The shortcut ratios are now (in percentage of remaining pairs taken; > averaged over all .po files in coreutils not needing iconv): > > - zero length 1% > - string length difference 74% > - string length sum >= 20: 99% > - frequency bound: 66% > - premature compareseq termination: 98% Thanks for these figures. I'm adding them as comments to the code, to help future maintenance. > compareseq ranges at 86% of the total execution time, so if there are > more cheap bounds, more power to them. Yes... > I tried a 2-gram and 4-gram bound but their hit rate was around 1% so > I discarded that again (for the 4-gram that was to be expected due to > the indexing already done). Btw, why 4-grams? Why not 3-grams or 5-grams? I have not benchmarked it. > And yes, I can't help but ask a few more nitty questions again: ;-) > > > + #if CHAR_BIT <= 8 > > FWIW, CHAR_BIT cannot be smaller than 8. The intent is to disable this code when CHAR_BIT is 16 or 32. I could also have written '#if CHAR_BIT <= 10' but that sounds too much like nonsense. > FWIW2, not inlining the bound functions allows > - for less stack usage (in the frequency bound case), or We're near the end of the stack here: only compareseq and diag come below it. Anyway, 1 KB of stack space looks OK, the default stack size for threads being at least 16 KB. > - the compiler to decide about it (all modern compilers can do that), But the compiler does not know that fstrcmp is called millions of time and that this piece of code needs to be optimized for speed rather than for space. Does GCC meanwhile have a heuristic that forces inlining of a 'static' function that is only used once in the compilation unit? > - for easier profiling with gprof (when optimization is turned off), Sorry, I hate to write source code to care about deficiencies of a particular tool. gprof's deficiency is that it cannot profile below the function level. For profiling at the line or instruction level, you can use the Google profiler (Linux: <http://code.google.com/p/google-perftools/>) or Shark (MacOS X, part of Xcode: <http://developer.apple.com/tools/shark_optimize.html>). > - to eventually look into building a product metric out of them, for > "clustering" or so. Show me that you can do something with the "clustering" idea here, then I'm open to all patches that you send :-) > > + for (i = xvec_length - 1; i >= 0; i--) > > + occ_diff[(unsigned char) string1[i]]++; > > + /* Subtract the occurrence counts in Y. */ > > + for (i = yvec_length - 1; i >= 0; i--) > > + occ_diff[(unsigned char) string2[i]]--; > > Just curious: is it advantageous to traverse the strings backwards? I think the order should have no effect on the cache, since the cache lines will simply be traversed in opposite order. The next most important topic to look at, for speed, are the CPU instructions. And indeed, when the compiler does not apply loop strength reduction (i.e. rearrange the loop in completely unobvious ways) comparing i against 0 rather than against xvec_length saves 2 instructions inside the loop on x86, 1 instruction on SPARC. See the attached output, produced with gcc 4.3.0. > > + /* Sum up the absolute values. */ > > + sum = 0; > > + for (i = 0; i <= UCHAR_MAX; i++) > > + { > > + int d = occ_diff[i]; > > + sum += (d >= 0 ? d : -d); > > Curious again: is it advantageous to obscure abs like this? > Compilers know how to optimize abs, no? Indeed, all gcc versions since 2.95 at least inline an 'abs' call. I'm just over-cautious. Bruno --- lib/fstrcmp.c.orig 2008-09-20 15:29:49.000000000 +0200 +++ lib/fstrcmp.c 2008-09-20 15:16:14.000000000 +0200 @@ -100,6 +100,13 @@ gl_once_define(static, keys_init_once) +/* In the code below, branch probabilities were measured by Ralf Wildenhues, + by running "msgmerge LL.po coreutils.pot" with msgmerge 0.18 for many + values of LL. The probability indicates that the condition evaluates + to true; whether that leads to a branch or a non-branch in the code, + depends on the compiler's reordering of basic blocks. */ + + double fstrcmp_bounded (const char *string1, const char *string2, double lower_bound) { @@ -113,7 +120,7 @@ size_t bufmax; /* short-circuit obvious comparisons */ - if (xvec_length == 0 || yvec_length == 0) + if (xvec_length == 0 || yvec_length == 0) /* Prob: 1% */ return (xvec_length == 0 && yvec_length == 0 ? 1.0 : 0.0); if (lower_bound > 0) @@ -138,14 +145,14 @@ (double) (2 * MIN (xvec_length, yvec_length)) / (xvec_length + yvec_length); - if (upper_bound < lower_bound) + if (upper_bound < lower_bound) /* Prob: 74% */ /* Return an arbitrary value < LOWER_BOUND. */ return 0.0; #if CHAR_BIT <= 8 /* When X and Y are both small, avoid the overhead of setting up an array of size 256. */ - if (xvec_length + yvec_length >= 20) + if (xvec_length + yvec_length >= 20) /* Prob: 99% */ { /* Compute a less quick upper bound. Each edit is an insertion or deletion of a character, hence @@ -185,7 +192,7 @@ upper_bound = 1.0 - (double) sum / (xvec_length + yvec_length); - if (upper_bound < lower_bound) + if (upper_bound < lower_bound) /* Prob: 66% */ /* Return an arbitrary value < LOWER_BOUND. */ return 0.0; } @@ -245,7 +252,7 @@ /* Now do the main comparison algorithm */ ctxt.edit_count = - ctxt.edit_count_limit; - if (compareseq (0, xvec_length, 0, yvec_length, 0, &ctxt)) + if (compareseq (0, xvec_length, 0, yvec_length, 0, &ctxt)) /* Prob: 98% */ /* The edit_count passed the limit. Hence the result would be < lower_bound. We can return any value < lower_bound instead. */ return 0.0;
#include <limits.h> #include <stdlib.h> #include <string.h> void compute_forward (const char *string, unsigned int len, int occ[UCHAR_MAX + 1]) { int i; memset (occ, 0, (UCHAR_MAX + 1) * sizeof (int)); for (i = 0; i < len; i++) occ[(unsigned char) string[i]]++; } void compute_backward (const char *string, unsigned int len, int occ[UCHAR_MAX + 1]) { int i; memset (occ, 0, (UCHAR_MAX + 1) * sizeof (int)); for (i = len - 1; i >= 0; i--) occ[(unsigned char) string[i]]++; } int sumup_no_abs (int occ_diff[UCHAR_MAX + 1]) { int sum; int i; sum = 0; for (i = 0; i <= UCHAR_MAX; i++) { int d = occ_diff[i]; sum += (d >= 0 ? d : -d); } return sum; } int sumup_abs (int occ_diff[UCHAR_MAX + 1]) { int sum; int i; sum = 0; for (i = 0; i <= UCHAR_MAX; i++) sum += abs (occ_diff[i]); return sum; }
foo.s
Description: Binary data