Re: msgmerge speedup: fstrcmp and diffseq improvements

Bruno Haible Sat, 20 Sep 2008 07:20:23 -0700

Hi Ralf,

> > It is all committed now.
> 
> Thanks.  Some change between then and now caused differences in the
> coreutils po files, though.


Yes, this is intentional. In the draft patch, I had special code to
guarantee backward compatibility, namely in case of ambiguity, to choose
the string with comes first in the PO file. Now, without this special code,
it chooses the string which is a closer match in the 4-grams metric.

This is not the first change in the behaviour of msgmerge: another such
change was between versions 0.14.6 and 0.15, and no-one complained about it.

> The shortcut ratios are now (in percentage of remaining pairs taken;
> averaged over all .po files in coreutils not needing iconv):
> 
> - zero length 1%
> - string length difference 74%
> - string length sum >= 20: 99%
> - frequency bound: 66%
> - premature compareseq termination: 98%

Thanks for these figures. I'm adding them as comments to the code, to help
future maintenance.

> compareseq ranges at 86% of the total execution time, so if there are
> more cheap bounds, more power to them.

Yes...

> I tried a 2-gram and 4-gram bound but their hit rate was around 1% so
> I discarded that again (for the 4-gram that was to be expected due to
> the indexing already done).

Btw, why 4-grams? Why not 3-grams or 5-grams? I have not benchmarked it.

> And yes, I can't help but ask a few more nitty questions again:  ;-)
> 
> > + #if CHAR_BIT <= 8
> 
> FWIW, CHAR_BIT cannot be smaller than 8.

The intent is to disable this code when CHAR_BIT is 16 or 32.
I could also have written '#if CHAR_BIT <= 10' but that sounds too much
like nonsense.

> FWIW2, not inlining the bound functions allows
> - for less stack usage (in the frequency bound case), or

We're near the end of the stack here: only compareseq and diag come below it.
Anyway, 1 KB of stack space looks OK, the default stack size for threads being
at least 16 KB.

> - the compiler to decide about it (all modern compilers can do that),

But the compiler does not know that fstrcmp is called millions of time and
that this piece of code needs to be optimized for speed rather than for space.

Does GCC meanwhile have a heuristic that forces inlining of a 'static'
function that is only used once in the compilation unit?

> - for easier profiling with gprof (when optimization is turned off),

Sorry, I hate to write source code to care about deficiencies of a particular
tool. gprof's deficiency is that it cannot profile below the function level.
For profiling at the line or instruction level, you can use the Google
profiler (Linux: <http://code.google.com/p/google-perftools/>) or Shark
(MacOS X, part of Xcode: 
<http://developer.apple.com/tools/shark_optimize.html>).

> - to eventually look into building a product metric out of them, for
>   "clustering" or so.

Show me that you can do something with the "clustering" idea here, then
I'm open to all patches that you send :-)

> > +     for (i = xvec_length - 1; i >= 0; i--)
> > +       occ_diff[(unsigned char) string1[i]]++;
> > +     /* Subtract the occurrence counts in Y.  */
> > +     for (i = yvec_length - 1; i >= 0; i--)
> > +       occ_diff[(unsigned char) string2[i]]--;
> 
> Just curious: is it advantageous to traverse the strings backwards?

I think the order should have no effect on the cache, since the cache lines
will simply be traversed in opposite order. The next most important topic
to look at, for speed, are the CPU instructions. And indeed, when the compiler
does not apply loop strength reduction (i.e. rearrange the loop in completely
unobvious ways) comparing i against 0 rather than against xvec_length saves
2 instructions inside the loop on x86, 1 instruction on SPARC. See the attached
output, produced with gcc 4.3.0.

> > +     /* Sum up the absolute values.  */
> > +     sum = 0;
> > +     for (i = 0; i <= UCHAR_MAX; i++)
> > +       {
> > +         int d = occ_diff[i];
> > +         sum += (d >= 0 ? d : -d);
> 
> Curious again: is it advantageous to obscure abs like this?
> Compilers know how to optimize abs, no?

Indeed, all gcc versions since 2.95 at least inline an 'abs' call. I'm
just over-cautious.

Bruno


--- lib/fstrcmp.c.orig  2008-09-20 15:29:49.000000000 +0200
+++ lib/fstrcmp.c       2008-09-20 15:16:14.000000000 +0200
@@ -100,6 +100,13 @@
 gl_once_define(static, keys_init_once)
 
 
+/* In the code below, branch probabilities were measured by Ralf Wildenhues,
+   by running "msgmerge LL.po coreutils.pot" with msgmerge 0.18 for many
+   values of LL.  The probability indicates that the condition evaluates
+   to true; whether that leads to a branch or a non-branch in the code,
+   depends on the compiler's reordering of basic blocks.  */
+
+
 double
 fstrcmp_bounded (const char *string1, const char *string2, double lower_bound)
 {
@@ -113,7 +120,7 @@
   size_t bufmax;
 
   /* short-circuit obvious comparisons */
-  if (xvec_length == 0 || yvec_length == 0)
+  if (xvec_length == 0 || yvec_length == 0) /* Prob: 1% */
     return (xvec_length == 0 && yvec_length == 0 ? 1.0 : 0.0);
 
   if (lower_bound > 0)
@@ -138,14 +145,14 @@
        (double) (2 * MIN (xvec_length, yvec_length))
        / (xvec_length + yvec_length);
 
-      if (upper_bound < lower_bound)
+      if (upper_bound < lower_bound) /* Prob: 74% */
        /* Return an arbitrary value < LOWER_BOUND.  */
        return 0.0;
 
 #if CHAR_BIT <= 8
       /* When X and Y are both small, avoid the overhead of setting up an
         array of size 256.  */
-      if (xvec_length + yvec_length >= 20)
+      if (xvec_length + yvec_length >= 20) /* Prob: 99% */
        {
          /* Compute a less quick upper bound.
             Each edit is an insertion or deletion of a character, hence
@@ -185,7 +192,7 @@
 
          upper_bound = 1.0 - (double) sum / (xvec_length + yvec_length);
 
-         if (upper_bound < lower_bound)
+         if (upper_bound < lower_bound) /* Prob: 66% */
            /* Return an arbitrary value < LOWER_BOUND.  */
            return 0.0;
        }
@@ -245,7 +252,7 @@
 
   /* Now do the main comparison algorithm */
   ctxt.edit_count = - ctxt.edit_count_limit;
-  if (compareseq (0, xvec_length, 0, yvec_length, 0, &ctxt))
+  if (compareseq (0, xvec_length, 0, yvec_length, 0, &ctxt)) /* Prob: 98% */
     /* The edit_count passed the limit.  Hence the result would be
        < lower_bound.  We can return any value < lower_bound instead.  */
     return 0.0;

#include <limits.h>
#include <stdlib.h>
#include <string.h>

void
compute_forward (const char *string, unsigned int len, int occ[UCHAR_MAX + 1])
{
  int i;

  memset (occ, 0, (UCHAR_MAX + 1) * sizeof (int));
  for (i = 0; i < len; i++)
    occ[(unsigned char) string[i]]++;
}

void
compute_backward (const char *string, unsigned int len, int occ[UCHAR_MAX + 1])
{
  int i;

  memset (occ, 0, (UCHAR_MAX + 1) * sizeof (int));
  for (i = len - 1; i >= 0; i--)
    occ[(unsigned char) string[i]]++;
}

int
sumup_no_abs (int occ_diff[UCHAR_MAX + 1])
{
  int sum;
  int i;

  sum = 0;
  for (i = 0; i <= UCHAR_MAX; i++)
    {
      int d = occ_diff[i];
      sum += (d >= 0 ? d : -d);
    }
  return sum;
}

int
sumup_abs (int occ_diff[UCHAR_MAX + 1])
{
  int sum;
  int i;

  sum = 0;
  for (i = 0; i <= UCHAR_MAX; i++)
    sum += abs (occ_diff[i]);
  return sum;
}

foo.s
Description: Binary data

Re: msgmerge speedup: fstrcmp and diffseq improvements

Reply via email to