Re: msgmerge speedup: fstrcmp and diffseq improvements

Ralf Wildenhues Fri, 19 Sep 2008 12:44:16 -0700

Hi Bruno,

* Bruno Haible wrote on Tue, Sep 16, 2008 at 03:28:51AM CEST:
> 
> > Your draft patch doesn't directly apply to the current CVS
> > tree; I haven't tested and looked at it in detail yet.
> 
> It is all committed now.


Thanks.  Some change between then and now caused differences in the
coreutils po files, though.  For example in de.po:

| @@ -4416,9 +4416,9 @@
|  msgstr "Es ist kein Name zur Nutzer‐ID %lu zu finden"
|  
|  #: src/id.c:268
| -#, c-format
| +#, fuzzy, c-format
|  msgid "uid=%lu"
| -msgstr ""
| +msgstr "id="
|  
|  #: src/id.c:273
|  #, c-format
| @@ -10979,9 +10979,11 @@
|  msgid "exit="
|  msgstr "exit="
|  
| +# 8 chars are okay
|  #: src/who.c:475
| +#, fuzzy
|  msgid "LOGIN"
| -msgstr ""
| +msgstr "LEITUNG"
|  
|  #: src/who.c:495
|  msgid "clock change"

Is that intentional?

> Since clearing and summing up an array of size 256 is a bit expensive if
> both strings are small, I made it conditional on
>    (xvec_length + yvec_length >= 20).
> The threshold 20 was determined experimentally:

Good idea.

I considered recording the min and max character value occurrence, but
then guessed that if that introduces extra branches, it's probably not
worth it.  Summing an array of 256 ints is relatively fast after all.

The shortcut ratios are now (in percentage of remaining pairs taken;
averaged over all .po files in coreutils not needing iconv):

- zero length 1%
- string length difference 74%
- string length sum >= 20: 99%
- frequency bound: 66%
- premature compareseq termination: 98%

compareseq ranges at 86% of the total execution time, so if there are
more cheap bounds, more power to them.

BTW, I now found a newer paper that has some more interesting stuff, if
a bit overkill for msgmerge: <http://fastss.csg.uzh.ch/ifi-2007.02.pdf>.
Also, there is this good overview of algorithms:
<ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/survasm.ps.gz>.

I tried a 2-gram and 4-gram bound but their hit rate was around 1% so
I discarded that again (for the 4-gram that was to be expected due to
the indexing already done).


And yes, I can't help but ask a few more nitty questions again:  ;-)

> + #if CHAR_BIT <= 8

FWIW, CHAR_BIT cannot be smaller than 8.

FWIW2, not inlining the bound functions allows
- for less stack usage (in the frequency bound case), or
- the compiler to decide about it (all modern compilers can do that),
- for easier profiling with gprof (when optimization is turned off),
- to eventually look into building a product metric out of them, for
  "clustering" or so.

> +       for (i = xvec_length - 1; i >= 0; i--)
> +         occ_diff[(unsigned char) string1[i]]++;
> +       /* Subtract the occurrence counts in Y.  */
> +       for (i = yvec_length - 1; i >= 0; i--)
> +         occ_diff[(unsigned char) string2[i]]--;

Just curious: is it advantageous to traverse the strings backwards?

> +       /* Sum up the absolute values.  */
> +       sum = 0;
> +       for (i = 0; i <= UCHAR_MAX; i++)
> +         {
> +           int d = occ_diff[i];
> +           sum += (d >= 0 ? d : -d);

Curious again: is it advantageous to obscure abs like this?
Compilers know how to optimize abs, no?)

Thanks,
Ralf

Re: msgmerge speedup: fstrcmp and diffseq improvements

Reply via email to