Re: [Patch] faster fnmatch

Bruno Haible Fri, 01 May 2009 05:52:31 -0700

Hello Ondrej,

Nice work already!


> *abc*
> en_US.UTF-8
>  user: 2.71
>  user: 0.23
> ...
> *ab[a-z]*
> en_US.UTF-8
>  user: 2.85
>  user: 3.32
> C
>  user: 0.86
>  user: 1.25

Hmm, well, users expect your module to be faster in every case. If in
some (not too rare) cases the glibc fnmatch is faster, it's difficult to
convince people to use your function.

> Still I assume UTF8 or singlebyte encoding otherwise I fallback.

The general and hard case is the multibyte encoding case, where the encoding
is something like GB18030, BIG5, or similar. Single-byte encodings may be
worth optimizing for, because the C locale on glibc systems uses a single-byte
encoding. Whether UTF-8 is worth special-casing depends on the complexity.

> I could generalize this to wide characters but if I dont know any character
> is single wchar I cannot do much (UTF16 wchar in windows).

It's OK to pretend that every wchar_t is a single character. I.e. you can
assume that when you use mbstowcs on a string, you get an array of characters.
Yes, on Windows, wchar_t is UTF-16 probably, but in practice this does not
make much of a difference, because the Unicode characters in the range
U+10000..U+10FFFF are not among the "interesting" characters (patterns that
contain such a character inside brackets are rare) and because they occur
rarely enough.
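
To illustrate, here is a minimal sketch of that assumption (not code from
the patch; the helper name to_wide is made up for this example). It converts
a multibyte string with mbstowcs and then treats every wchar_t in the result
as one character. The buffer is sized from strlen, relying on the fact that
a string never yields more wide characters than it has bytes.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the multibyte string S (in the current
   locale's encoding) to a freshly allocated wide string and store its
   length in *LEN.  Each wchar_t is then treated as one character.
   Returns NULL on an invalid multibyte sequence or on out-of-memory.  */
wchar_t *
to_wide (const char *s, size_t *len)
{
  /* A string never produces more wide characters than it has bytes.  */
  size_t max = strlen (s) + 1;
  wchar_t *w = malloc (max * sizeof (wchar_t));
  size_t n;

  if (w == NULL)
    return NULL;
  n = mbstowcs (w, s, max);
  if (n == (size_t) -1)
    {
      free (w);
      return NULL;
    }
  *len = n;
  return w;
}
```

The caller is expected to have called setlocale beforehand and to free the
result.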

> +      This program is free software; you can redistribute it and/or modify
> +      it under the terms of the GNU General Public License as published by
> +      the Free Software Foundation; either version 2, or (at your option)
> +      any later version.
> +
> +      This program is distributed in the hope that it will be useful,
> +      but WITHOUT ANY WARRANTY; without even the implied warranty of
> +      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +      GNU General Public License for more details.
> +
> +      You should have received a copy of the GNU General Public License
> +      along with this program; if not, write to the Free Software Foundation,
> +      Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.  */

For code in gnulib, please use the same copyright header as in many other
files, namely GPLv3+ with a pointer to the FSF's web site rather than with
its postal address.

> +int fnmatch_exec (fnmatch_state * pattern, const char *string);

Could this function prototype also be
   int fnmatch_exec (const fnmatch_state * pattern, const char *string);
? If not, this means fnmatch_exec modifies the compiled state, which
makes it not multithread-safe to use the same fnmatch_state in multiple
threads?

> +struct _fnmatch_state;
> +typedef struct _fnmatch_state fnmatch_state;

Is it really a "state", i.e. a data structure that is modified over and
over again? I would have expected a compiled pattern data structure, that
is created by fnmatch_compile and then left unchanged until fnmatch_free.
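
What I have in mind looks roughly like the following sketch. It is a toy,
not your implementation: the names toy_pattern, toy_compile, toy_exec and
toy_free are invented here, and the "compilation" merely stores a copy of the
pattern and delegates to the POSIX fnmatch. The point is only the shape of
the API: the exec function takes a pointer to const and writes nothing.

```c
#include <fnmatch.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical compiled pattern: filled in once by toy_compile and
   never modified afterwards.  */
typedef struct
{
  char *pattern;
  int flags;
} toy_pattern;

/* Create a compiled pattern.  Returns NULL on out-of-memory.  */
toy_pattern *
toy_compile (const char *pattern, int flags)
{
  toy_pattern *p = malloc (sizeof *p);
  size_t len;

  if (p == NULL)
    return NULL;
  len = strlen (pattern) + 1;
  p->pattern = malloc (len);
  if (p->pattern == NULL)
    {
      free (p);
      return NULL;
    }
  memcpy (p->pattern, pattern, len);
  p->flags = flags;
  return p;
}

/* Match STRING against P.  P is only read, so the same compiled
   pattern can be used from several threads without locking.  */
int
toy_exec (const toy_pattern *p, const char *string)
{
  return fnmatch (p->pattern, string, p->flags);
}

/* Release a compiled pattern.  */
void
toy_free (toy_pattern *p)
{
  if (p != NULL)
    {
      free (p->pattern);
      free (p);
    }
}
```

If the matcher needs scratch space, it can live on the stack of the exec
function or be passed in separately, so that the compiled pattern stays
read-only.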

> +#include <stdint.h>
> +
> +#define  _GNU_SOURCE

"#define _GNU_SOURCE" has no effect after a system include file such as
<stdint.h> has already been included. Also, "#define _GNU_SOURCE" has no
effect on Solaris.
For this reason, gnulib has a module 'extensions' which does the right thing
(and which you're already using).
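
As a fragment (it needs gnulib's generated config.h, so it does not compile
on its own), the usual include order in a gnulib source file is sketched
below. The choice of <fnmatch.h> and FNM_CASEFOLD is only an illustration
of a GNU extension that depends on the macro.

```c
/* <config.h> comes first.  With the 'extensions' module it defines
   _GNU_SOURCE and the corresponding macros for other systems, before
   any system header has been seen.  */
#include <config.h>

#include <stdint.h>
#include <fnmatch.h>   /* GNU extensions such as FNM_CASEFOLD are now visible */
```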

> +#define  CHARSPERINT 4

For better portability, write this as
  #define CHARSPERINT (sizeof (int) / sizeof (char))

> +  uint32_t hash[8];

For better portability, write (UCHAR_MAX / 32) + 1 instead of 8.
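
For example, assuming the array is used as a bitmap with one bit per
possible byte value (I am guessing at its purpose; the names HASH_WORDS and
byte_set are made up), a sketch could look like this:

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Number of 32-bit words needed to hold one bit per unsigned char value.
   This is 8 when UCHAR_MAX is 255.  */
#define HASH_WORDS ((UCHAR_MAX / 32) + 1)

/* Hypothetical set of byte values.  */
struct byte_set
{
  uint32_t bits[HASH_WORDS];
};

/* Make S the empty set.  */
void
byte_set_clear (struct byte_set *s)
{
  memset (s->bits, 0, sizeof s->bits);
}

/* Add the byte C to S.  */
void
byte_set_add (struct byte_set *s, unsigned char c)
{
  s->bits[c / 32] |= (uint32_t) 1 << (c % 32);
}

/* Return 1 if the byte C is in S, 0 otherwise.  */
int
byte_set_has (const struct byte_set *s, unsigned char c)
{
  return (int) ((s->bits[c / 32] >> (c % 32)) & 1);
}
```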

> +int
> +ascii_compatible_encoding (char *c)

Could this function be 'static'? And take a 'const char *' argument?

> +void
> +strfold (const char *str, char *buf)
> +{
> +  while (*str)
> +    *buf++ = tolower (*str++);

tolower() is only useful in single-byte locales. For support of
multibyte locales, gnulib offers the functions mbmemcasecmp, mbscasecmp,
mbspcasecmp.
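
If you go the wide-character route mentioned above instead of using these
gnulib functions, the folding can be done per wchar_t with towlower from
<wctype.h>. This is a different technique from the mbs* functions, and the
name wcsfold below is made up; take it as a sketch only.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical wide-character counterpart of strfold: store a lowercased,
   NUL-terminated copy of SRC in DST.  DST must have room for
   wcslen (SRC) + 1 elements.  */
void
wcsfold (const wchar_t *src, wchar_t *dst)
{
  while (*src != L'\0')
    *dst++ = (wchar_t) towlower ((wint_t) *src++);
  *dst = L'\0';
}
```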

> +  for (pass = 0; pass < 2; pass++)
> +    {                           //   two passes in first we compute values we use
> +      patsize = size;           //     in second to optimalize.

Not all C compilers support ISO C 99 / C++ style comments. Better use the old
style of comments, even if it means more typing.

> +            case '[':;
> +              int siz = parse_bracket (s + 1, pat, flags);

Declarations after statements also are only a C99 / C++ feature. For ANSI C,
you need to put braces around this block.
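
A sketch of the brace placement, in a made-up loop (bracket_len is only a
stand-in for your parse_bracket, and count_pattern_units is invented for
this example):

```c
/* Hypothetical stand-in for parse_bracket: S points just after a '['.
   Return the number of characters up to and including the closing ']',
   or 0 if there is none.  */
static int
bracket_len (const char *s)
{
  int i;

  for (i = 0; s[i] != '\0'; i++)
    if (s[i] == ']')
      return i + 1;
  return 0;
}

/* Count the units of the pattern S, where a bracket expression counts
   as one unit and every other character counts as one unit.  */
int
count_pattern_units (const char *s)
{
  int n = 0;

  while (*s != '\0')
    {
      switch (*s)
        {
        case '[':
          {
            /* The braces open a new block, so this declaration
               is valid in ANSI C.  */
            int siz = bracket_len (s + 1);
            s += siz + 1;
            n++;
          }
          break;
        default:
          s++;
          n++;
          break;
        }
    }
  return n;
}
```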

Like Ralf, I have not read through most of the code yet.

Bruno
