Hello Sergey,

> The proposed patch considerably speed-ups the exclude module
> for large exclusion lists of non-wildcard patterns.
Great idea!

'is_fnmatch_pattern' is probably a misnomer, because its argument is by
definition already an fnmatch pattern. What the function is testing is
whether the pattern contains wildcard characters [1], in other words,
whether it matches a single string (modulo case, optionally) or more.
Possibly better names are:

  fnmatch_pattern_has_wildcards
  is_fnmatch_pattern_constant   (for the opposite sense)

Btw, this function does not handle multiple adjacent backslashes
correctly, i.e. patterns such as 'ab\\[cd]'.

Also, I'm surprised that this function does not take an 'options'
argument. For example, the string 'ab\[cd\]' is constant if FNM_NOESCAPE
is off, but non-constant if FNM_NOESCAPE is on.

I would like to see some unit tests committed to this module before the
big optimization.

Thinking more into the future, there are now multiple optimizations of
regex / fnmatch coming from various sides. (Recall that fnmatch patterns
can be transformed into regular expressions.)

  - grep and awk use a wrapper around regex, composed of 'kwset' and
    'dfa', for speeding up regex. 'kwset' optimizes fgrep-like uses.

  - fnmatch suffers from an API that provides no distinct "compile" and
    "execute" passes. Ondrej Bilka was working on this in May 2009.

  - Now for the 'exclude' module we have a particular optimization for
    the case '^\(word1\|word2\|...\|more complex stuff\)$'.

Also, the regex API has a few other shortcomings:

  - It uses the current locale and does not have an explicit locale_t
    argument.

  - It does not work on Unicode strings or with Unicode character
    predicates.

I hope there will be a better, unified, optimized regex in GNU some day.

Bruno

[1] http://en.wikipedia.org/wiki/Wildcard_character
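P.S. To make the two points above concrete, here is a rough sketch (not a
proposed patch) of what a wildcard test with an 'options' argument could
look like, using the name fnmatch_pattern_has_wildcards suggested above.
It treats a backslash pair as quoting itself, so 'ab\\[cd]' is correctly
seen as non-constant, and it respects FNM_NOESCAPE:

```c
#include <stdbool.h>
#include <fnmatch.h>

/* Hypothetical sketch, not the actual gnulib function.
   Return true if PATTERN contains fnmatch wildcard characters,
   taking OPTIONS (e.g. FNM_NOESCAPE) into account.  */
static bool
fnmatch_pattern_has_wildcards (const char *pattern, int options)
{
  const char *p;

  for (p = pattern; *p != '\0'; p++)
    switch (*p)
      {
      case '\\':
        if (!(options & FNM_NOESCAPE))
          {
            /* A backslash quotes the next character, including another
               backslash; skip it so 'ab\\[cd]' leaves '[' unquoted.  */
            if (p[1] != '\0')
              p++;
          }
        /* With FNM_NOESCAPE, '\\' is an ordinary character.  */
        break;
      case '?':
      case '*':
      case '[':
        return true;
      default:
        break;
      }
  return false;
}
```

With this, 'ab\[cd\]' is reported as constant by default (the brackets
are quoted) but as non-constant under FNM_NOESCAPE, and 'ab\\[cd]' is
non-constant in both cases.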