>> In the regex world, matching performance is the key aspect of an engine.
>
> Absolutely. That's why we're having this discussion: libpcre is slow when
> matching binary data.

For me the question is whether searching binary data needs to be supported
at the PCRE level. There are other questions like this. People ask for
string replacement support in PCRE from time to time. It could be
implemented, of course; we could add a whole string handling API to PCRE.
But we feel this is outside the scope of PCRE. Any string management library
can use PCRE for string replacement; we don't need another one. Binary
matching is a similar thing. PCRE is used by some (I think closed source)
projects for network data filtering, which is obviously binary data. They
use some kind of pre-filtering, data arranging and partial matching to
efficiently check TCP stream data (without waiting for the whole stream to
arrive).
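Something along these lines, just to illustrate the chunk-by-chunk approach
(a minimal sketch against the classic PCRE 8.x API; the pattern, the
scan_chunk helper and the buffer handling are made-up examples of mine, not
anything those projects actually ship):

#include <pcre.h>
#include <stdio.h>

/* Scan one network chunk. A real caller would keep the unmatched tail of
   the chunk and prepend it to the next chunk before scanning again. */
static int scan_chunk(const pcre *re, const char *chunk, int len)
{
    int ovector[30];
    int rc = pcre_exec(re, NULL, chunk, len, 0, PCRE_PARTIAL_HARD,
                       ovector, 30);

    if (rc >= 0)
        return 1;                 /* complete match inside this chunk */
    if (rc == PCRE_ERROR_PARTIAL)
        return 2;                 /* match may continue in the next chunk */
    return 0;                     /* no match */
}

int main(void)
{
    const char *error;
    int erroffset;
    pcre *re = pcre_compile("GET /secret", 0, &error, &erroffset, NULL);
    if (re == NULL) return 1;

    printf("%d\n", scan_chunk(re, "xxx GET /sec", 12));     /* 2: partial */
    printf("%d\n", scan_chunk(re, "GET /secret HTTP", 16)); /* 1: full */

    pcre_free(re);
    return 0;
}

The point is that PCRE_PARTIAL_HARD makes pcre_exec() report
PCRE_ERROR_PARTIAL when a potential match runs into the end of the chunk, so
the caller can carry the tail over to the next segment instead of buffering
the whole stream.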
>> A "simple" change like this would require a major redesign of the engine.
>
> It'd be nontrivial, yes. But it's clearly doable. (Not that I'm
> volunteering....)

Anything can be done as an algorithmic solution; that was never the
question. The question is whether it is worth doing at the PCRE level or at
a higher level. Perl/PCRE is all about text processing: characters which
have meanings, types, sub-types, other case(s). Unicode is also about
defining characters, not binary data. UTF is an encoding format, not a
mapping of random bytes to characters. If this task were trivial, I wouldn't
mind doing it myself. But it seems this task is about destroying what we
have built so far: a lot of extra checks to process invalid bytes, a large
code size increase, and the removal of a lot of optimizations. The result
might be much slower than using clever pre-filtering.

>> What should happen, if the starting offset is inside an otherwise valid
>> UTF character?
>
> The same thing that would happen if an input file started with the tail
> end of a UTF-8 sequence. The leading bytes are invalid. 'grep' deals with
> this already; it's not a problem.

The question was about intermediate start offsets. You have a 100 byte long
buffer, and you start matching from byte 50. That byte is in the middle of a
valid UTF character. Your pattern starts with an invalid character, which
matches that UTF fragment. You said invalid UTF characters match only
themselves, not parts of other characters. A lot of extra checks again,
preparing for the worst case.
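To make the "byte 50 of a 100 byte buffer" case concrete, the check that
every start offset would have to pay for looks something like this (only a
sketch; the helper names are mine, and a complete version would also have to
validate the trailing bytes and reject overlong forms):

#include <stddef.h>

/* Is this a 10xxxxxx continuation byte? */
static int is_cont(unsigned char b) { return (b & 0xc0) == 0x80; }

/* Declared length of a UTF-8 sequence from its lead byte, 0 if not a lead. */
static size_t lead_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xe0) == 0xc0) return 2;
    if ((b & 0xf0) == 0xe0) return 3;
    if ((b & 0xf8) == 0xf0) return 4;
    return 0;
}

/* Does buffer[offset] fall inside a multi-byte character that starts before
   the offset (the intermediate start offset case)? */
static int offset_splits_character(const unsigned char *buffer, size_t offset)
{
    if (!is_cont(buffer[offset]))
        return 0;                  /* offset is at a character boundary */

    /* Walk back over at most 3 continuation bytes, looking for a lead byte. */
    size_t back = 1;
    while (back <= 3 && back <= offset && is_cont(buffer[offset - back]))
        back++;

    if (back > 3 || back > offset)
        return 0;                  /* no lead byte in range: invalid fragment */

    /* The lead byte's declared length must reach past the start offset. */
    return lead_len(buffer[offset - back]) > back;
}

And this only decides whether the offset splits a character; the matcher
then still has to decide whether the invalid-character rule is allowed to
match that fragment, which is the case described above.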
>> This might be efficient for engines which scan the input only in the
>> forward direction and read every character once.
>
> It can also be efficient for matchers, like grep's, that don't necessarily
> do that. It just takes more implementation work, that's all. It's not
> rocket science to go backwards through a UTF-8 string and to catch
> decoding errors as you go.

My problem is the number of "ifs" you need to execute. Let's compare the
current and the proposed solution.

unsigned char *c_ptr; /* Current string position. */

Current:

if (c_ptr == input_start) return FAIL;
c_ptr--;
while ((*c_ptr & 0xc0) == 0x80) c_ptr--;

Proposed solution:

if (c_ptr == input_start) return FAIL;
c_ptr--;
unsigned char *saved_c_ptr = c_ptr; /* We need to save the starting
                                       position, losing a CPU register. */
while ((*c_ptr & 0xc0) == 0x80) {
    if (c_ptr == input_start) return FAIL;
    c_ptr--;
}

/* We moved back a lot, and we don't know where we are. Check the character
   length. */
int length = utf_length[*c_ptr]; /* Another lost register. The compiler's
                                    life is difficult. */
if (c_ptr + length != saved_c_ptr + 1)
    c_ptr = saved_c_ptr;
else {
    /* We need to check whether the character is encoded in the minimum
       number of bytes. */
    if (length == 1) {
        /* Great, nothing to do. */
    } else if (length == 2) {
        if (*c_ptr < 0xc2) /* Character is <= 127, can be encoded in a
                              single byte. */
            c_ptr = saved_c_ptr;
    } else if (length == 3) {
        if (*c_ptr == 0xe0 && c_ptr[1] < 0xa0) /* Character is below 0x800,
                                                  can be encoded in fewer
                                                  bytes. */
            c_ptr = saved_c_ptr;
    } else ....
}

For me this is way too many checks, and it affects compiler optimizations
too much.

Regards,
Zoltan