>> In the regex world, matching performance is the key aspect of an engine.
>
> Absolutely. That's why we're having this discussion: libpcre is slow when
> matching binary data.

For me the question is whether searching binary data needs to be supported
at the PCRE level. There are other questions like this. People ask for
string replacement support in PCRE from time to time. It could be
implemented, of course; we could add a whole string handling API to PCRE.
But we feel this is outside the scope of PCRE. Any string management library
can use PCRE for string replacement; we don't need another one. Binary
matching is a similar thing. PCRE is used by some (I think closed source)
projects for network data filtering, which is obviously binary data. They
use some kind of pre-filtering, data arranging and partial matching to
efficiently check TCP stream data (without waiting for the whole stream to
arrive).
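Something along these lines, just to illustrate the chunk-by-chunk approach
(a minimal sketch against the classic PCRE 8.x API; the pattern, the
scan_chunk helper and the buffer handling are made-up examples of mine, not
anything those projects actually ship):

#include <pcre.h>
#include <stdio.h>

/* Scan one network chunk. A real caller would keep the unmatched tail of
   the chunk and prepend it to the next chunk before scanning again. */
static int scan_chunk(const pcre *re, const char *chunk, int len)
{
    int ovector[30];
    int rc = pcre_exec(re, NULL, chunk, len, 0, PCRE_PARTIAL_HARD,
                       ovector, 30);

    if (rc >= 0)
        return 1;                 /* complete match inside this chunk */
    if (rc == PCRE_ERROR_PARTIAL)
        return 2;                 /* match may continue in the next chunk */
    return 0;                     /* no match */
}

int main(void)
{
    const char *error;
    int erroffset;
    pcre *re = pcre_compile("GET /secret", 0, &error, &erroffset, NULL);
    if (re == NULL) return 1;

    printf("%d\n", scan_chunk(re, "xxx GET /sec", 12));     /* 2: partial */
    printf("%d\n", scan_chunk(re, "GET /secret HTTP", 16)); /* 1: full */

    pcre_free(re);
    return 0;
}

The point is that PCRE_PARTIAL_HARD makes pcre_exec() report
PCRE_ERROR_PARTIAL when a potential match runs into the end of the chunk, so
the caller can carry the tail over to the next segment instead of buffering
the whole stream.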
>> A "simple" change like this would require a major redesign of the engine.
>
> It'd be nontrivial, yes. But it's clearly doable. (Not that I'm
> volunteering....)

Anything can be done as an algorithmic solution; that was never the
question. The question is whether it is worth doing at the PCRE level or at
a higher level. Perl/PCRE is all about text processing: characters which
have meanings, types, sub-types, other case(s). Unicode is also about
defining characters, not binary data. UTF is an encoding format, not a
mapping of random bytes to characters. If this task were trivial, I wouldn't
mind doing it myself. But it seems this task is about destroying what we
have built so far: a lot of extra checks to process invalid bytes, a large
code size increase, and the removal of a lot of optimizations. The result
might be much slower than using clever pre-filtering.

>> What should happen, if the starting offset is inside an otherwise valid
>> UTF character?
>
> The same thing that would happen if an input file started with the tail
> end of a UTF-8 sequence. The leading bytes are invalid. 'grep' deals with
> this already; it's not a problem.

The question was about intermediate start offsets. You have a 100 byte long
buffer, and you start matching from byte 50. That byte is in the middle of a
valid UTF character. Your pattern starts with an invalid character, which
matches that UTF fragment. You said invalid UTF characters match only
themselves, not parts of other characters. A lot of extra checks again,
preparing for the worst case.
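To make the "byte 50 of a 100 byte buffer" case concrete, the check that
every start offset would have to pay for looks something like this (only a
sketch; the helper names are mine, and a complete version would also have to
validate the trailing bytes and reject overlong forms):

#include <stddef.h>

/* Is this a 10xxxxxx continuation byte? */
static int is_cont(unsigned char b) { return (b & 0xc0) == 0x80; }

/* Declared length of a UTF-8 sequence from its lead byte, 0 if not a lead. */
static size_t lead_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xe0) == 0xc0) return 2;
    if ((b & 0xf0) == 0xe0) return 3;
    if ((b & 0xf8) == 0xf0) return 4;
    return 0;
}

/* Does buffer[offset] fall inside a multi-byte character that starts before
   the offset (the intermediate start offset case)? */
static int offset_splits_character(const unsigned char *buffer, size_t offset)
{
    if (!is_cont(buffer[offset]))
        return 0;                  /* offset is at a character boundary */

    /* Walk back over at most 3 continuation bytes, looking for a lead byte. */
    size_t back = 1;
    while (back <= 3 && back <= offset && is_cont(buffer[offset - back]))
        back++;

    if (back > 3 || back > offset)
        return 0;                  /* no lead byte in range: invalid fragment */

    /* The lead byte's declared length must reach past the start offset. */
    return lead_len(buffer[offset - back]) > back;
}

And this only decides whether the offset splits a character; the matcher
then still has to decide whether the invalid-character rule is allowed to
match that fragment, which is the case described above.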
>> This might be efficient for engines which scan the input only in the
>> forward direction and read every character once.
>
> It can also be efficient for matchers, like grep's, that don't necessarily
> do that. It just takes more implementation work, that's all. It's not
> rocket science to go backwards through a UTF-8 string and to catch
> decoding errors as you go.

My problem is the number of "ifs" you need to execute. Let's compare the
current and the proposed solution.

unsigned char *c_ptr; /* Current string position. */

Current:

if (c_ptr == input_start) return FAIL;
c_ptr--;
while ((*c_ptr & 0xc0) == 0x80) c_ptr--;

Proposed solution:

if (c_ptr == input_start) return FAIL;
c_ptr--;
unsigned char *saved_c_ptr = c_ptr; /* We need to save the starting
                                       position, losing a CPU register. */
while ((*c_ptr & 0xc0) == 0x80) {
    if (c_ptr == input_start) return FAIL;
    c_ptr--;
}

/* We moved back a lot, and we don't know where we are. Check the character
   length. */
int length = utf_length[*c_ptr]; /* Another lost register. The compiler's
                                    life is difficult. */
if (c_ptr + length != saved_c_ptr + 1)
    c_ptr = saved_c_ptr;
else {
    /* We need to check whether the character is encoded in the minimum
       number of bytes. */
    if (length == 1) {
        /* Great, nothing to do. */
    } else if (length == 2) {
        if (*c_ptr < 0xc2) /* Character is <= 127, can be encoded in a
                              single byte. */
            c_ptr = saved_c_ptr;
    } else if (length == 3) {
        if (*c_ptr == 0xe0 && c_ptr[1] < 0xa0) /* Character is below 0x800,
                                                  can be encoded in fewer
                                                  bytes. */
            c_ptr = saved_c_ptr;
    } else ....
}

For me this is way too many checks, and it affects compiler optimizations
too much.

Regards,
Zoltan