On Wed, Sep 23, 2020 at 09:11:44AM +0200, Boudewijn Dijkstra wrote:
> On Thu, 10 Sep 2020 04:01:30 +0200, Bambero <bamb...@gmail.com> wrote:
> > Hi,
> >
> > It seems that perl regular expressions lost one polish letter (ą):
> > https://www.compart.com/en/unicode/U+0105
> >
> > I can see this problem only under OpenBSD 6.7 with php-7.4 (same version
> > of php under linux is OK)
> >
> > Ex.:
> >
> > PHP 7.4.10 or 7.4.5
> > <?php var_dump(preg_match('/^.{5,64}$/', 'daswęzdas'));
> > int(1) // OK
> >
> > PHP 7.4.10 or 7.4.5
> > <?php var_dump(preg_match('/^.{5,64}$/', 'daswązdas'));
> > int(0) // Oops???
> >
> > PHP 7.3.21
> > <?php var_dump(preg_match('/^.{5,64}$/', 'daswęzdas'));
> > int(1) // OK
> >
> > PHP 7.3.21
> > <?php var_dump(preg_match('/^.{5,64}$/', 'daswązdas'));
> > int(1) // OK
> >
> > Any ideas how to fix that?
> >
> > Regards,
> > Bambero
>
> The same happens with any UTF-8 sequence that ends in 0x85.  I guess (a part
> of) PHP's PCRE code is not in UTF-8 mode, causing triggers on CHAR_NEL
> (=0x85).

I don't know a lot about PHP or the external PCRE library, but my guess
would be that PHP is treating the string as bytes, not characters.  Can you
try using the "u" (PCRE_UTF8) modifier?

https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

> for ($i = 0x75; $i <= 0x825; $i++) {
>     $u = mb_chr($i);
>     $str = 'dasw' . $u . 'zdas';
>     $r = preg_match('/^.{5,64}$/', $str);
>     if ($r == 0) {
>         printf("%04x:", $i);
>         for ($j = 0; $j < strlen($u); $j++) {
>             $b = ord(substr($str, 4 + $j));
>             printf(" %02x", $b);
>         }
>         printf(": %s\n", $str);
>     }
> }
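
For what it's worth, here is a minimal sketch of the test I have in mind. I
have not run it on OpenBSD 6.7, so treat the byte-mode result as unknown. It
uses the "\u{0105}" escape (PHP 7 and later) for ą so that the encoding of
the script file does not matter, and the variable names are just mine.

```php
<?php
// "\u{0105}" is the letter ą; the escape avoids depending on the file encoding.
$str = "dasw\u{0105}zdas";

// In UTF-8, ą is the two bytes c4 85; the second byte has the same value as NEL.
echo bin2hex("\u{0105}"), "\n";

// Byte mode (no modifier): "." works on single bytes, so the result depends on
// which newline convention the PCRE library was built with.
$bytes = preg_match('/^.{5,64}$/', $str);

// UTF-8 mode ("u" modifier): "." consumes whole characters, so the 0x85 byte
// inside ą is never looked at on its own.
$utf8 = preg_match('/^.{5,64}$/u', $str);

var_dump($bytes, $utf8);
```

If the second value is int(1) while the first is int(0), that would support
the theory that the pattern is being matched byte by byte.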