Re: UTF-8 problem with php-7.4

Andrew Hewus Fresh Wed, 23 Sep 2020 17:57:47 -0700

On Wed, Sep 23, 2020 at 09:11:44AM +0200, Boudewijn Dijkstra wrote:
> On Thu, 10 Sep 2020 04:01:30 +0200, Bambero <bamb...@gmail.com> wrote:
> > Hi,
> > 
> > It seems that Perl-compatible regular expressions lost one Polish letter (ą):
> > https://www.compart.com/en/unicode/U+0105
> > 
> > I can see this problem only under OpenBSD 6.7 with php-7.4 (the same
> > version of PHP under Linux is OK)
> > 
> > Ex.:
> > 
> > PHP 7.4.10 or 7.4.5
> > <?php var_dump(preg_match('/^.{5,64}$/', 'daswęzdas'));
> > int(1) // OK
> > 
> > PHP 7.4.10 or 7.4.5
> > <?php var_dump(preg_match('/^.{5,64}$/', 'daswązdas'));
> > int(0) // Oops???
> > 
> > PHP 7.3.21
> > <?php var_dump(preg_match('/^.{5,64}$/', 'daswęzdas'));
> > int(1) // OK
> > 
> > PHP 7.3.21
> > <?php var_dump(preg_match('/^.{5,64}$/', 'daswązdas'));
> > int(1) // OK
> > 
> > Any ideas how to fix that?
> > 
> > Regards,
> > Bambero
> 
> The same happens with any UTF-8 sequence that ends in 0x85.  I guess (part
> of) PHP's PCRE code is not in UTF-8 mode, so the trailing byte is taken for
> CHAR_NEL (=0x85), a newline character that "." does not match.



I don't know a lot about PHP or the external PCRE library, but my guess
would be that PHP is treating the string as bytes, not characters.  Can
you try using the "u" (PCRE_UTF8) modifier?

https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
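
Something along these lines is what I have in mind.  It is only a sketch:
the pattern is the one from the report with "u" appended, and the helper
name matches_length() is made up for illustration, not part of PHP or of
the original test case.

```php
<?php
// Hypothetical helper (not from the original report): the same pattern,
// but with the "u" modifier so that PCRE treats both the pattern and the
// subject as UTF-8 rather than as raw bytes.
function matches_length(string $s): bool
{
    // preg_match() returns 1 on a match, 0 on no match, and false on
    // error (for example when the subject is not valid UTF-8).
    return preg_match('/^.{5,64}$/u', $s) === 1;
}

var_dump(matches_length('daswęzdas'));
var_dump(matches_length('daswązdas'));
```

With "u" I would expect both calls to print bool(true), since "." then
consumes the whole two-byte sequence for ą instead of stopping at its 0x85
byte.  Note that {5,64} then counts characters rather than bytes.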



> for ($i = 0x75; $i <= 0x825; $i++) {
>         $u = mb_chr($i);
>         $str = 'dasw' . $u . 'zdas';
>         $r = preg_match('/^.{5,64}$/', $str);
>         if ($r == 0) {
>                 printf("%04x:", $i);
>                 for ($j = 0; $j < strlen($u); $j++) {
>                         $b = ord(substr($str, 4 + $j));
>                         printf(" %02x", $b);
>                 }
>                 printf(": %s\n", $str);
>         }
> }
> 
> 
> -- 
> Made with Opera's e-mail program: http://www.opera.com/mail/
>
