Hello,

On 24/09/2020 at 19:44, Andrew Hewus Fresh wrote:
On Thu, Sep 24, 2020 at 11:30:35AM +0200, Boudewijn Dijkstra wrote:
On Thu, 24 Sep 2020 02:56:51 +0200, Andrew Hewus Fresh
<and...@afresh1.com> wrote:
On Wed, Sep 23, 2020 at 09:11:44AM +0200, Boudewijn Dijkstra wrote:
On Thu, 10 Sep 2020 04:01:30 +0200, Bambero <bamb...@gmail.com> wrote:
Hi,

It seems that Perl-compatible regular expressions have lost one Polish letter (ą):
https://www.compart.com/en/unicode/U+0105

I can see this problem only under OpenBSD 6.7 with php-7.4 (the same
version of PHP under Linux is OK)

Ex.:

PHP 7.4.10 or 7.4.5
<?php var_dump(preg_match('/^.{5,64}$/', 'daswęzdas'));
int(1) // OK

PHP 7.4.10 or 7.4.5
<?php var_dump(preg_match('/^.{5,64}$/', 'daswązdas'));
int(0) // OOPS???

PHP 7.3.21
<?php var_dump(preg_match('/^.{5,64}$/', 'daswęzdas'));
int(1) // OK

PHP 7.3.21
<?php var_dump(preg_match('/^.{5,64}$/', 'daswązdas'));
int(1) // OK

Any ideas how to fix that?

Regards,
Bambero

The same happens with any UTF-8 sequence that ends in 0x85.  I guess
that (a part of) PHP's PCRE code is not in UTF-8 mode, so that byte
triggers on CHAR_NEL (= 0x85).
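
To make the byte-level observation concrete, here is a minimal sketch that dumps the UTF-8 bytes of the two letters from the report. The helper name utf8_hex() is made up for illustration, and the link to CHAR_NEL remains the guess stated above, not something this snippet proves.

```php
<?php
// Hypothetical helper (not from the thread): hex dump of a string's raw bytes.
function utf8_hex(string $s): string
{
    return bin2hex($s);
}

// "\u{...}" escapes (PHP 7.0+) keep the example independent of this file's encoding.
$a_ogonek = "\u{0105}"; // ą, the letter that fails in the report
$e_ogonek = "\u{0119}"; // ę, the letter that works in the report

echo utf8_hex($a_ogonek), "\n"; // c485: the trailing byte is 0x85
echo utf8_hex($e_ogonek), "\n"; // c499: no 0x85 byte

// Another code point whose UTF-8 encoding ends in 0x85: U+0145 (Ņ).
echo utf8_hex("\u{0145}"), "\n"; // c585
```

If the guess is right, a byte-mode match sees 0x85 as a newline-like character in the middle of the subject, which "." refuses to match.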

I don't know a lot about PHP or the external PCRE library, but my guess
would be that PHP is treating the string as bytes, not characters.  Can
you try using the "u" (PCRE_UTF8) modifier?

https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

Indeed with "u" the expected 1 is returned! Now the question is, why is this
needed on OpenBSD but not in Linux or Windows?

There are many Unicode-related changes in PHP 7.4, so I'm sure they
fixed something.
https://www.php.net/ChangeLog-7.php

I would guess that Linux and Windows both default to a UTF-8 locale,
while OpenBSD defaults to the C locale.

Does the output from locale(1) provide you any hints?

Do you get any different results testing it with `LC_ALL=en_US.UTF-8`?

I don't know enough about PHP to know how it determines what locale to
use, so that may not have any effect, or you may need to adjust
something else.

The default encoding is UTF-8, but the preg_*() functions don't follow the default_charset, input_encoding, output_encoding or internal_encoding settings the way the mbstring (multibyte) extension does.

You need to add the u modifier explicitly yourself each time you work with a UTF-8 string. There is no global config flag for that...
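
As a small sketch of what that looks like in practice, using the pattern and subject from the original report: the variable names are mine, and the result of the byte-mode call is deliberately left uncommented because, per this thread, it differs between platforms.

```php
<?php
$subject = "dasw\u{0105}zdas"; // "daswązdas", written with an escape (PHP 7.0+)

$pattern_bytes = '/^.{5,64}$/';  // no modifier: subject is matched byte by byte
$pattern_utf8  = '/^.{5,64}$/u'; // u modifier: pattern and subject are UTF-8

// With u, "." consumes whole code points, so ą is a single character.
var_dump(preg_match($pattern_utf8, $subject)); // int(1)

// Without u, the result depends on how the PCRE build treats byte 0x85.
var_dump(preg_match($pattern_bytes, $subject));

// The two modes also disagree on length: 10 bytes versus 9 characters.
var_dump(strlen($subject));                  // int(10)
var_dump(preg_match_all('/./su', $subject)); // int(9)
```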

And don't rely on mb_detect_encoding(): it's utterly broken unless you are sure your case is compatible with its limitations.
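
If all you need is to know whether a string is well-formed UTF-8 before applying the u modifier, one possible alternative (my suggestion, not something from the messages above) is to let PCRE validate it. The function name is_valid_utf8() is hypothetical; the sketch relies on preg_match() returning false for a subject that is not valid UTF-8 when u is set.

```php
<?php
// Hypothetical helper: true if $s is well-formed UTF-8, false otherwise.
// With the u modifier, preg_match() returns false for an invalid UTF-8
// subject; the empty pattern matches any valid string, including "".
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("dasw\u{0105}zdas")); // bool(true)
var_dump(is_valid_utf8("\xC4"));             // bool(false): truncated 2-byte sequence
var_dump(is_valid_utf8(""));                 // bool(true)
```

This only answers "is it valid UTF-8?"; it does not guess between encodings the way mb_detect_encoding() tries to.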

Regards,

--
Stéphane Aulery
