Edit report at https://bugs.php.net/bug.php?id=53823&edit=1
ID:                 53823
Comment by:         robertbasic dot com at gmail dot com
Reported by:        keith at chaos-realm dot net
Summary:            preg_replace: * qualifier on unicode replace garbles the string
Status:             Verified
Type:               Bug
Package:            PCRE related
Operating System:   Linux
PHP Version:        5.3SVN-2011-01-23 (snap)
Block user comment: N
Private report:     N

New Comment:

I tried my best on this one.

Tested against the trunk:

svn info | grep Revision
Revision: 323476

I created a test file for this and will attach it.

I ran the following with gdb:

$ gdb sapi/cgi/php-cgi

and then set a breakpoint:

(gdb) break php_pcre.c:1318

and finally ran the test script like this:

(gdb) run run-tests.php ext/pcre/tests/bug53823.phpt

At https://gist.github.com/1904467 I copied and pasted some output from gdb, but it might be incorrect as I'm fairly new to all this.

Anyway, lines 12 and 22 in that gist caught my attention.

Also, I think the same issue exists for preg_filter, too.

Previous Comments:
------------------------------------------------------------------------
[2011-01-26 08:02:54] ahar...@php.net

Verified on 5.3 and trunk.

------------------------------------------------------------------------
[2011-01-23 18:10:44] tino dot didriksen at gmail dot com

...and then I forgot to change the *. Let's try that again...

These work as expected:

echo preg_replace('/[^\pL\pM]+/iu', '', 'áéíóú');
echo preg_replace('/[^\pL\pM\pN]+/iu', '', 'áéíóú');

------------------------------------------------------------------------
[2011-01-23 18:09:23] tino dot didriksen at gmail dot com

A workaround is to use + instead of *.

These work as expected:

echo preg_replace('/[^\pL\pM]*/iu', '', 'áéíóú');
echo preg_replace('/[^\pL\pM\pN]*/iu', '', 'áéíóú');

------------------------------------------------------------------------
[2011-01-23 18:04:49] keith at chaos-realm dot net

.
------------------------------------------------------------------------
[2011-01-23 18:00:57] keith at chaos-realm dot net

Description:
------------
When using the following test script to strip out all Unicode characters except letters, the string becomes garbled when the * quantifier is added; the only character that survives intact is ú. Also, if you add \pN to the exceptions, the ó is preserved as well. Verified on 5.2, 5.3 and 5.3-SNAP.

Test script:
---------------
echo preg_replace('/[^\pL\pM]*/iu', '', 'áéíóú');

or

echo preg_replace('/[^\pL\pM\pN]*/iu', '', 'áéíóú');

Expected result:
----------------
áéíóú

Actual result:
--------------
����ú

or

���óú (if \pN is added to the exceptions).

------------------------------------------------------------------------