Edit report at https://bugs.php.net/bug.php?id=65082&edit=1
ID: 65082 User updated by: masakielastic at gmail dot com Reported by: masakielastic at gmail dot com Summary: json_encode's option for replacing ill-formd byte sequences with substitute cha Status: Assigned Type: Feature/Change Request Package: JSON related Operating System: All PHP Version: 5.5.0 Assigned To: remi Block user comment: N Private report: N New Comment: I created a repo for the patches and the report of benchmarks https://github.com/masakielastic/patches/tree/master/php_bugs_65082 The difference between json_utf8_to_utf16 and json_utf8_to_utf32 isn't seen. the use of json_utf8_to_utf32 or the direct use of php_next_utf8_char in json_escape_string is better choice for JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_SUBSTITUTE|JSON_UNESCAPED_UNICODE. php_next_utf8_char in json_escape_string is a bit faster than json_utf8_to_utf32 for JSON_NOTUTF8_SUBSTITUTE. https://github.com/masakielastic/patches/blob/master/php_bugs_65082/04_php_next_ utf8_char_in_json_escape_string.patch https://github.com/masakielastic/patches/blob/master/php_bugs_65082/04_php_next_ utf8_char_in_json_escape_string.c Previous Comments: ------------------------------------------------------------------------ [2013-07-19 16:46:49] masakielastic at gmail dot com Another way of perfomance improvemnet is using php_next_utf8_char directly in json_escape_string on the condition of PHP_JSON_NOTUTF8_SUBSTITUTE and PHP_JSON_NOTUTF8_IGNORE. This way reduces one loop compared with using json_utf8_to_utf16. ------------------------------------------------------------------------ [2013-07-19 16:33:24] masakielastic at gmail dot com I agree with you on isolated surrogate pairs. The test cases for json_decode and JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE must be contained since json_decode uses json_utf8_to_utf16. https://github.com/php/php-src/blob/master/ext/json/json.c#L673 I already posted the test cases. https://gist.github.com/masakielastic/5973095#file-04-test-php-L26 "a\xEF\xBF\xBD" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_SUBSTITUTE), "a" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_IGNORE) The one way of perfomance improvement is adding json_utf8_to_utf32. I posted another patch. https://gist.github.com/masakielastic/5973095#file-02-json_unescaped_unicode- patch I created unsigned int *utf32 data type for not changing unsigned short *utf16 data type. If you want to provide a common variable for json_utf8_to_utf16 and json_utf8_to_utf32, the modification for JSON_parser.c is also needed. The one of candidate for the name of variable is unsigned int *code_codes. http://www.unicode.org/glossary/#code_unit I also updated the previous patch. https://gist.github.com/masakielastic/5973095#file-01-json_unescaped_unicode- patch if (options & PHP_JSON_UNESCAPED_UNICODE) { + if (us < 0x20) { + smart_str_appendl(buf, "\\u", 2); + smart_str_appendc(buf, digits[(us >> 12) & 0xf]); + smart_str_appendc(buf, digits[(us >> 8) & 0xf]); + smart_str_appendc(buf, digits[(us >> 4) & 0xf]); + smart_str_appendc(buf, digits[(us & 0xf)]); + } else if (us < 0x80) { ------------------------------------------------------------------------ [2013-07-15 07:31:49] r...@php.net > Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option? > The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options. The PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_IGNORE already works with my patch. Yes, PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_SUBSTITUTE doesn't work for now, but converting to utf16, then back to utf8 seems really... messy. Need something simpler. Notice: this bug is only for json_encode. Other issue have their own bug for tracking (especially the json_decode one, as I dont plan to alter it) ------------------------------------------------------------------------ [2013-07-14 12:45:47] masakielastic at gmail dot com As for JSON_NOTUTF8_IGNORE, the description for security is needed in the manual like htmlspecialchars's ENT_IGNORE http://www.php.net/manual/en/function.htmlspecialchars.php That's why I didn't sugguest JSON_IGNORE in the draft and showed Escaping RFC's link as resource. UNICODE SECURITY CONSIDERATIONS http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters IDS11-J. Eliminate noncharacter code points before validation https://www.securecoding.cert.org/confluence/display/java/IDS11- J.+Eliminate+noncharacter+code+points+before+validation ------------------------------------------------------------------------ [2013-07-14 12:31:29] masakielastic at gmail dot com Hi, nikic, sorry, ignore my last comment. I added small change in json.c https://gist.github.com/masakielastic/5973095#file-02-small_refactaring-patch ------------------------------------------------------------------------ The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at https://bugs.php.net/bug.php?id=65082 -- Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1