Edit report at https://bugs.php.net/bug.php?id=65082&edit=1
ID:                 65082
User updated by:    masakielastic at gmail dot com
Reported by:        masakielastic at gmail dot com
Summary:            json_encode's option for replacing ill-formed byte sequences with substitute cha
Status:             Assigned
Type:               Feature/Change Request
Package:            JSON related
Operating System:   All
PHP Version:        5.5.0
Assigned To:        remi
Block user comment: N
Private report:     N

New Comment:

I agree with you on isolated surrogate pairs.

Test cases for json_decode with JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE must be included, since json_decode uses json_utf8_to_utf16.

https://github.com/php/php-src/blob/master/ext/json/json.c#L673

I have already posted the test cases.

https://gist.github.com/masakielastic/5973095#file-04-test-php-L26

    "a\xEF\xBF\xBD" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_SUBSTITUTE),
    "a" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_IGNORE)

One way to improve performance is to add json_utf8_to_utf32. I posted another patch for this.

https://gist.github.com/masakielastic/5973095#file-02-json_unescaped_unicode-patch

I introduced an unsigned int *utf32 variable so that the existing unsigned short *utf16 variable is left unchanged. If you want to provide a common variable for json_utf8_to_utf16 and json_utf8_to_utf32, a change to JSON_parser.c is also needed. One candidate name for that variable is unsigned int *code_codes.

http://www.unicode.org/glossary/#code_unit

I also updated the previous patch.
https://gist.github.com/masakielastic/5973095#file-01-json_unescaped_unicode-patch

 if (options & PHP_JSON_UNESCAPED_UNICODE) {
+    if (us < 0x20) {
+        smart_str_appendl(buf, "\\u", 2);
+        smart_str_appendc(buf, digits[(us >> 12) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 8) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 4) & 0xf]);
+        smart_str_appendc(buf, digits[(us & 0xf)]);
+    } else if (us < 0x80) {

Previous Comments:
------------------------------------------------------------------------
[2013-07-15 07:31:49] r...@php.net

> Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option?
> The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.

PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_IGNORE already works with my patch.

Yes, PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_SUBSTITUTE doesn't work for now, but converting to UTF-16 and then back to UTF-8 seems really... messy. We need something simpler.

Note: this bug is only for json_encode. Other issues have their own bugs for tracking (especially the json_decode one, as I don't plan to alter it).

------------------------------------------------------------------------
[2013-07-14 12:45:47] masakielastic at gmail dot com

As for JSON_NOTUTF8_IGNORE, the manual needs a security note for it, like the one for htmlspecialchars's ENT_IGNORE.

http://www.php.net/manual/en/function.htmlspecialchars.php

That's why I didn't suggest JSON_IGNORE in the draft and linked to the Escaping RFC as a resource.

UNICODE SECURITY CONSIDERATIONS
http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters

IDS11-J. Eliminate noncharacter code points before validation
https://www.securecoding.cert.org/confluence/display/java/IDS11-J.+Eliminate+noncharacter+code+points+before+validation

------------------------------------------------------------------------
[2013-07-14 12:31:29] masakielastic at gmail dot com

Hi, nikic, sorry, ignore my last comment.
I made a small change in json.c.

https://gist.github.com/masakielastic/5973095#file-02-small_refactaring-patch

------------------------------------------------------------------------
[2013-07-14 08:48:01] masakielastic at gmail dot com

For consistency with JSON_ERROR_UTF8, I propose these alternative names:

JSON_UTF8_SUBSTITUTE
JSON_UTF8_IGNORE

------------------------------------------------------------------------
[2013-07-14 08:44:02] masakielastic at gmail dot com

Hi, nikic, I posted a documentation request for the missing option and error codes.

https://bugs.php.net/bug.php?id=65259

I would like your opinion on the consistency among JSON_PARTIAL_OUTPUT_ON_ERROR, JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE.

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

    https://bugs.php.net/bug.php?id=65082
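
For readers following the thread, the difference between the two proposed behaviours (substitute versus ignore) can be sketched in standalone C. This is only an illustration and is not taken from any of the patches linked above: the names bad_utf8_mode, utf8_seq_len and sanitize_utf8 are hypothetical, and the choice to treat each offending byte individually (one U+FFFD per bad byte) is an assumption of this sketch, not something the thread specifies.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mode names for this sketch only. They mirror, but are not,
   the JSON_NOTUTF8_SUBSTITUTE / JSON_NOTUTF8_IGNORE options proposed above. */
enum bad_utf8_mode { BAD_UTF8_SUBSTITUTE, BAD_UTF8_IGNORE };

/* Return the length (1..4) of the well-formed UTF-8 sequence starting at s,
   or 0 if the bytes at s do not form one. avail is the number of bytes left. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char c = s[0];
    unsigned long cp;
    size_t len, i;

    if (c < 0x80) {
        return 1;                                   /* ASCII */
    } else if (c >= 0xC2 && c <= 0xDF) {
        len = 2; cp = c & 0x1F;
    } else if (c >= 0xE0 && c <= 0xEF) {
        len = 3; cp = c & 0x0F;
    } else if (c >= 0xF0 && c <= 0xF4) {
        len = 4; cp = c & 0x07;
    } else {
        return 0;       /* stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF */
    }

    if (avail < len) {
        return 0;                                   /* truncated sequence */
    }
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            return 0;                               /* not a continuation byte */
        }
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }
    if (len == 3 && cp < 0x800) {
        return 0;                                   /* overlong encoding */
    }
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) {
        return 0;                                   /* overlong or out of range */
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                                   /* surrogate code point */
    }
    return len;
}

/* Copy in[0..len) to out, either replacing every ill-formed byte with
   U+FFFD (EF BF BD) or dropping it. out must have room for 3 * len bytes.
   Returns the number of bytes written. */
size_t sanitize_utf8(const char *in, size_t len, enum bad_utf8_mode mode, char *out)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t i = 0, o = 0;

    while (i < len) {
        size_t n = utf8_seq_len(s + i, len - i);
        if (n > 0) {
            memcpy(out + o, in + i, n);             /* well-formed: copy as is */
            o += n;
            i += n;
        } else {
            if (mode == BAD_UTF8_SUBSTITUTE) {
                memcpy(out + o, "\xEF\xBF\xBD", 3); /* U+FFFD in UTF-8 */
                o += 3;
            }
            i += 1;                                 /* skip the offending byte */
        }
    }
    return o;
}
```

With the input bytes "a\x80" from the test cases quoted in the new comment, the substitute mode of this sketch yields "a\xEF\xBF\xBD" and the ignore mode yields "a". Silently dropping bytes is the behaviour the thread flags as needing a security note in the manual.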