Edit report at https://bugs.php.net/bug.php?id=65082&edit=1

 ID:                 65082
 User updated by:    masakielastic at gmail dot com
 Reported by:        masakielastic at gmail dot com
 Summary:            json_encode's option for replacing ill-formd byte
                     sequences with substitute cha
 Status:             Assigned
 Type:               Feature/Change Request
 Package:            JSON related
 Operating System:   All
 PHP Version:        5.5.0
 Assigned To:        remi
 Block user comment: N
 Private report:     N

 New Comment:

I created a repo for the patches and the report of benchmarks

https://github.com/masakielastic/patches/tree/master/php_bugs_65082

The difference between json_utf8_to_utf16 and json_utf8_to_utf32 isn't seen.

the use of json_utf8_to_utf32 or the direct use of php_next_utf8_char 
in json_escape_string is better choice for 
JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_SUBSTITUTE|JSON_UNESCAPED_UNICODE.

php_next_utf8_char in json_escape_string is a bit faster than
json_utf8_to_utf32 for JSON_NOTUTF8_SUBSTITUTE.

https://github.com/masakielastic/patches/blob/master/php_bugs_65082/04_php_next_
utf8_char_in_json_escape_string.patch
https://github.com/masakielastic/patches/blob/master/php_bugs_65082/04_php_next_
utf8_char_in_json_escape_string.c


Previous Comments:
------------------------------------------------------------------------
[2013-07-19 16:46:49] masakielastic at gmail dot com

Another way of perfomance improvemnet is using php_next_utf8_char directly 
in json_escape_string on the condition of PHP_JSON_NOTUTF8_SUBSTITUTE 
and PHP_JSON_NOTUTF8_IGNORE. 
This way reduces one loop compared with using json_utf8_to_utf16.

------------------------------------------------------------------------
[2013-07-19 16:33:24] masakielastic at gmail dot com

I agree with you on isolated surrogate pairs.

The test cases for json_decode and JSON_NOTUTF8_SUBSTITUTE and 
JSON_NOTUTF8_IGNORE must be contained 
since json_decode uses json_utf8_to_utf16. 

https://github.com/php/php-src/blob/master/ext/json/json.c#L673

I already posted the test cases.

https://gist.github.com/masakielastic/5973095#file-04-test-php-L26

"a\xEF\xBF\xBD" === json_decode('"'."a\x80".'"', false, 512, 
JSON_NOTUTF8_SUBSTITUTE),
"a" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_IGNORE)


The one way of perfomance improvement is adding json_utf8_to_utf32. 
I posted  another patch.

https://gist.github.com/masakielastic/5973095#file-02-json_unescaped_unicode-
patch

I created unsigned int *utf32 data type 
for not changing unsigned short *utf16 data type.

If you want to provide a common variable  
for json_utf8_to_utf16 and json_utf8_to_utf32, 
the modification for JSON_parser.c is also needed.

The one of candidate for the name of variable is 
unsigned int *code_codes.

http://www.unicode.org/glossary/#code_unit


I also updated the previous patch.
https://gist.github.com/masakielastic/5973095#file-01-json_unescaped_unicode-
patch

if (options & PHP_JSON_UNESCAPED_UNICODE) {
+    if (us < 0x20) {
+        smart_str_appendl(buf, "\\u", 2);
+        smart_str_appendc(buf, digits[(us >> 12) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 8) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 4) & 0xf]);
+        smart_str_appendc(buf, digits[(us & 0xf)]);
+    } else if (us < 0x80) {

------------------------------------------------------------------------
[2013-07-15 07:31:49] r...@php.net

> Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option?
> The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.

The PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_IGNORE already works with my 
patch.

Yes, PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_SUBSTITUTE doesn't work for now, 
but converting to utf16, then back to utf8 seems really... messy. Need 
something simpler.

Notice: this bug is only for json_encode. Other issue have their own bug for 
tracking (especially the json_decode one, as I dont plan to alter it)

------------------------------------------------------------------------
[2013-07-14 12:45:47] masakielastic at gmail dot com

As for JSON_NOTUTF8_IGNORE, the description for security is needed in the 
manual 
like htmlspecialchars's ENT_IGNORE

http://www.php.net/manual/en/function.htmlspecialchars.php

That's why I didn't sugguest JSON_IGNORE in the draft and showed Escaping RFC's 
link
as resource.

UNICODE SECURITY CONSIDERATIONS
http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters
IDS11-J. Eliminate noncharacter code points before validation
https://www.securecoding.cert.org/confluence/display/java/IDS11-
J.+Eliminate+noncharacter+code+points+before+validation

------------------------------------------------------------------------
[2013-07-14 12:31:29] masakielastic at gmail dot com

Hi, nikic, sorry, ignore my last comment.

I added small change in json.c
https://gist.github.com/masakielastic/5973095#file-02-small_refactaring-patch

------------------------------------------------------------------------


The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

    https://bugs.php.net/bug.php?id=65082


-- 
Edit this bug report at https://bugs.php.net/bug.php?id=65082&edit=1

Reply via email to