On 2013-06-28 4:10, Kris Craig wrote:
On Thu, Jun 27, 2013 at 6:43 PM, Yasuo Ohgaki <yohg...@ohgaki.net> wrote:

2013/6/27 Kris Craig <kris.cr...@gmail.com>

Yeah I tried html_entity_decode already, but it just returned NULL. On the same input string, htmlspecialchars_decode returned the input string but with *some* special characters decoded; 10 and 13 ("\r\n", I think) were left in their encoded state. I'm not sure why there wouldn't be an option to decode all html special characters.


You are missing the design purpose of htmlspecialchars_decode and html_entity_decode. Truth is, they are not as useful as they might seem. Their purpose is not to decode all entities the way a browser would. We do not implement anything approaching the sort of parsing a browser does; for instance, HTML5 says you should accept certain entities not terminated with ; and parse the stream in a certain way, and we don't do that at all. The purpose of those two functions is just to provide something approaching an inverse function for htmlspecialchars() and htmlentities(). html_entity_decode() has somewhat deviated from this (for instance, it decodes all numeric entities), but I think this is nevertheless the proper way to think about those two functions.
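
To make the "inverse function" view concrete, here is a minimal sketch. The sample strings are made up for illustration, and the flags and charset are passed explicitly so the result does not depend on version-specific defaults:

```php
<?php
// 1. Round trip: htmlspecialchars_decode() undoes htmlspecialchars().
$original = '<a href="x">Tom & Jerry\'s</a>';
$encoded  = htmlspecialchars($original, ENT_QUOTES, 'UTF-8');
$decoded  = htmlspecialchars_decode($encoded, ENT_QUOTES);
var_dump($decoded === $original); // bool(true)

// 2. An entity that htmlspecialchars() never produces is left alone by
//    htmlspecialchars_decode(), but html_entity_decode() decodes it.
$input       = 'caf&#233; &lt;b&gt;';
$specialOnly = htmlspecialchars_decode($input, ENT_QUOTES);
$allEntities = html_entity_decode($input, ENT_QUOTES, 'UTF-8');
var_dump($specialOnly === 'caf&#233; <b>');    // bool(true)
var_dump($allEntities === "caf\xC3\xA9 <b>"); // bool(true), UTF-8 bytes for e-acute

// 3. No browser-style leniency: an entity missing its trailing ';'
//    is the kind of input these functions are not designed to handle.
var_dump(html_entity_decode('&lt', ENT_QUOTES, 'UTF-8'));
```

Case 2 is the same effect as the &#13;&#10; observation above: htmlspecialchars_decode() only reverses the handful of characters that htmlspecialchars() encodes.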


Not only HTML entities; we really need to add several decoders/encoders to
core.
For instance, JavaScript \uXXXX, HTML &#XX/&#XXXX, etc.
I hope someone is working on it :)


Would you be interested in co-authoring an RFC with me for this?


See http://php.net/manual/en/transliterator.transliterate.php (Transliterator::transliterate). For HTML entities, out of the box, only a transliterator for numeric entities is provided (hex-any/XML10), but you can easily build your own ruleset for the named entities. The performance will be below that of a dedicated algorithm, though, and it only supports UTF-8.
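
A rough sketch of what that looks like, assuming the intl extension is loaded and that the bundled ICU ships the Hex-Any/XML10 and Hex-Any/Java transforms (the second ID is my assumption for the JavaScript-style \uXXXX case mentioned earlier in the thread; the input strings are made up):

```php
<?php
// Requires ext/intl. Input and output are UTF-8.
if (!function_exists('transliterator_transliterate')) {
    fwrite(STDERR, "intl extension not available\n");
    exit(0);
}

// Decimal numeric character references (&#NNN;) to characters.
$fromXml = transliterator_transliterate('Hex-Any/XML10', 'caf&#233;');

// \uXXXX escapes to characters (assumed transform ID, see above).
$fromJava = transliterator_transliterate('Hex-Any/Java', 'caf\u00E9');

// Both calls return false on failure, e.g. for an unknown transform ID.
var_dump($fromXml, $fromJava);
```

Named entities such as &eacute; are not covered by either transform; they would need a custom ruleset, which is not shown here.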

--
Gustavo Lopes

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
