On Sat, Dec 26, 2020 at 12:03 PM Craig Francis <cr...@craigfrancis.co.uk> wrote:
> Could htmlspecialchars() use ENT_QUOTES by default?
> [...]
> I'd also be tempted to suggest ENT_SUBSTITUTE should be included, as I prefer
> to keep as much of the valid data (rather than losing everything), but that's
> not as important as escaping the apostrophe by default.

On Thu, 7 Jan 2021 at 09:00, Claude Pache <claude.pa...@gmail.com> wrote:
> For ENT_SUBSTITUTE, there has been https://bugs.php.net/bug.php?id=69450, but
> I don’t understand the objection in that bug report. Maybe there is some
> issue related to non-Unicode multibyte encodings?

On Thu, 7 Jan 2021 at 09:29, Tomas Kuliavas <to...@users.sourceforge.net> wrote:
> Only ISO-2022 encodings got bytes that can match symbols sanitized by
> htmlspecialchars.
>
> Bug objection insist that utf-8 parsing rules should be enacted by sanitizing
> function and not by application which displays text. And PHP code is enacting
> those rules in most unfriendly API way.

Does anyone have an example where ENT_SUBSTITUTE could be used to create an issue? Ideally a security issue, but anything will do.

With `htmlspecialchars($user_value)`, I don't think it would matter if it ended with <multibyte start byte>, like the example from Rasmus (0xE0), because that end byte would be replaced by U+FFFD.

With `htmlspecialchars($user_value . $system_value)`, if $user_value ends with <multibyte start byte>, it's possible some characters at the beginning of $system_value could be replaced. But I can't find a way to do that with UTF-8; and even if it was possible, I would have thought some characters being replaced by U+FFFD would be a much better solution than everything being lost (noting that $system_value will not contain any HTML characters, because they are escaped as well).

    echo '<p>' . htmlspecialchars($user . ' is lying to you') . '</p>';

    With ENT_SUBSTITUTE:    '<p>ABC�s lying to you</p>'
    Without ENT_SUBSTITUTE: '<p></p>'

And, in both cases, the output is valid UTF-8, and shouldn't affect anything it's concatenated with (i.e. the HTML context).

Personally, I think every part of our processes (input, processing, and output) should do its best to handle encoding issues (in case something is missed). I believe ENT_SUBSTITUTE is the best way to deal with it during output.

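For anyone who wants to try the comparison above, here is a minimal sketch. It assumes $user is "ABC" followed by a lone 0xE0 byte; the variable names and the explicit flag lists are only illustrative, and the exact bytes replaced by U+FFFD may differ between PHP versions, so it checks properties of the output rather than a fixed string.

```php
<?php
// Assumed input: "ABC" followed by a lone UTF-8 lead byte (0xE0),
// in the spirit of the example from Rasmus. Hypothetical value.
$user = "ABC\xE0";
$text = $user . ' is lying to you';

// Flags are passed explicitly so the result does not depend on the
// default flags of the PHP version in use.
$without = htmlspecialchars($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');
$with    = htmlspecialchars($text, ENT_QUOTES | ENT_HTML401 | ENT_SUBSTITUTE, 'UTF-8');

// Without ENT_SUBSTITUTE the whole string is lost.
var_dump($without === '');

// With ENT_SUBSTITUTE the invalid byte is replaced by U+FFFD and the
// valid text around it is kept.
var_dump(strpos($with, "\u{FFFD}") !== false);
var_dump(strpos($with, 'ABC') === 0);

// The result is well-formed UTF-8: preg_match() with the /u modifier
// returns false (not 1) when the subject is invalid UTF-8.
var_dump(preg_match('//u', $with) === 1);

echo '<p>' . $with . '</p>', "\n";
```

All four var_dump() calls should print bool(true), which matches the "valid UTF-8 in both cases" point: an empty string and a string containing U+FFFD are both safe to concatenate into the surrounding HTML.
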
I don't think it's realistic to expect every single PHP developer to check for invalid characters in every single bit of input.

That said (and just to make things even more complicated), considering this is HTML encoding, we could go even further and add ENT_DISALLOWED. As Hans noted, some characters, such as 0x01, can be seen as valid in general, but not valid for HTML (where Text Nodes "must not contain control characters other than space characters").

All browsers seem to handle these control characters (by ignoring them), so I'm not too worried, but if it makes things safer, why not?

Craig

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php