htmlspecialchars(), htmlentities(), html_entity_decode() and
get_html_translation_table() all take an encoding parameter that used to
default to ISO-8859-1. We changed the default in PHP 5.4 to UTF-8. This
is a much more sensible default and, in the case of the encoding
functions, more secure, as it prevents invalid UTF-8 from getting through.
If you use 8859-1 as the default but your app is actually in UTF-8 or,
worse, some encoding that isn't low-ASCII compatible, then
htmlspecialchars()/htmlentities() aren't doing what you think they are
and you have a glaring security hole in your app.
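
To make the difference concrete, here is a minimal sketch, assuming PHP
5.4 or later and flags passed explicitly without ENT_IGNORE or
ENT_SUBSTITUTE. The sample string and the helper name escape_as() are
made up for illustration.

```php
<?php
// Hypothetical helper: escape $text while declaring its encoding explicitly.
function escape_as($text, $encoding)
{
    // ENT_QUOTES only: no ENT_IGNORE/ENT_SUBSTITUTE, so invalid input is rejected.
    return htmlspecialchars($text, ENT_QUOTES, $encoding);
}

// 0xE9 is "é" in ISO-8859-1, but a lone 0xE9 is not valid UTF-8.
$input = "<b>caf\xE9</b>";

// Read as ISO-8859-1, every byte is a valid character, so 0xE9 passes through.
$as_latin1 = escape_as($input, 'ISO-8859-1');

// Read as UTF-8, the string is malformed, so an empty string comes back
// instead of the bad byte sequence being let through.
$as_utf8 = escape_as($input, 'UTF-8');

var_dump($as_latin1 === "&lt;b&gt;caf\xE9&lt;/b&gt;"); // bool(true)
var_dump($as_utf8 === '');                             // bool(true)
```

The same bytes are accepted under one declared encoding and rejected
under the other, which is why the declared encoding has to match what
the app actually uses.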

However, people are understandably lazy and don't want to think about
this stuff. They don't want to explicitly provide their input encoding
to these calls. We provided a solution to this, and a way to write
portable apps: pass in an empty string "" as the encoding. If we saw
this we would set the input encoding to match the output encoding
specified by the "default_charset" ini setting. We
couldn't just default to this default_charset because input and output
encodings may very well be different and we would risk making existing
apps insecure. For example, an app using BIG5/CJK for its output encoding
might very well be pulling data from 8859/UTF-8 data sources, and if we
invisibly switched htmlspecialchars()/htmlentities() to match its output
encoding we would have problems. Invisibly switching them from 8859-1 to
UTF-8 could still be problematic, but at least it fails safe in that
it doesn't let invalid UTF-8 through and encodes low-ASCII the same way
it did before.
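
The empty-string convention can be sketched roughly as follows. This
assumes default_charset can be changed with ini_set() and that nothing
else overrides it (later PHP versions, for instance, also consult an
internal_encoding setting if one is configured). escape_portable() is a
hypothetical name.

```php
<?php
// Hypothetical wrapper: the empty string asks PHP to pick the encoding
// from its configuration instead of hard-coding one at the call site.
function escape_portable($text)
{
    return htmlspecialchars($text, ENT_QUOTES, '');
}

$input = "caf\xE9 <b>";

// With an ISO-8859-1 output charset, the input is read as ISO-8859-1.
ini_set('default_charset', 'ISO-8859-1');
$latin1_result = escape_portable($input);

// With a UTF-8 output charset, the same bytes are read as UTF-8,
// where a lone 0xE9 is malformed.
ini_set('default_charset', 'UTF-8');
$utf8_result = escape_portable($input);

var_dump($latin1_result, $utf8_result);
```

The call site never names an encoding, so the result changes with the
configuration. That is the portability being described, and also the
reason it cannot safely be the silent default.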

The problem is that there is a lot of legacy code out there that doesn't
explicitly set the encoding on those calls and it is a lot of work to go
through and specify it on each call. I still personally prefer to have
people be explicit here, but I think it is slowing 5.4 adoption (see bug
61354).

In PHP 6 we tried to introduce separate input, script and output
encoding settings. Currently in 5.4 we don't have that, but we do have
those three settings separately for mbstring and for iconv:

iconv.input_encoding
iconv.internal_encoding
iconv.output_encoding
mbstring.http_input
mbstring.internal_encoding
mbstring.http_output
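
To see what these are set to on a given install, something like the
following sketch should do. The function name encoding_settings() is
made up; ini_get() returns false for any directive that is not
registered, for example when the extension is not loaded.

```php
<?php
// Collect the current value of each per-extension encoding directive.
function encoding_settings()
{
    $names = array(
        'iconv.input_encoding',
        'iconv.internal_encoding',
        'iconv.output_encoding',
        'mbstring.http_input',
        'mbstring.internal_encoding',
        'mbstring.http_output',
    );

    $values = array();
    foreach ($names as $name) {
        // false means the directive does not exist in this build.
        $values[$name] = ini_get($name);
    }
    return $values;
}

foreach (encoding_settings() as $name => $value) {
    echo $name, ' = ', var_export($value, true), "\n";
}
```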

Ideally we should be getting rid of the per-feature encoding settings
and have a single set of them that we refer to when we need them. This
is one of those places where we really need a default input encoding
setting. We could have it check mbstring.http_input, but there is a
wrinkle: that setting has a fancy "auto" value which we don't really
want in this case. We could instead use iconv.input_encoding, but that
seems rather random and unintuitive.

So do we create a new default_input_encoding ini directive mid-stream in
5.4 for this? With the longer-term plan in mind, of course, that it
becomes part of a unified set of encoding settings in 5.5 and beyond.
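
As a rough sketch of how such a lookup might behave: the order of
precedence here is an assumption, not something that has been decided,
and default_input_encoding is only the proposed name, not an existing
directive. For that reason the settings are passed in as an array rather
than read with ini_get(). resolve_input_encoding() is a hypothetical
name.

```php
<?php
// Decide which input encoding an escaping call would use.
// $explicit is the encoding argument, or null if the caller passed none.
// $settings stands in for ini values; 'default_input_encoding' is hypothetical.
function resolve_input_encoding($explicit, array $settings)
{
    if ($explicit === '') {
        // Existing convention: empty string means "match default_charset".
        return $settings['default_charset'];
    }
    if ($explicit !== null) {
        // An encoding named on the call always wins.
        return $explicit;
    }
    if (!empty($settings['default_input_encoding'])) {
        // Proposed directive: used only when the call names nothing.
        return $settings['default_input_encoding'];
    }
    // Nothing configured: keep the 5.4 built-in default.
    return 'UTF-8';
}

$settings = array(
    'default_charset'        => 'BIG5',
    'default_input_encoding' => 'ISO-8859-1',
);

echo resolve_input_encoding(null, $settings), "\n";    // ISO-8859-1
echo resolve_input_encoding('', $settings), "\n";      // BIG5
echo resolve_input_encoding('UTF-8', $settings), "\n"; // UTF-8
```

Legacy calls that pass no encoding would pick up the configured input
encoding without being edited, while explicit calls and the ""
convention keep their current meaning.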

-Rasmus

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
