htmlspecialchars(), htmlentities(), html_entity_decode() and get_html_translation_table() all take an encoding parameter that used to default to ISO-8859-1. We changed the default in PHP 5.4 to UTF-8. This is a much more sensible default and, in the case of the encoding functions, more secure, as it prevents invalid UTF-8 from getting through. If you use 8859-1 as the default but your app is actually in UTF-8 or, worse, some encoding that isn't low-ASCII compatible, then htmlspecialchars()/htmlentities() aren't doing what you think they are and you have a glaring security hole in your app.
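As a rough illustration of the point above, here is a minimal sketch of passing the encoding explicitly and of what happens when the declared encoding does not match the data. The sample strings and variable names are made up for the example; the empty-string result assumes neither ENT_IGNORE nor ENT_SUBSTITUTE is passed.

```php
<?php
// "é" is the single byte 0xE9 in ISO-8859-1 and the two bytes 0xC3 0xA9 in UTF-8.
// A lone 0xE9 followed by a space is not a valid UTF-8 sequence.
$latin1 = "caf\xE9 <b>";
$utf8   = "caf\xC3\xA9 <b>";

// Declared encoding matches the data: markup is escaped, text is preserved.
$ok_latin1 = htmlspecialchars($latin1, ENT_QUOTES, 'ISO-8859-1');
$ok_utf8   = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

// Mismatch: ISO-8859-1 bytes declared as UTF-8. With these flags the
// function refuses the input and returns an empty string instead of
// passing the invalid sequence through.
$rejected = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');

var_dump($ok_utf8 === "caf\xC3\xA9 &lt;b&gt;");
var_dump($rejected === '');
```

An empty result is awkward for the page, but it is the safe direction to fail in: nothing unescaped reaches the output.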
However, people are understandably lazy and don't want to think about this stuff. They don't want to explicitly provide their input encoding to these calls. We provided a solution to this, and a way to write portable apps: pass in an empty string "" as the encoding. If we saw this, we would set the input encoding to match the output encoding specified by the "default_charset" ini setting.

We couldn't just default to default_charset, because input and output encodings may very well be different and we would risk making existing apps insecure. For example, an app using BIG5/CJK for its output encoding might very well be pulling data from 8859/UTF-8 data sources, and if we invisibly switched htmlspecialchars/entities to match the output encoding we would have problems. Invisibly switching them from 8859-1 to UTF-8 could still be problematic, but at least it fails safe, in that it doesn't let invalid UTF-8 through and encodes low-ASCII the same way it did before.

The problem is that there is a lot of legacy code out there that doesn't explicitly set the encoding on those calls, and it is a lot of work to go through and specify it on each call. I still personally prefer to have people be explicit here, but I think it is slowing 5.4 adoption (see bug 61354).

In PHP 6 we tried to introduce separate input, script and output encoding settings. Currently in 5.4 we don't have that, but we have those 3 separately for mbstring and for iconv:

  iconv.input_encoding
  iconv.internal_encoding
  iconv.output_encoding
  mbstring.http_input
  mbstring.internal_encoding
  mbstring.http_output

Ideally we should be getting rid of the per-feature encoding settings and have a single set of them that we refer to when we need them. This is one of those places where we really need a default input encoding setting. We could have it check mbstring.http_input, but there is a wrinkle here: it has a fancy "auto" setting which we don't really want in this case.
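To make the "auto" wrinkle concrete, here is a small userland sketch. pick_input_encoding() is a hypothetical helper invented for this example, not something in PHP; it only accepts a configured value that names one concrete charset, and the UTF-8 fallback is an assumption that mirrors the 5.4 default.

```php
<?php
// Hypothetical helper (not part of PHP): decide which encoding to hand to
// htmlspecialchars() based on a configured value.
function pick_input_encoding($configured, $fallback = 'UTF-8')
{
    // ini_get() returns false when the setting does not exist,
    // which the cast turns into an empty string.
    $value = trim((string) $configured);

    // "auto" and "pass" are mbstring conveniences, and a comma-separated
    // list names several candidates. None of these is a single charset.
    if ($value === ''
        || strcasecmp($value, 'auto') === 0
        || strcasecmp($value, 'pass') === 0
        || strpos($value, ',') !== false) {
        return $fallback;
    }

    return $value;
}

$encoding = pick_input_encoding(ini_get('mbstring.http_input'));
echo htmlspecialchars("<b>caf\xC3\xA9</b>", ENT_QUOTES, $encoding), "\n";
```

Note that this only filters out values that are obviously not a charset name; it does not check that htmlspecialchars() actually supports the encoding it ends up with.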
So we could set it to iconv.input_encoding, but that seems rather random and unintuitive. So do we create a new default_input_encoding ini directive mid-stream in 5.4 for this? Of course, this would be with the longer term in mind: it would become part of a unified set of encoding settings in 5.5 and beyond.

-Rasmus