Hi,

I'm +1 for having internal/input/output/script encoding settings at the PHP or Zend level.
If the default is the problem, we should set the default_charset default
to UTF-8 and use it as the default for internal/input/output/script
encoding and for functions that are affected by encoding.

When the XSS advisory was released in Feb. 2000, it stated that the
encoding MUST be specified in the HTTP response header. Setting
default_charset is the best practice from a security perspective anyway.

If we use default_charset as the default encoding, the transition to 5.4
might be easier.

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net

2012/8/24 Rasmus Lerdorf <ras...@lerdorf.com>:
> htmlspecialchars(), htmlentities(), html_entity_decode() and
> get_html_translation_table() all take an encoding parameter that used to
> default to iso-8859-1. We changed the default in PHP 5.4 to UTF-8. This
> is a much more sensible default and in the case of the encoding
> functions more secure as it prevents invalid UTF-8 from getting through.
> If you use 8859-1 as the default but your app is actually in UTF-8 or
> worse, some encoding that isn't low-ascii compatible then
> htmlspecialchars()/htmlentities() aren't doing what you think they are
> and you have a glaring security hole in your app.
>
> However, people are understandably lazy and don't want to think about
> this stuff. They don't want to explicitly provide their input encoding
> to these calls. We provided a solution to this and a way to write
> portable apps and that was to pass in an empty string "" as the
> encoding. If we saw this we would set the input encoding to match the
> output encoding specified by the "default_charset" ini setting. We
> couldn't just default to this default_charset because input and output
> encodings may very well be different and we would risk making existing
> apps insecure. For example an app using BIG5/CJK for its output encoding
> might very well be pulling data from 8859/UTF-8 data sources and if we
> invisibly switched htmlspecialchars/entities to match their output
> encoding we would have problems. Invisibly switching them from 8859-1 to
> UTF-8 could still be problematic, but at least it fails safe in that
> it doesn't let invalid UTF-8 through and encodes low-ascii the same way
> it did before.
>
> The problem is that there is a lot of legacy code out there that doesn't
> explicitly set the encoding on those calls and it is a lot of work to go
> through and specify it on each call. I still personally prefer to have
> people be explicit here, but I think it is slowing 5.4 adoption (see bug
> 61354).
>
> In PHP 6 we tried to introduce separate input, script and output
> encoding settings. Currently in 5.4 we don't have that, but we have
> those 3 separately for mbstring and for iconv:
>
> iconv.input_encoding
> iconv.internal_encoding
> iconv.output_encoding
> mbstring.http_input
> mbstring.internal_encoding
> mbstring.http_output
>
> Ideally we should be getting rid of the per-feature encoding settings
> and have a single set of them that we refer to when we need them. This
> is one of these places where we really need a default input encoding
> setting. We could have it check mbstring.http_input, but there is a
> wrinkle here that it has a fancy "auto" setting which we don't really
> want in this case. So we could set it to iconv.input_encoding, but that
> seems rather random and unintuitive.
>
> So do we create a new default_input_encoding ini directive mid-stream in
> 5.4 for this? Of course with the longer-term in mind that this will be
> part of a unified set of encoding settings in 5.5 and beyond.
>
> -Rasmus
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: http://www.php.net/unsub.php
>
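
For illustration only, here is a minimal sketch of the "fails safe" behaviour described in the quoted message. The helper name esc() and the sample byte string are assumptions made up for this example and are not part of the thread or of PHP itself. The flags are passed explicitly so the result does not depend on the default flags of the PHP version in use.

```php
<?php
// Hypothetical helper (not from the thread): escape with an explicit
// encoding so the result does not depend on the built-in default.
function esc($text, $encoding = 'UTF-8')
{
    return htmlspecialchars($text, ENT_QUOTES, $encoding);
}

// 0xC3 starts a two-byte UTF-8 sequence, but 0x28 "(" is not a valid
// continuation byte, so this string is malformed UTF-8.
$invalid = "\xC3\x28<b>";

// Read as ISO-8859-1, every byte is a valid character, so the stray
// byte passes through and only the markup characters are escaped.
$latin1 = esc($invalid, 'ISO-8859-1');

// Read as UTF-8, the malformed sequence is rejected and an empty
// string is returned (no ENT_IGNORE or ENT_SUBSTITUTE flag is set).
$utf8 = esc($invalid, 'UTF-8');

var_dump($latin1 === "\xC3\x28&lt;b&gt;");
var_dump($utf8 === '');
```

Legacy code that omits the third argument gets whichever default the running PHP version applies, which is the portability concern raised above. Passing the encoding explicitly, as this sketch does, is one way to avoid relying on that default.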