Bryan,

> “You hit the nail on the head here. You cannot black-list convert ISO-8859-1 to UTF-8. However, when we talk about escaping, we're talking about a context where the encoding is already correct, we're just preventing special characters from imparting special meaning. In that case, escaping is the correct way of handling it.”
>
> We can never safely assume that the encoding is correct. If the encoding of the original data is different than the assumed encoding, characters with “special meaning” may have different values and will be allowed through. For a simple proof-of-concept, see http://shiflett.org/blog/2005/dec/google-xss-example. Now, that is a specific exploit for an underlying vulnerability. The vulnerability is the fact that htmlentities() doesn’t decode the input before trying to escape characters.

Actually, in my mind, that's the role of filtering. You should filter input to the proper charset. Everything inside of the application should have a consistent character set. And if that's the case, these sorts of vulnerabilities (not to mention a whole host of possible bugs) are no longer possible...

> What I’m trying to convey is that all context relevant to the operation matters. In this case, if characters are compared/replaced at the byte-level, we need to decode to the byte-level before performing those operations. To take that further, it’s important for everyone to realize that encoding doesn’t just apply to character sets; data is encoded for a specific layer. This is the same problem that the TCP and ISO layers solved decades ago; we’re just adding layers above the application layer. You wouldn’t expect an HTML parser to be able to parse JavaScript because they are different encodings. If you wanted to translate an HTML implementation cleanly to a JavaScript implementation, you would have to decode the HTML and then build a translator to build the same DOM elements in JavaScript. I know that’s sort of a blurry line, but I need to wrap this up. Hopefully, I’ve conveyed the idea.
>
> The sooner we all grasp this concept of encoding layers, the sooner this problem of injection/scripting at every layer goes away. The solution: Decode all inputs, halt execution on decoding errors, and then re-encode them. Yes, this is going to add overhead. But where security is concerned, we have to be willing to accept some overhead.

Again, that's the role of filtering. Inputs should never get to a presentation layer unfiltered. That's a bigger problem that needs to be addressed first. But I would concede that it's worth doing again at output to catch any issues. The issues it catches, though, should be seen as application bugs and not as a caught attack vector...

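To make the "decode, halt on errors, re-encode" idea concrete, here is a minimal sketch of filtering on input and escaping on output. It is written in Python only for brevity, not as a proposal for the PHP API under discussion; the names `filter_input`, `escape_for_html`, `render`, and `EncodingError` are hypothetical, and UTF-8 as the application-wide charset is an assumption.

```python
import html


class EncodingError(ValueError):
    """Raised when input bytes are not valid in the expected charset."""


def filter_input(raw: bytes, charset: str = "utf-8") -> str:
    """Filtering step: decode untrusted bytes strictly, halting on any error."""
    try:
        return raw.decode(charset, errors="strict")
    except UnicodeDecodeError as exc:
        raise EncodingError(f"input is not valid {charset}") from exc


def escape_for_html(text: str) -> str:
    """Escaping step: operates on decoded text, never on raw bytes."""
    return html.escape(text, quote=True)


def render(raw: bytes) -> bytes:
    """Filter on the way in, escape on the way out, re-encode as UTF-8."""
    return escape_for_html(filter_input(raw)).encode("utf-8")


# Why the assumed encoding matters: these bytes contain no "<" or ">" when
# read as ASCII, so escaping leaves them untouched, yet a consumer that
# reads them as UTF-7 sees markup.
payload = b"+ADw-script+AD4-"
escaped_as_ascii = html.escape(payload.decode("ascii"))
seen_as_utf7 = payload.decode("utf-7")
```

With this sketch, `render(b"<b>")` yields `b"&lt;b&gt;"`, while a lone ISO-8859-1 byte such as `b"caf\xe9"` is rejected at the filtering step instead of being passed through to the escaper. The UTF-7 payload shows the other half of the argument: escaping alone is only sound when the producer and the consumer agree on the charset.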
> Okay, with that out of the way, I’ll reiterate my agreement with your statement, “I think it strongly depends upon the exact behavior of the library. If we do wind up doing transcoding as well as escaping, then that may be valid. If we don't, then it wouldn't.”
>
> If the aim of this API is to really tackle the problem, we need to go beyond wrapping htmlentities() and htmlspecialchars() and change the names to “encode”. If it’s just to maintain the status quo and leave it to developers who barely understand encoding or escaping to ensure that their entire stack is using the same encoding, then we should leave the name as-is.

Just wrapping any library is often not a good idea. We'd need to add meaningful logic in addition to the namespace name change. So yes, I'm in favor of doing it right at that point...

> The official PHP documentation discourages the use of mysql_real_escape_string: http://php.net/manual/en/function.mysql-real-escape-string.php. The recommendation is to use a library that is character-set aware, like mysqli or PDO. But note that even using mysqli_real_escape_string or PDO::quote requires you to manually set the connection-level character-set. I’ve been operating on the assumption (there I go assuming) that PDO prepared statements were aware of the connection-level character set and mitigated this problem; however, I just reviewed PDO’s source code and I’m starting to question its implementation. As for your OWASP reference, keep in mind that OWASP makes many tiers of recommendations. Notice that manually escaping is the last option for mitigating injection problems.

In short, that's wrong (MRES is encouraged). But I've taken the reply off-list as it's off topic here.

> In any case, I’m not here to carry on an endless flame war. I just want to make sure that we’re doing what’s necessary to mitigate the number one vulnerability in web applications.

I don't think this discussion is a flame war. I think it's a very good and constructive point that needs to be made. It's at least a whole lot more important and relevant than the last 40 posts on OOP vs Procedural names...

Anthony