Anthony,
I'll run through some of the semantics related to your response in case it provides any insight that would encourage a direction change for this functionality. Overall, though, I agree 100% with your closing line: "I think it strongly depends upon the exact behavior of the library. If we do wind up doing transcoding as well as escaping, then that may be valid. If we don't, then it wouldn't." Now, onto the nitty-gritty.

"You hit the nail on the head here. You cannot black-list convert ISO-8859-1 to UTF-8. However, when we talk about escaping, we're talking about a context where the encoding is already correct, we're just preventing special characters from imparting special meaning. In that case, escaping is the correct way of handling it."

We can never safely assume that the encoding is correct. If the encoding of the original data is different from the assumed encoding, characters with "special meaning" may have different byte values and will be allowed through. For a simple proof-of-concept, see http://shiflett.org/blog/2005/dec/google-xss-example. That is a specific exploit, but the underlying vulnerability is that htmlentities() doesn't decode the input before trying to escape characters.

"But if you wanted to output arbitrary input into a UTF-8 document, you would also need to ensure that it's encoded properly into UTF-8. So I can see your distinction applying to that case. But from a different angle."

This is getting closer to the root of the problem.

"Escaping preserves the security context. Encoding preserves the semantic context. You could escape away all invalid UTF-8 bytes, but you'd lose the meaning of the original character set. So semantically, encoding is necessary. But from a security perspective, the encoding doesn't really matter much. What matters is the security context (not injecting harmful code, etc)."

What I'm trying to convey is that all context relevant to the operation matters. In this case, if characters are compared or replaced at the byte level, we need to decode to the byte level before performing those operations. To take that further, it's important for everyone to realize that encoding doesn't just apply to character sets; data is encoded for a specific layer. This is the same problem that the TCP/IP and ISO/OSI layer models solved decades ago; we're just adding layers above the application layer. You wouldn't expect an HTML parser to be able to parse JavaScript, because they are different encodings. If you wanted to translate an HTML implementation cleanly to a JavaScript implementation, you would have to decode the HTML and then build a translator to construct the same DOM elements in JavaScript. I know that's sort of a blurry line, but I need to wrap this up. Hopefully, I've conveyed the idea.

The sooner we all grasp this concept of encoding layers, the sooner this problem of injection/scripting at every layer goes away. The solution: decode all inputs, halt execution on decoding errors, and then re-encode them. Yes, this is going to add overhead. But where security is concerned, we have to be willing to accept some overhead.
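Something like this rough sketch is what I have in mind (the function name, the default source charset, and the exception type are just mine for illustration; this isn't a proposal for the actual API surface):

<?php
// Sketch only: validate the claimed input encoding, refuse on failure,
// transcode to the output document's encoding, then escape for HTML.
function encode_for_html($input, $inputCharset = 'ISO-8859-1')
{
    // Decode step: verify the input actually is what we claim it is,
    // and halt on decoding errors instead of letting stray bytes through.
    if (!mb_check_encoding($input, $inputCharset)) {
        throw new InvalidArgumentException("Input is not valid $inputCharset");
    }

    // Re-encode step: move the data into the encoding of the output layer.
    $utf8 = mb_convert_encoding($input, 'UTF-8', $inputCharset);

    // Escape step: state the charset explicitly so the escaper and the
    // document agree on what a "special character" is.
    return htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');
}

echo encode_for_html(isset($_GET['name']) ? $_GET['name'] : '', 'ISO-8859-1');

The particular functions don't matter; the point is that the validate/transcode step happens before the escape step, and that a decoding failure stops processing rather than degrading silently.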
Okay, with that out of the way, I'll reiterate my agreement with your statement: "I think it strongly depends upon the exact behavior of the library. If we do wind up doing transcoding as well as escaping, then that may be valid. If we don't, then it wouldn't." If the aim of this API is to really tackle the problem, we need to go beyond wrapping htmlentities() and htmlspecialchars(), and change the names to "encode". If it's just to maintain the status quo and leave it to developers who barely understand encoding or escaping to ensure that their entire stack is using the same encoding, then we should leave the name as-is.

"How is mres fundamentally flawed? And how is it discouraged? It's actually listed as a valid defense by OWASP: https://www.owasp.org/index.php/SQL_Injection_Prevention_Cheat_Sheet#Defense_Option_3:_Escaping_All_User_Supplied_Input"

The official PHP documentation discourages the use of mysql_real_escape_string: http://php.net/manual/en/function.mysql-real-escape-string.php. The recommendation is to use a library that is character-set aware, like mysqli or PDO. But note that even using mysqli_real_escape_string or PDO::quote requires you to manually set the connection-level character set. I've been operating on the assumption (there I go assuming) that PDO prepared statements were aware of the connection-level character set and mitigated this problem; however, I just reviewed PDO's source code and I'm starting to question its implementation.
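To illustrate the connection-level character-set point, here's a sketch (placeholder credentials and table names, and my own choice of PDO attributes) of how the charset has to be pinned explicitly in both extensions:

<?php
// Sketch only: both paths depend on the *connection* character set being
// set explicitly, not on whatever the server default happens to be.
$untrusted = isset($_GET['name']) ? $_GET['name'] : '';

// mysqli: the escaping function is charset-aware only after set_charset().
$mysqli = new mysqli('localhost', 'user', 'pass', 'test'); // placeholder credentials, error handling omitted
$mysqli->set_charset('utf8');
$safe = $mysqli->real_escape_string($untrusted);
$mysqli->query("SELECT * FROM users WHERE name = '$safe'");

// PDO: declare the charset in the DSN (this option is ignored before
// PHP 5.3.6) and prefer bound parameters over PDO::quote().
$pdo = new PDO(
    'mysql:host=localhost;dbname=test;charset=utf8',
    'user',
    'pass',
    array(
        PDO::ATTR_ERRMODE          => PDO::ERRMODE_EXCEPTION,
        PDO::ATTR_EMULATE_PREPARES => false,
    )
);
$stmt = $pdo->prepare('SELECT * FROM users WHERE name = ?');
$stmt->execute(array($untrusted));

Disabling emulated prepares is my own addition here; it forces the parameters to be sent to the server separately from the query text, via the binary protocol you describe below.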
As for your OWASP reference, keep in mind that OWASP makes many tiers of recommendations. Notice that manually escaping is the last option for mitigating injection problems. In any case, I'm not here to carry on an endless flame war. I just want to make sure that we're doing what's necessary to mitigate the number one vulnerability in web applications.

Thanks,

Bryan

From: Anthony Ferrara [mailto:ircmax...@gmail.com]
Sent: Tuesday, September 18, 2012 2:12 PM
To: Bryan C. Geraghty
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] RFC: Implementing a core anti-XSS escaping class

Bryan,

On Tue, Sep 18, 2012 at 2:52 PM, Bryan C. Geraghty <br...@ravensight.org> wrote:

> Anthony,
>
> I'll concede that the term "escaping" is improperly used in many places; even in the OWASP documentation. But I'll point out that the CWE document is identifying a distinction in the two terms by saying, "This overlapping usage extends to the Web, such as the "escape" JavaScript function whose purpose is stated to be encoding".

There is a distinction between them. But in this case it's not particularly relevant (as both work quite fine). I'll elaborate further in a second.

> But when you say, "With the end result being the exact same...", I don't think you've thought it through. I've read some of your stuff and I'm pretty confident that you understand the benefits of white-listing over black-listing. For the uninitiated, yes, a black-list can be configured to produce the same results at a given point-in-time, but the fundamental approach is different. A white-list operates on an explicit specification and lets nothing else through. A black-list assumes that the input data is mostly correct and it filters out the bad. To add to that, how do you convert from ISO-8859-1 to UTF-8 with a black-list or by escaping?

You hit the nail on the head here. You cannot black-list convert ISO-8859-1 to UTF-8. However, when we talk about escaping, we're talking about a context where the encoding is already correct, we're just preventing special characters from imparting special meaning. In that case, escaping is the correct way of handling it.

But if you wanted to output arbitrary input into a UTF-8 document, you would also need to ensure that it's encoded properly into UTF-8. So I can see your distinction applying to that case. But from a different angle.

Escaping preserves the security context. Encoding preserves the semantic context. You could escape away all invalid UTF-8 bytes, but you'd lose the meaning of the original character set. So semantically, encoding is necessary. But from a security perspective, the encoding doesn't really matter much. What matters is the security context (not injecting harmful code, etc).

Now, both can be handled by the same routine. But that's not necessary to preserve the security aspect. And that's why I objected to using the term "encoding" here. If we want to go that route, that's fine. But you don't need to encode for security. Escaping will handle that (possibly at the expense of invalid semantic meaning).

> Your reference to mysql_real_escape_string is exactly the point I'm trying to make. The use of that function is "discouraged" because it DID escape; it looked for specific bad characters. It was fundamentally flawed. And that is the functionality PHP developers, as you just demonstrated, will refer to. The current recommendation is to use a library that properly encodes the entire data stream.

How is mres fundamentally flawed? And how is it discouraged? It's actually listed as a valid defense by OWASP: https://www.owasp.org/index.php/SQL_Injection_Prevention_Cheat_Sheet#Defense_Option_3:_Escaping_All_User_Supplied_Input

The only 2 ways of securely getting data to MySQL are either by escaping, or binding as a parameter on a prepared statement. Neither of which encodes a data stream (the PS uses a binary format that puts the data in plain binary form, as is, with a header to identify length).

Black-listing works fine for a specified format (like XML, like HTML, like SQL, like JavaScript). Where you get in trouble with black-lists is when your data format isn't specified (hence edge-cases aren't well known) or when you're not serializing to a format (generic input black-lists). But for escaping output, black-lists are a very well known, well understood, and easily implemented approach.

> I'll also agree that consistency with the industry is not as important because there seem to be plenty of misuses. However, I do think that we should use terminology that sets the functionality apart. So, given the operating mode difference and the precedent set by mysql_escape_string, mysql_real_escape_string, etc., I think "encode" is the way to go.

I think it strongly depends upon the exact behavior of the library. If we do wind up doing transcoding as well as escaping, then that may be valid. If we don't, then it wouldn't.

But I think we can both agree on the need...

Anthony