Anthony,

 

I'll run through some of the semantics related to your response in case it
provides any insight that would encourage a direction change for this
functionality. But overall, I agree 100% with your closing line, "I think it
strongly depends upon the exact behavior of the library. If we do wind up
doing transcoding as well as escaping, then that may be valid. If we don't,
then it wouldn't."

 

Now, onto the nitty-gritty.

 

"You hit the nail on the head here. You cannot black-list convert ISO-8859-1
to UTF-8. However, when we talk about escaping, we're talking about a
context where the encoding is already correct, we're just preventing special
characters from imparting special meaning. In that case, escaping is the
correct way of handling it."

 

We can never safely assume that the encoding is correct. If the encoding of
the original data is different from the assumed encoding, characters with
"special meaning" may have different values and will be allowed through. For
a simple proof-of-concept, see
http://shiflett.org/blog/2005/dec/google-xss-example. Now, that is a
specific exploit for an underlying vulnerability. The vulnerability is the
fact that htmlentities() doesn't decode the input before trying to escape
characters.
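To make that concrete, here's a small Python sketch of the same class of bug (html.escape standing in for htmlentities(), and UTF-7 as the mismatched charset from the Google example -- the payload and charsets are illustrative, not taken from the original report):

```python
import html
import codecs

# Hypothetical attacker payload: UTF-7 encoding of "<script>alert(1)</script>".
payload = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# Escaping under the assumed charset finds no special characters at all,
# so the payload passes through completely untouched:
assert html.escape(payload) == payload

# But a consumer that interprets the bytes as UTF-7 sees live markup:
decoded = codecs.decode(payload.encode("ascii"), "utf-7")
assert decoded == "<script>alert(1)</script>"
```

The escaping routine did exactly what it was asked to do; it just asked the wrong question, because it never decoded the input first.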

 

"But if you wanted to output arbitrary input into a UTF-8 document, you
would also need to ensure that it's encoded properly into UTF-8. So I can
see your distinction applying to that case. But from a different angle."

 

This is getting closer to the root of the problem.

 

"Escaping preserves the security context. Encoding preserves the semantic
context. You could escape away all invalid UTF-8 bytes, but you'd lose the
meaning of the original character set. So semantically, encoding is
necessary. But from a security perspective, the encoding doesn't really
matter much. What matters is the security context (not injecting harmful
code, etc)."

 

What I'm trying to convey is that all context relevant to the operation
matters. In this case, if characters are compared or replaced at the
byte level, we need to decode to the byte level before performing those
operations. To take that further, it's important for everyone to realize
that encoding doesn't just apply to character sets; data is encoded for a
specific layer. This is the same problem that the TCP/IP and OSI layer
models solved decades ago; we're just adding layers above the application
layer. You wouldn't expect an HTML parser to be able to parse JavaScript,
because they are different encodings. If you wanted to translate an HTML
implementation cleanly to a JavaScript implementation, you would have to
decode the HTML and then build a translator to construct the same DOM
elements in JavaScript. I know that's sort of a blurry line, but I need to
wrap this up. Hopefully, I've conveyed the idea.
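As a sketch of what I mean by layers (Python purely for illustration, with json.dumps standing in for a JavaScript-string encoder): each layer is encoded on the way out, and must be decoded in the reverse order, one layer at a time, on the way in.

```python
import html
import json

value = '</script><b>&"'

# Encode the innermost layer first (a JavaScript string literal),
# then the outer layer (the surrounding HTML document):
js_literal = json.dumps(value)
page_fragment = html.escape(js_literal)

# A consumer must peel the layers off in reverse order, fully
# decoding each one before touching the next:
assert json.loads(html.unescape(page_fragment)) == value
```

Skip a layer, or decode in the wrong order, and you're operating on data that is still encoded for some other layer -- which is where injections live.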

 

The sooner we all grasp this concept of encoding layers, the sooner this
problem of injection/scripting at every layer goes away. The solution:
decode all inputs, halt execution on decoding errors, and then re-encode
them. Yes, this is going to add overhead. But where security is concerned,
we have to be willing to accept some overhead.
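In Python terms, a minimal sketch of that decode / halt-on-error / re-encode pipeline (the charsets here are just examples):

```python
# ISO-8859-1 bytes for "café" arriving where UTF-8 is assumed:
raw = b"caf\xe9"

# Step 1: decode strictly -- invalid input halts here instead of
# silently flowing downstream. (Falling back to another charset is
# only safe when you actually know the source charset; otherwise
# reject the input outright.)
try:
    text = raw.decode("utf-8")  # strict error handling by default
except UnicodeDecodeError:
    text = raw.decode("iso-8859-1")

# Step 2: re-encode into the single charset the rest of the stack uses:
normalized = text.encode("utf-8")
assert normalized == b"caf\xc3\xa9"
```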

 

Okay, with that out of the way, I'll reiterate my agreement with your
statement, "I think it strongly depends upon the exact behavior of the
library. If we do wind up doing transcoding as well as escaping, then that
may be valid. If we don't, then it wouldn't."

 

If the aim of this API is to really tackle the problem, we need to go beyond
wrapping htmlentities() and htmlspecialchars() and change the names to
"encode". If it's just to maintain the status quo and leave it to developers
who barely understand encoding or escaping to ensure that their entire stack
is using the same encoding, then we should leave the name as-is.

 

"How is mres fundamentally flawed? And how is it discouraged? It's actually
listed as a valid defense by OWASP:
https://www.owasp.org/index.php/SQL_Injection_Prevention_Cheat_Sheet#Defense_Option_3:_Escaping_All_User_Supplied_Input"

 

The official PHP documentation discourages the use of
mysql_real_escape_string:
http://php.net/manual/en/function.mysql-real-escape-string.php. The
recommendation is to use a library that is character-set aware, like mysqli
or PDO. But note that even using mysqli_real_escape_string or PDO::quote
requires you to manually set the connection-level character set. I've been
operating on the assumption (there I go assuming) that PDO prepared
statements were aware of the connection-level character set and mitigated
this problem; however, I just reviewed PDO's source code and I'm starting to
question its implementation. As for your OWASP reference, keep in mind that
OWASP makes many tiers of recommendations. Notice that manual escaping is
the last option for mitigating injection problems.
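For contrast, here's what taking escaping out of the developer's hands looks like with bound parameters (Python's sqlite3 used as a neutral stand-in for PDO/mysqli prepared statements):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT)")

hostile = "x'); DROP TABLE users; --"

# The value travels outside the SQL grammar entirely; there is no
# escaping step for the developer (or the charset) to get wrong:
con.execute("INSERT INTO users (name) VALUES (?)", (hostile,))

stored = con.execute("SELECT name FROM users").fetchone()[0]
assert stored == hostile  # stored verbatim, never interpreted as SQL
```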

 

In any case, I'm not here to carry on an endless flame war. I just want to
make sure that we're doing what's necessary to mitigate the number one
vulnerability in web applications.

 

Thanks,

Bryan

 

From: Anthony Ferrara [mailto:ircmax...@gmail.com] 
Sent: Tuesday, September 18, 2012 2:12 PM
To: Bryan C. Geraghty
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] RFC: Implementing a core anti-XSS escaping class

 

Bryan,

On Tue, Sep 18, 2012 at 2:52 PM, Bryan C. Geraghty <br...@ravensight.org>
wrote:

Anthony,

 

I'll concede that the term "escaping" is improperly used in many places;
even in the OWASP documentation.

 

But I'll point out that the CWE document is identifying a distinction in the
two terms by saying,  "This overlapping usage extends to the Web, such as
the "escape" JavaScript function whose purpose is stated to be encoding".

 

There is a distinction between them. But in this case it's not particularly
relevant (as both work quite fine). I'll elaborate further in a second.

  

But when you say, "With the end result being the exact same...", I don't
think you've thought it through. I've read some of your stuff and I'm pretty
confident that you understand the benefits of white-listing over
black-listing. For the uninitiated, yes, a black-list can be configured to
produce the same results at a given point-in-time, but the fundamental
approach is different.  A white-list operates on an explicit specification
and lets nothing else through. A black-list assumes that the input data is
mostly correct and it filters out the bad. To add to that, how do you
convert from ISO-8859-1 to UTF-8 with a black-list or by escaping?

 

You hit the nail on the head here. You cannot black-list convert ISO-8859-1
to UTF-8. However, when we talk about escaping, we're talking about a
context where the encoding is already correct, we're just preventing special
characters from imparting special meaning. In that case, escaping is the
correct way of handling it.

 

But if you wanted to output arbitrary input into a UTF-8 document, you would
also need to ensure that it's encoded properly into UTF-8. So I can see your
distinction applying to that case. But from a different angle.

 

Escaping preserves the security context. Encoding preserves the semantic
context. You could escape away all invalid UTF-8 bytes, but you'd lose the
meaning of the original character set. So semantically, encoding is
necessary. But from a security perspective, the encoding doesn't really
matter much. What matters is the security context (not injecting harmful
code, etc). 

 

Now, both can be handled by the same routine. But that's not necessary to
preserve the security aspect. And that's why I objected to using the term
"encoding" here. If we want to go that route, that's fine. But you don't
need to encode for security. Escaping will handle that (possibly at the
expense of invalid semantic meaning).

 

Your reference to mysql_real_escape_string is exactly the point I'm trying
to make. The use of that function is "discouraged" because it DID escape; it
looked for specific bad characters. It was fundamentally flawed. And that is
the functionality PHP developers, as you just demonstrated, will refer to.
The current recommendation is to use a library that properly encodes the
entire data stream.

 

How is mres fundamentally flawed? And how is it discouraged? It's actually
listed as a valid defense by OWASP:
https://www.owasp.org/index.php/SQL_Injection_Prevention_Cheat_Sheet#Defense_Option_3:_Escaping_All_User_Supplied_Input

 

The only two ways of securely getting data to MySQL are either escaping or
binding as a parameter on a prepared statement, neither of which encodes a
data stream (the PS uses a binary format that puts the data in plain binary
form, as is, with a header to identify length).

 

Black listing works fine for a specified format (like XML, like HTML, like
SQL, like JavaScript). Where you get in trouble with black lists is when
your data format isn't specified (hence edge-cases aren't well known) or
when you're not serializing to a format (generic input black lists). But for
escaping output, black lists are a very well known, well understood, and
easily implemented approach.

  

I'll also agree that consistency with the industry is not as important
because there seem to be plenty of misuses. However, I do think that we
should use terminology that sets the functionality apart. So, given the
operating mode difference and the precedent set by mysql_escape_string,
mysql_real_escape_string, etc., I think "encode" is the way to go.

 

I think it strongly depends upon the exact behavior of the library. If we do
wind up doing transcoding as well as escaping, then that may be valid. If we
don't, then it wouldn't.

 

But I think we can both agree on the need...

 

Anthony
