> Currently, PHP strings are binary safe (thus can store any encoding).
> I generally think of PHP strings as being an array of bytes vs. a
> "string" you are familiar with in other languages. The name is
> unfortunate in that regard, but working with them is straightforward
> (imagine having an actual array of bytes in PHP and trying to work on
> them).
> 

PHP was the first language I learned to program in, followed by
JavaScript.

At that point, and for many years thereafter, I never thought of
strings as anything more than chunks of human readable text.

It wasn't until I started to learn C++ that my understanding of strings
changed. It was no longer "text", but a sequence of bytes.
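
To make the "sequence of bytes" point concrete, here is a minimal sketch comparing the byte count with the character count of the same PHP string. The variable names are mine, and it assumes the mbstring extension is loaded.

```php
<?php
// "é" is one character to a reader but two bytes in UTF-8.
$word = "h\u{E9}llo"; // "héllo"

$byteCount = strlen($word);             // counts bytes
$charCount = mb_strlen($word, 'UTF-8'); // counts code points
$hex       = bin2hex($word);            // the raw bytes PHP actually stores

echo $byteCount, "\n"; // 6
echo $charCount, "\n"; // 5
echo $hex, "\n";       // 68c3a96c6c6f
```

The `c3a9` in the middle of the hex dump is the two-byte UTF-8 encoding of "é".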

The problem is, when people start out learning a higher-level language
like PHP, they don't start by understanding what the computer actually
stores in memory, or how data structures are represented internally.

They write:

    echo "hello world!";

...and give no thought to how the letters they typed become the letters
that come out.

Character encoding doesn't cross your mind until a spammer tries to
paste foreign characters into a contact form and your application
crashes.

And when you try to learn more, the most you find is advice to slap a
few "utf-8" stickers in some places like the HTML response header, the
charset meta tag, and to use some mb_* internal/output encoding
functions.

I've been looking for the past few weeks now, and I've asked on some
community groups as well, and I have been unable to find a good,
comprehensive security-minded guide for dealing with multi-byte
characters and character attacks in PHP.

There's general guidance from the Unicode Consortium on what should be
done, but no guides on how to implement their security recommendations
in PHP.

One report is: https://www.unicode.org/reports/tr36

There are several recommendations in their guide.

They recommend that illegal byte sequences be replaced rather than
deleted. Deleting them creates an attack vector: two pieces of text that
were kept apart by an illegal sequence are joined back together once it
is removed, forming something new *after* the program has already
checked for dangerous characters:

https://www.unicode.org/reports/tr36/#SecureEncodingConversion
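
A minimal sketch of that attack, under some loud assumptions: `naive_filter` and `safer_filter` are hypothetical functions I made up for illustration, and the `str_replace` on the single byte `0xFF` (which can never appear in valid UTF-8) stands in for whatever "strip illegal sequences" step a real application might run.

```php
<?php
// Hypothetical naive filter: check first, then DELETE illegal bytes.
function naive_filter(string $input): string
{
    if (stripos($input, '<script') !== false) {
        return ''; // reject obviously dangerous input
    }
    // Stand-in for a "remove illegal sequences" step.
    return str_replace("\xFF", '', $input);
}

// Same check, but the illegal byte is REPLACED with U+FFFD instead.
function safer_filter(string $input): string
{
    if (stripos($input, '<script') !== false) {
        return '';
    }
    return str_replace("\xFF", "\u{FFFD}", $input);
}

// The illegal byte hides the tag from the check.
$attack = "<scr\xFFipt>alert(1)</scr\xFFipt>";

$afterDelete  = naive_filter($attack); // the tag is reassembled
$afterReplace = safer_filter($attack); // the tag stays broken
```

With deletion, the check passes and the output contains a working `<script>` tag. With replacement, U+FFFD keeps the two halves apart.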


In PHP, you should be able to replace illegal sequences, rather than
delete them, with:

    $ScrubbedBody = mb_scrub($_POST['body'], 'UTF-8');

But there's a pitfall here!

By default, `mb_scrub` and several other PHP conversion functions
replace illegal byte sequences with a `?` instead of `U+FFFD`, the
designated replacement character.

A question mark is an important character with special meaning, and the
default implementation of `mb_scrub` will allow an attacker to put a
`?` anywhere they want by inserting illegal bytes where they want a
question mark inserted.
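
Here is a small sketch of how that could play out. The path value is invented for illustration, and I set the substitute character to `0x3F` explicitly so the example mirrors the default regardless of local configuration.

```php
<?php
// Assumption: 0x3F ("?") is set explicitly here to mirror the default.
$previous = mb_substitute_character();
mb_substitute_character(0x3F);

// Hypothetical attacker input: no "?" anywhere, just one illegal byte.
$path = "/profile\xFFadmin=1";

$hadQuestionMark = strpos($path, '?') !== false; // false
$scrubbed = mb_scrub($path, 'UTF-8');            // illegal byte becomes "?"

mb_substitute_character($previous); // restore the earlier setting
```

Any check for `?` that ran before the scrub would have passed, yet the scrubbed value now looks like a path followed by a query string.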

To get the correct behavior, a developer must know to call:

    mb_substitute_character(0xFFFD);
    $ScrubbedBody = mb_scrub($_POST['body'], 'UTF-8');
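
Since `mb_substitute_character` changes a global setting, one option is to wrap the two calls above so the previous value is restored afterwards. This is only a sketch: `scrub_utf8` is a hypothetical helper name, not something PHP provides.

```php
<?php
// Hypothetical helper: scrub to U+FFFD without leaving the global
// substitute character changed for the rest of the request.
function scrub_utf8(string $input): string
{
    $previous = mb_substitute_character(); // current setting
    mb_substitute_character(0xFFFD);
    try {
        return mb_scrub($input, 'UTF-8');
    } finally {
        mb_substitute_character($previous); // always restore
    }
}
```

Usage would then be `$ScrubbedBody = scrub_utf8($_POST['body']);`.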

There are also some Unicode Consortium recommendations on sets of
characters that should be stripped from user input.

https://www.unicode.org/reports/tr36/#Recommendations_General 

The report says:

"Private use characters must be avoided in identifiers, except in
closed environments. There is no predicting what either the visual
display or the programmatic interpretation will be on any given
machine, so this can obviously lead to security problems."

They go on to say, "What is true for private use characters is doubly
true of unassigned code points. Secure systems will not use them: any
future Unicode Standard could assign those codepoints to any new
character. This is especially important in the case of certification."

But how do we remove these private use characters and unassigned code
points using PHP?

You can use `mb_ereg_replace` or `preg_replace` with the `/u` modifier
to remove character ranges, but this is clumsy at best.
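
For private use characters specifically, PCRE's Unicode property classes avoid hand-written ranges. A minimal sketch, assuming the input has already been scrubbed to valid UTF-8 (on invalid input `preg_match` returns false and `preg_replace` returns null). Both function names are hypothetical, and replacing with U+FFFD rather than deleting is my choice, not something the report prescribes for this case.

```php
<?php
// \p{Co} is the Unicode general category "Private Use".

// Hypothetical helper: does the text contain any private use character?
function contains_private_use(string $text): bool
{
    return preg_match('/\p{Co}/u', $text) === 1;
}

// Hypothetical helper: replace each private use character with U+FFFD.
// Returns null if the input is not valid UTF-8.
function replace_private_use(string $text): ?string
{
    return preg_replace('/\p{Co}/u', "\u{FFFD}", $text);
}
```

For example, `replace_private_use("a\u{E000}b")` gives `"a\u{FFFD}b"`, since U+E000 is the first code point of the BMP private use area.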

The guide warns against trying to restrict characters by language, and
recommends restricting by script (writing system) instead:

https://www.unicode.org/reports/tr36/#Language_Based_Security

"Creating "safe character sets" is an important goal in a security
context, and it would appear that the characters used in a language is
an obvious choice. However, because of the indeterminate set of
characters used for a language, it is typically more effective to move
to the higher level, the script, which can be more easily specified and
tested."

While I could probably hack together an array of regular expressions
for identifying white-listed (language) scripts, this seems like
something that should be built-in as a single function.
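
For what it's worth, the hack can be fairly small, because PCRE accepts script names such as `\p{Latin}` and `\p{Cyrillic}` directly. The sketch below is an assumption-heavy illustration: `uses_only_scripts` is a hypothetical helper, the input is assumed to be valid UTF-8 already, and which scripts to allow is entirely up to the application.

```php
<?php
// Hypothetical helper: true if every character in $text belongs to one
// of the named Unicode scripts (names as PCRE spells them).
function uses_only_scripts(string $text, array $scripts): bool
{
    if ($scripts === []) {
        throw new InvalidArgumentException('No scripts given');
    }
    $class = '';
    foreach ($scripts as $script) {
        // Only letters and underscores, so nothing else can be
        // smuggled into the regular expression.
        if (preg_match('/\A[A-Za-z_]+\z/', $script) !== 1) {
            throw new InvalidArgumentException("Bad script name: $script");
        }
        $class .= '\p{' . $script . '}';
    }
    // preg_match returns false on invalid UTF-8 or an unknown script
    // name, which this treats as "not allowed".
    return preg_match('/\A[' . $class . ']*\z/u', $text) === 1;
}

$plain = uses_only_scripts('Hello, world!', ['Latin', 'Common']);
// The second letter below is CYRILLIC SMALL LETTER IE, not Latin "e".
$mixed = uses_only_scripts("H\u{0435}llo", ['Latin', 'Common']);
```

`Common` covers spaces, digits and punctuation. A real list would probably also need `Inherited` for combining marks.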

In any application that reflects text back to other users, securely
processing incoming Unicode is as important to stopping XSS attacks as
PDO prepared statements are to stopping SQL injection.
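
On the output side, one thing that does exist today is the `ENT_SUBSTITUTE` flag of `htmlspecialchars`, which replaces invalid sequences with U+FFFD instead of returning an empty string. A minimal sketch, with `escape_for_html` being a hypothetical wrapper name:

```php
<?php
// Hypothetical wrapper: escape for an HTML context and replace any
// invalid UTF-8 sequence with U+FFFD rather than dropping the string.
function escape_for_html(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$escaped = escape_for_html("<b>\xFF"); // "&lt;b&gt;" followed by U+FFFD
```

This handles escaping and illegal bytes only. It does nothing about private use characters, unassigned code points, or script restrictions.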

As for the second recommendation, removing "unassigned code points", I
have not even started to work out how to do this with PHP.
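
One possible starting point, offered as a sketch rather than a solution: PCRE exposes the general category "Unassigned" as `\p{Cn}`. The big caveat is that "unassigned" here means unassigned in whatever Unicode version the installed PCRE library was built against, which may lag behind the current standard. `replace_unassigned` is a hypothetical name, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: replace code points that PCRE's Unicode tables
// treat as unassigned (\p{Cn}) with U+FFFD.
// Returns null if the input is not valid UTF-8.
function replace_unassigned(string $text): ?string
{
    return preg_replace('/\p{Cn}/u', "\u{FFFD}", $text);
}

// U+0378 is a reserved, unassigned code point in the Greek block.
$cleaned = replace_unassigned("a\u{0378}b");
```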

Since Unicode presents a security concern, I think it is important that
function behavior with regard to Unicode be well documented, and also,
that we have some functions that are easy to use to properly handle the
complexities of Unicode security.
