I wanted to reply generally to this and not to any person in
particular, as I'm the one who started the thread.

I used the rather broad title "Should All String Functions Become
Multi-Byte Safe" because there are many smaller related topics, but my
intention was to discuss multi-byte handling in general, and to see
whether there was consensus on action items, each of which could have
a more limited scope and its own RFC.

My overall goal was to make PHP safer against multi-byte attacks by
providing developers with tools that could become best practices for
dealing with user input strings, much as mysql_real_escape_string, and
later PDO prepared statements, did for SQL.

There are many potential pitfalls in dealing with Unicode input. The
Unicode Consortium publishes best practices that I'm not sure how to
implement in PHP, and since everyone needs them, they might be better
as a shared library in core.

For example, there should be a function that removes unassigned code
points.

There should also be a function that removes characters belonging to
given "scripts" (as defined by Unicode).

We should have an easy way to remove private use code points (unless
you're running a Star Trek fan site and really do need Klingon).
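
To make the three ideas above concrete, here is a rough userland
sketch using PCRE's Unicode properties (`\p{Cn}` for unassigned,
`\p{Co}` for private use, and script names such as `\p{Cyrillic}`).
The function names are hypothetical and exist nowhere in PHP; what
counts as "unassigned" depends on the Unicode version of the PCRE
library PHP was built against; and returning an empty string for
invalid UTF-8 input is just one possible choice.

```php
<?php
// Hypothetical helpers for illustration only; none of these are in PHP core.

// Remove unassigned code points (general category Cn), as known to the
// PCRE library in use. Returns '' if $s is not valid UTF-8.
function strip_unassigned(string $s): string
{
    return preg_replace('/\p{Cn}+/u', '', $s) ?? '';
}

// Remove private use code points (general category Co).
function strip_private_use(string $s): string
{
    return preg_replace('/\p{Co}+/u', '', $s) ?? '';
}

// Remove every character belonging to one Unicode script, e.g. "Cyrillic".
function strip_script(string $s, string $script): string
{
    // Accept only plain names so nothing else can be injected into the pattern.
    if (!preg_match('/^[A-Za-z_]+$/', $script)) {
        throw new InvalidArgumentException('Invalid script name');
    }
    return preg_replace('/\p{' . $script . '}+/u', '', $s) ?? '';
}

// Usage: U+E000 is a private use code point, U+0430 is a Cyrillic letter.
$clean = strip_script(strip_private_use("abc\u{E000}\u{0430}"), 'Cyrillic');
```

A built-in version could do the same work without the regex engine and
could define clearly what happens on invalid input, which is part of
why I think this belongs in core.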

And the default replacement character for `mb_scrub` shouldn't be `?`.
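
For reference, the substitute character can already be changed per
request with `mb_substitute_character()`. The sketch below assumes the
mbstring extension is loaded and picks U+FFFD REPLACEMENT CHARACTER as
the alternative; both the wrapper name `scrub_utf8` and the choice of
U+FFFD are assumptions for illustration, not a settled proposal.

```php
<?php
// Hypothetical wrapper: scrub invalid UTF-8 using U+FFFD instead of "?".
function scrub_utf8(string $s): string
{
    $previous = mb_substitute_character();    // remember the current setting
    mb_substitute_character(0xFFFD);          // substitute with U+FFFD
    try {
        return mb_scrub($s, 'UTF-8');
    } finally {
        mb_substitute_character($previous);   // restore the earlier setting
    }
}

// Usage: "\xFF" can never appear in valid UTF-8.
$scrubbed = scrub_utf8("a\xFFb");
```

Having to save and restore global state like this for every call is
exactly the kind of thing a better default would avoid.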

Each of these and other ideas could be part of an RFC, or we could
brainstorm a Unicode built-in class that handles lots of the common use
cases.

Having a team-built and audited Unicode class would benefit almost
everyone using PHP.
