Re: [PHP-DEV] Proposal for better UTF-8 handling

Nikita Popov Fri, 24 May 2013 06:10:15 -0700

On Fri, May 24, 2013 at 3:17 AM, Rouven Weßling <m...@rouvenwessling.de> wrote:


> Hi Internals!
>
> First let me introduce myself, my name is Rouven Weßling, I'm a student at
> RWTH Aachen University and I'm one of the maintainers of the Joomla!
> Framework (née Platform). I've been following the internals list for a few
> months and have been brushing up on my C skills for the past couple of
> months so I can start contributing.
>
> To me one of the most annoying things about working with PHP is the (lack
> of) unicode support. In Joomla! we've been discussing switching from PHP
> UTF-8 to Patchwork UTF-8 for our needs of handling UTF-8. Both are
> libraries abstracting the multibyte extension and supplementing it with a
> number of functions. They also provide userland replacements for when
> multibyte is not available (Patchwork will also use iconv and intl if
> available). All of this is a huge pain.
>
> To ease this situation I'd like to make a new start at better unicode
> support for PHP, this time focusing on UTF-8 as the dominant web encoding.
> As a first step I'd like to propose adding a set of functions for handling
> UTF-8 strings. This should keep applications from implementing these
> algorithms in PHP (also many of these are quite a bit faster, see benchmark
> results below). Once the algorithms are in place I'd like to look into
> creating a class for Unicode strings and eventually Python-like Unicode
> literals.
>
> Before I write an RFC I'd like to get some feedback on what you think about
> adding the following functions to PHP 5.6 (possibly more to follow):
> utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos, utf8_strrpos,
> utf8_str_split, utf8_strrev, utf8_recover, utf8_chr, utf8_ord,
> string_is_ascii.
>
> Most of them (the exceptions are utf8_chr, utf8_is_valid, utf8_recover and
> string_is_ascii) are currently written so that they emit a warning and
> return null when they encounter invalid UTF-8. This should encourage
> applications to check their input with utf8_is_valid and either stop
> further processing or fall back to utf8_recover to get a valid string.
> This should improve security, since there are attack vectors when
> malformed sequences get interpreted as another encoding.
>
> You can find the code I've written so far here:
> https://github.com/realityking/pecl-utf8
> You can find benchmark results here:
> http://realityking.github.io/pecl-utf8/results.html
>
> Best regards
> Rouven
>

We already have a lot of functions for multibyte string handling. Let me
list a few:

 * The str* functions. Most of them are safe to use with UTF-8.
Exceptions are basically everything where you manually provide an offset,
e.g. writing substr($str, 0, 100) is not safe. substr($str, 0, strpos($str,
'xyz')) on the other hand is.
 * The mb* functions. They work with various encodings and usually make use
of character offsets and lengths rather than byte offsets and lengths. They
are not necessary most of the time, but useful for the aforementioned
substr call with hardcoded offsets.
 * The Intl extension. This gives you *real* Unicode support, as in
collations, locales, transliteration, etc.
 * The grapheme* functions, which are also part of intl. They work with
grapheme cluster offsets and lengths.
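
To make the difference between byte, character and grapheme offsets concrete, here is a minimal sketch. It assumes the mbstring extension is loaded (the grapheme part is skipped when intl is missing); the sample string and variable names are purely illustrative.

```php
<?php
// 'ß' is written as its two UTF-8 bytes (0xC3 0x9F) so the example does not
// depend on the encoding of this source file.
$str = "We\xC3\x9Fling xyz";

// Hardcoded byte offset: splits 'ß' in half and leaves invalid UTF-8.
$cut = substr($str, 0, 3);

// Byte offset taken from strpos(): a valid UTF-8 needle can only match at a
// character boundary of a valid UTF-8 haystack, so the cut is safe.
$safe = substr($str, 0, strpos($str, 'xyz'));

// Character offset: mb_substr() counts code points, not bytes.
$chars = mb_substr($str, 0, 3, 'UTF-8');

var_dump(bin2hex($cut));                          // string(6) "5765c3"
var_dump(mb_check_encoding($cut, 'UTF-8'));       // bool(false)
var_dump($safe === "We\xC3\x9Fling ");            // bool(true)
var_dump($chars === "We\xC3\x9F");                // bool(true)
var_dump(strlen($str), mb_strlen($str, 'UTF-8')); // int(12) int(11)

// Grapheme clusters (intl only): 'e' followed by a combining acute accent is
// one user-visible character but two code points.
if (function_exists('grapheme_strlen')) {
    $e = "e\xCC\x81";
    var_dump(mb_strlen($e, 'UTF-8'), grapheme_strlen($e)); // int(2) int(1)
}
```

The only call that produces broken output is the one with the hardcoded byte offset, which is the case the mb* functions cover.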

Anyway, my point is that just adding *yet another* set of string functions
won't solve anything, just make things even more complicated than they
already are. I'm not strictly opposed to adding more functions if they are
necessary, but one has to be aware of what there already is and how the new
functions will integrate.

Nikita
