Re: [PHP-DEV] Proposal for better UTF-8 handling

Ferenc Kovacs Fri, 24 May 2013 08:28:22 -0700

On Fri, May 24, 2013 at 3:09 PM, Nikita Popov <[email protected]> wrote:


> On Fri, May 24, 2013 at 3:17 AM, Rouven Weßling <[email protected]> wrote:
>
> > Hi Internals!
> >
> > First let me introduce myself: my name is Rouven Weßling, I'm a student at
> > RWTH Aachen University and I'm one of the maintainers of the Joomla!
> > Framework (née Platform). I've been following the internals list for a few
> > months and started brushing up on my C skills over the past couple of months
> > so I can start contributing.
> >
> > To me one of the most annoying things about working with PHP is the (lack
> > of) Unicode support. In Joomla! we've been discussing switching from PHP
> > UTF-8 to Patchwork UTF-8 for our needs of handling UTF-8. Both are
> > libraries abstracting the multibyte extension and supplementing it with a
> > number of functions. They also provide userland replacements for when
> > multibyte is not available (Patchwork will also use iconv and intl if
> > available). All of this is a huge pain.
> >
> > To ease this situation I'd like to make a new start at better Unicode
> > support for PHP, this time focusing on UTF-8 as the dominant web encoding.
> > As a first step I'd like to propose adding a set of functions for handling
> > UTF-8 strings. This should keep applications from implementing these
> > algorithms in PHP (many of these are also quite a bit faster, see the
> > benchmark results below). Once the algorithms are in place I'd like to look
> > into creating a class for Unicode strings and eventually Python-like Unicode
> > literals.
> >
> > Before I write an RFC I'd like to get some feedback on what you think about
> > adding the following functions to PHP 5.6 (possibly more to follow):
> > utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos, utf8_strrpos,
> > utf8_str_split, utf8_strrev, utf8_recover, utf8_chr, utf8_ord,
> > string_is_ascii.
> >
> > Most of them (the exceptions are utf8_chr, utf8_is_valid, utf8_recover and
> > string_is_ascii) are currently written so that they emit a warning and
> > return null when they encounter invalid UTF-8. This should encourage
> > applications to check their input with utf8_is_valid and either stop
> > further processing or fall back to utf8_recover to get a valid string.
> > This should improve security, since there are attack vectors in which
> > malformed sequences get interpreted as another encoding.
> >
> > You can find the code I've written so far here:
> > https://github.com/realityking/pecl-utf8
> > You can find benchmark results here:
> > http://realityking.github.io/pecl-utf8/results.html
> >
> > Best regards
> > Rouven
> >
>
> We already have a lot of functions for multibyte string handling. Let me
> list a few:
>
>  * The str* functions. Most of them are safe for use with UTF-8. The
> exceptions are basically everything where you manually provide an offset,
> e.g. writing substr($str, 0, 100) is not safe. substr($str, 0, strpos($str,
> 'xyz')) on the other hand is.
>  * The mb* functions. They work with various encodings and usually make use
> of character offsets and lengths rather than byte offsets and lengths. They
> are not necessary most of the time, but useful for the aforementioned
> substr call with hardcoded offsets.
>  * The Intl extension. This gives you *real* Unicode support, as in
> collations, locales, transliteration, etc.
>  * The grapheme* functions, which are also part of intl. They work with
> grapheme cluster offsets and lengths.
>
> Anyway, my point is that just adding *yet another* set of string functions
> won't solve anything, just make things even more complicated than they
> already are. I'm not strictly opposed to adding more functions if they are
> necessary, but one has to be aware of what there already is and how the new
> functions will integrate.
>
> Nikita
>

Did you just forget the PCRE functions with the /u modifier?!
:P
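
For reference, a minimal sketch of the byte-offset problem described above and
of what the /u modifier offers. The sample string and variable names are
assumptions chosen for illustration, not something from the thread, and this
is only one way to show the difference:

```php
<?php
// Hypothetical sample: "Weßling", with "ß" written as its two UTF-8 bytes
// so the example does not depend on the encoding of this file.
$str = "We\xC3\x9Fling";

// Byte-oriented functions: strlen() counts bytes, and a hardcoded
// substr() length can cut a multibyte character in half.
$bytes = strlen($str);        // 8 bytes for 7 characters
$cut   = substr($str, 0, 3);  // "We" plus only the first byte of "ß"

// An offset returned by strpos() always falls on a character boundary
// in valid UTF-8, so combining it with substr() is safe.
$head = substr($str, 0, strpos($str, 'ling'));

// With /u, PCRE treats pattern and subject as UTF-8, so "." matches a
// whole code point rather than a single byte.
$chars  = preg_match_all('/./su', $str);
$first3 = preg_match('/^.{0,3}/su', $str, $m) ? $m[0] : null;

// preg_match() returns false for a malformed UTF-8 subject under /u,
// which makes an empty pattern usable as a crude validity check.
$valid = preg_match('//u', $cut) === 1;
```

Here $cut ends in a dangling lead byte, while $head and $first3 both hold the
intact string "Weß", and $valid is false for the truncated string.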

-- 
Ferenc Kovács
@Tyr43l - http://tyrael.hu
