On Fri, May 24, 2013 at 3:09 PM, Nikita Popov <nikita....@gmail.com> wrote:
> On Fri, May 24, 2013 at 3:17 AM, Rouven Weßling <m...@rouvenwessling.de> wrote:
>
> > Hi Internals!
> >
> > First let me introduce myself: my name is Rouven Weßling, I'm a student
> > at RWTH Aachen University and I'm one of the maintainers of the Joomla!
> > Framework (née Platform). I've been following the internals list for a
> > few months and have been brushing up my C skills over the past couple
> > of months so I can start contributing.
> >
> > To me one of the most annoying things about working with PHP is the
> > (lack of) unicode support. In Joomla! we've been discussing switching
> > from PHP UTF-8 to Patchwork UTF-8 for our UTF-8 handling needs. Both
> > are libraries abstracting the multibyte extension and supplementing it
> > with a number of functions. They also provide userland replacements for
> > when multibyte is not available (Patchwork will also use iconv and intl
> > if available). All of this is a huge pain.
> >
> > To ease this situation I'd like to make a new start at better unicode
> > support for PHP, this time focusing on UTF-8 as the dominant web
> > encoding. As a first step I'd like to propose adding a set of functions
> > for handling UTF-8 strings. This should keep applications from
> > implementing these algorithms in PHP (also many of these are quite a
> > bit faster, see benchmark results below). Once the algorithms are in
> > place I'd like to look into creating a class for unicode strings and
> > eventually Python-like unicode literals.
> >
> > Before I write an RFC I'd like to get some feedback on what you think
> > about adding the following functions to PHP 5.6 (possibly more to
> > follow): utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos,
> > utf8_strrpos, utf8_str_split, utf8_strrev, utf8_recover, utf8_chr,
> > utf8_ord, string_is_ascii.
> >
> > Most of them (exceptions are utf8_chr, utf8_is_valid, utf8_recover and
> > string_is_ascii) are currently written in a way that they emit a warning
> > and return null when they encounter invalid UTF-8. This should
> > encourage applications to check their input with utf8_is_valid and
> > either stop further processing or fall back to utf8_recover to get a
> > valid string. This should improve security since there are attack
> > vectors when malformed sequences get interpreted as another encoding.
> >
> > You can find the code I've written so far here:
> > https://github.com/realityking/pecl-utf8
> > You can find benchmark results here:
> > http://realityking.github.io/pecl-utf8/results.html
> >
> > Best regards
> > Rouven
>
> We already have a lot of functions for multibyte string handling. Let me
> list a few:
>
> * The str* functions. Most of them are safe for usage with UTF-8.
>   Exceptions are basically everything where you manually provide an
>   offset, e.g. writing substr($str, 0, 100) is not safe.
>   substr($str, 0, strpos($str, 'xyz')) on the other hand is.
> * The mb* functions. They work with various encodings and usually make
>   use of character offsets and lengths rather than byte offsets and
>   lengths. They are not necessary most of the time, but useful for the
>   aforementioned substr call with hardcoded offsets.
> * The Intl extension. This gives you *real* unicode support, as in
>   collations, locales, transliteration, etc.
> * The grapheme* functions, which are also part of intl. They work with
>   grapheme cluster offsets and lengths.
>
> Anyway, my point is that just adding *yet another* set of string functions
> won't solve anything, just make things even more complicated than they
> already are. I'm not strictly opposed to adding more functions if they are
> necessary, but one has to be aware of what there already is and how the new
> functions will integrate.
>
> Nikita

did you just forget the pcre functions with the /u modifier?!?! :P

-- 
Ferenc Kovács
@Tyr43l - http://tyrael.hu