On Fri, May 24, 2013 at 3:17 AM, Rouven Weßling <m...@rouvenwessling.de> wrote:
> Hi Internals!
>
> First let me introduce myself: my name is Rouven Weßling, I'm a student at
> RWTH Aachen University and I'm one of the maintainers of the Joomla!
> Framework (née Platform). I've been following the internals list for a few
> months and started brushing up on my C skills over the past couple of
> months so I can start contributing.
>
> To me one of the most annoying things about working with PHP is the (lack
> of) Unicode support. In Joomla! we've been discussing switching from PHP
> UTF-8 to Patchwork UTF-8 for our needs of handling UTF-8. Both are
> libraries abstracting the multibyte extension and supplementing it with a
> number of functions. They also provide userland replacements for when
> multibyte is not available (Patchwork will also use iconv and intl if
> available). All of this is a huge pain.
>
> To ease this situation I'd like to make a new start at better Unicode
> support for PHP, this time focusing on UTF-8 as the dominant web encoding.
> As a first step I'd like to propose adding a set of functions for handling
> UTF-8 strings. This should keep applications from implementing these
> algorithms in PHP (also many of these are quite a bit faster, see benchmark
> results below). Once the algorithms are in place I'd like to look into
> creating a class for Unicode strings and eventually Python-like Unicode
> literals.
>
> Before I write an RFC I'd like to get some feedback on what you think about
> adding the following functions to PHP 5.6 (possibly more to follow):
> utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos, utf8_strrpos,
> utf8_str_split, utf8_strrev, utf8_recover, utf8_chr, utf8_ord,
> string_is_ascii.
>
> Most of them (exceptions are utf8_chr, utf8_is_valid, utf8_recover and
> string_is_ascii) are currently written in a way that they emit a warning
> when they encounter invalid UTF-8 and return null. This should
> encourage applications to check their input with utf8_is_valid and either
> stop further processing or fall back to utf8_recover to get a valid
> string. This should improve security since there are attack vectors when
> malformed sequences get interpreted as another encoding.
>
> You can find the code I've written so far here:
> https://github.com/realityking/pecl-utf8
> You can find benchmark results here:
> http://realityking.github.io/pecl-utf8/results.html
>
> Best regards
> Rouven
>

We already have a lot of functions for multibyte string handling. Let me
list a few:

* The str* functions. Most of them are safe for use with UTF-8. The
  exceptions are basically everything where you manually provide an offset,
  e.g. writing substr($str, 0, 100) is not safe.
  substr($str, 0, strpos($str, 'xyz')), on the other hand, is.
* The mb* functions. They work with various encodings and usually make use
  of character offsets and lengths rather than byte offsets and lengths.
  They are not necessary most of the time, but useful for the aforementioned
  substr call with hardcoded offsets.
* The Intl extension. This gives you *real* Unicode support, as in
  collations, locales, transliteration, etc.
* The grapheme* functions, which are also part of intl. They work with
  grapheme cluster offsets and lengths.

Anyway, my point is that just adding *yet another* set of string functions
won't solve anything, just make things even more complicated than they
already are. I'm not strictly opposed to adding more functions if they are
necessary, but one has to be aware of what there already is and how the new
functions will integrate.

Nikita
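
A minimal sketch of the byte-offset point made about the str* and mb* functions above, assuming the mbstring extension is loaded and the input is valid UTF-8. The sample string and variable names are made up for illustration; they are not taken from the proposal or from any existing code.

```php
<?php
// Hypothetical sample: "café xyz", with "é" encoded as the two bytes 0xC3 0xA9.
$str = "caf\xC3\xA9 xyz";

// Byte length vs. character length.
$bytes = strlen($str);              // 9
$chars = mb_strlen($str, 'UTF-8');  // 8

// A hardcoded byte offset can cut a multibyte character in half.
$byteCut = substr($str, 0, 4);      // "caf" followed by a lone 0xC3 byte
$byteCutValid = mb_check_encoding($byteCut, 'UTF-8');   // false

// An offset obtained from strpos() falls on a character boundary,
// as long as both haystack and needle are valid UTF-8.
$prefix = substr($str, 0, strpos($str, 'xyz'));         // "café "
$prefixValid = mb_check_encoding($prefix, 'UTF-8');     // true

// With a hardcoded count, mb_substr() counts characters instead of bytes.
$charCut = mb_substr($str, 0, 4, 'UTF-8');              // "café"

var_dump($bytes, $chars, $byteCutValid, $prefixValid, $charCut === "caf\xC3\xA9");
```

The strpos() case works because UTF-8 is self-synchronizing: a valid needle can only match starting at a character boundary of a valid haystack, so the resulting byte offset never lands inside a multibyte sequence.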