Btw, I hit a bug in grapheme_substr() that has received no attention: https://bugs.php.net/bug.php?id=62759
There is also https://bugs.php.net/bug.php?id=61860, which is still waiting for a fix.

Nicolas

On Mon, May 27, 2013 at 8:40 AM, Pierre Joye <pierre....@gmail.com> wrote:

> hi!
>
> On Fri, May 24, 2013 at 3:17 AM, Rouven Weßling <m...@rouvenwessling.de> wrote:
> > Hi Internals!
> >
> > First let me introduce myself: my name is Rouven Weßling, I'm a
> > student at RWTH Aachen University, and I'm one of the maintainers of
> > the Joomla! Framework (née Platform). I've been following the
> > internals list for a few months and have been brushing up my C
> > skills for the past couple of months so I can start contributing.
> >
> > To me, one of the most annoying things about working with PHP is the
> > (lack of) Unicode support. In Joomla! we've been discussing
> > switching from PHP UTF-8 to Patchwork UTF-8 for our UTF-8 handling.
> > Both are libraries that abstract the multibyte extension and
> > supplement it with a number of functions. They also provide userland
> > replacements for when multibyte is not available (Patchwork will
> > also use iconv and intl if available). All of this is a huge pain.
> >
> > To ease this situation I'd like to make a new start at better
> > Unicode support for PHP, this time focusing on UTF-8 as the dominant
> > web encoding. As a first step I'd like to propose adding a set of
> > functions for handling UTF-8 strings. This should keep applications
> > from implementing these algorithms in PHP (many of these are also
> > quite a bit faster, see the benchmark results below). Once the
> > algorithms are in place I'd like to look into creating a class for
> > Unicode strings and eventually Python-like Unicode literals.
> >
> > Before I write an RFC I'd like to get some feedback on what you
> > think about adding the following functions to PHP 5.6 (possibly more
> > to follow): utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos,
> > utf8_strrpos, utf8_str_split, utf8_strrev, utf8_recover, utf8_chr,
> > utf8_ord, string_is_ascii.
> >
> > Most of them (the exceptions are utf8_chr, utf8_is_valid,
> > utf8_recover and string_is_ascii) are currently written so that they
> > emit a warning and return null when they encounter invalid UTF-8.
> > This should encourage applications to check their input with
> > utf8_is_valid and either stop further processing or fall back to
> > utf8_recover to get a valid string. This should improve security,
> > since there are attack vectors when malformed sequences get
> > interpreted as another encoding.
> >
> > You can find the code I've written so far here:
> > https://github.com/realityking/pecl-utf8
> >
> > You can find benchmark results here:
> > http://realityking.github.io/pecl-utf8/results.html
>
> Without judging your extension, I wonder if you have looked at the
> intl extension, for the PHP core parts. There are also some extensions
> in PECL to deal with non-ASCII strings.
>
> I have always promoted intl usage, as it handles UTF-8 and other
> encodings very well and covers everything needed to fully support
> Unicode; its data is kept updated and the APIs are very stable. It has
> also been available since PHP 5.3, which makes it a very good choice
> to begin with.
>
> Cheers,
> --
> Pierre
>
> @pierrejoye | http://www.libgd.org
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: http://www.php.net/unsub.php