Re: [PHP-DEV] Proposal for better UTF-8 handling

Nicolas Grekas Mon, 27 May 2013 01:35:00 -0700

Btw, I hit a bug on grapheme_substr() that got no attention:
https://bugs.php.net/bug.php?id=62759


There is also https://bugs.php.net/bug.php?id=61860, which is still waiting for a fix.

Nicolas



On Mon, May 27, 2013 at 8:40 AM, Pierre Joye <[email protected]> wrote:

> hi!
>
> On Fri, May 24, 2013 at 3:17 AM, Rouven Weßling <[email protected]>
> wrote:
> > Hi Internals!
> >
> > First let me introduce myself, my name is Rouven Weßling, I'm a student
> at RWTH Aachen University and I'm one of the maintainers of the Joomla!
> Framework (née Platform). I've been following the internals list for a few
> months and have been brushing up my C skills for the past couple of months so
> I can start contributing.
> >
> > To me one of the most annoying things about working with PHP is the
> (lack of) Unicode support. In Joomla! we've been discussing switching from
> PHP UTF-8 to Patchwork UTF-8 for our needs of handling UTF-8. Both are
> libraries abstracting the multibyte extension and supplementing it with a
> number of functions. They also provide userland replacements for when
> multibyte is not available (Patchwork will also use iconv and intl if
> available). All of this is a huge pain.
> >
> > To ease this situation I'd like to make a new start at better Unicode
> support for PHP, this time focusing on UTF-8 as the dominant web encoding.
> As a first step I'd like to propose adding a set of functions for handling
> UTF-8 strings. This should keep applications from implementing these
> algorithms in PHP (also many of these are quite a bit faster, see benchmark
> results below). Once the algorithms are in place I'd like to look into
> creating a class for Unicode strings and eventually Python-like Unicode
> literals.
> >
> > Before I write an RFC I'd like to get some feedback on what you think about
> adding the following functions to PHP 5.6 (possibly more to follow):
> utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos, utf8_strrpos,
> utf8_str_split, utf8_strrev, utf8_recover, utf8_chr, utf8_ord,
> string_is_ascii.
> >
> > Most of them (exceptions are utf8_chr, utf8_is_valid, utf8_recover and
> string_is_ascii) are currently written in a way that they emit a warning
> when they encounter invalid UTF-8 and return null. This should
> encourage applications to check their input with utf8_is_valid and either
> stop further processing or fall back to utf8_recover to get a valid
> string. This should improve security since there are attack vectors when
> malformed sequences get interpreted as another encoding.
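
For illustration only, here is a minimal C sketch of the kind of
well-formedness check a function like utf8_is_valid would need to perform.
It is not taken from the pecl-utf8 extension; the name utf8_is_valid_sketch
and its signature are assumptions made for this example. It follows the byte
ranges of RFC 3629, so overlong forms (such as 0xC0 0x80) and encoded
surrogates, two well-known kinds of malformed input, are rejected.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not the extension's code): returns true if
 * buf[0..len) is well-formed UTF-8 according to RFC 3629. */
bool utf8_is_valid_sketch(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    size_t k;

    while (i < len) {
        unsigned char b = buf[i];
        size_t need;             /* continuation bytes expected after b */
        unsigned char lo = 0x80; /* allowed range of the first continuation byte */
        unsigned char hi = 0xBF;

        if (b <= 0x7F) {         /* plain ASCII */
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            need = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            need = 2;
            if (b == 0xE0) lo = 0xA0; /* rejects overlong 3-byte forms */
            if (b == 0xED) hi = 0x9F; /* rejects UTF-16 surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3;
            if (b == 0xF0) lo = 0x90; /* rejects overlong 4-byte forms */
            if (b == 0xF4) hi = 0x8F; /* rejects values above U+10FFFF */
        } else {
            return false;        /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */
        }

        if (len - i <= need)     /* sequence is cut off at the end of the buffer */
            return false;
        if (buf[i + 1] < lo || buf[i + 1] > hi)
            return false;
        for (k = 2; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;
        }
        i += need + 1;
    }
    return true;
}
```

A caller would run a check like this on untrusted input first and only
process the string further when it returns true.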
> >
> > You can find the code I've written so far here:
> https://github.com/realityking/pecl-utf8
> > You can find benchmark results here:
> http://realityking.github.io/pecl-utf8/results.html
>
> Without judging your extension, I wonder if you have looked at the
> intl extension for the PHP core parts. There are also some extensions
> in PECL that deal with non-ASCII strings.
>
> I have always promoted intl usage, as it handles UTF-8 and other
> encodings very well and covers everything needed to fully support
> Unicode; its data is kept up to date and the APIs are very stable. It
> has also been available since PHP 5.3, which makes it a very good
> choice to begin with.
>
> Cheers,
> --
> Pierre
>
> @pierrejoye | http://www.libgd.org
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: http://www.php.net/unsub.php
>
>
