Re: [PHP-DEV] Proposal for better UTF-8 handling

Pierre Joye Sun, 26 May 2013 23:41:06 -0700

hi!

On Fri, May 24, 2013 at 3:17 AM, Rouven Weßling <[email protected]> wrote:
> Hi Internals!
>
> First let me introduce myself: my name is Rouven Weßling, I'm a student at 
> RWTH Aachen University and I'm one of the maintainers of the Joomla! 
> Framework (née Platform). I've been following the internals list for a few 
> months and have been brushing up on my C skills over the past couple of 
> months so I can start contributing.
>
> To me one of the most annoying things about working with PHP is the (lack of) 
> Unicode support. In Joomla! we've been discussing switching from PHP UTF-8 to 
> Patchwork UTF-8 for our needs of handling UTF-8. Both are libraries 
> abstracting the multibyte extension and supplementing it with a number of 
> functions. They also provide userland replacements for when multibyte is not 
> available (Patchwork will also use iconv and intl if available). All of this 
> is a huge pain.
>
> To ease this situation I'd like to make a new start at better Unicode support 
> for PHP, this time focusing on UTF-8 as the dominant web encoding. As a first 
> step I'd like to propose adding a set of functions for handling UTF-8 
> strings. This should keep applications from implementing these algorithms in 
> PHP (also many of these are quite a bit faster, see benchmark results below). 
> Once the algorithms are in place I'd like to look into creating a class for 
> Unicode strings and eventually Python-like Unicode literals.
>
> Before I write an RFC I'd like to get some feedback on what you think about 
> adding the following functions to PHP 5.6 (possibly more to follow): 
> utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos, utf8_strrpos, 
> utf8_str_split, utf8_strrev, utf8_recover, utf8_chr, utf8_ord, 
> string_is_ascii.
>
> Most of them (the exceptions are utf8_chr, utf8_is_valid, utf8_recover and 
> string_is_ascii) are currently written so that they emit a warning when 
> they encounter invalid UTF-8 and return null. This should encourage 
> applications to check their input with utf8_is_valid and either stop further 
> processing or to fall back to utf8_recover to get a valid string. This should 
> improve security since there are attack vectors when malformed sequences get 
> interpreted as another encoding.
>
> You can find the code I've written so far here: 
> https://github.com/realityking/pecl-utf8
> You can find benchmark results here: 
> http://realityking.github.io/pecl-utf8/results.html
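
The validate-first behaviour described in the quoted proposal (check the input, and refuse to process it when it is malformed) can be illustrated with a minimal C sketch. This is not the code from the linked repository. The names sketch_utf8_seq_len, sketch_utf8_is_valid and sketch_utf8_strlen are hypothetical, and returning -1 stands in for the "emit a warning and return null" behaviour the proposal mentions. The byte ranges follow the well-formed sequences of RFC 3629, so overlong forms, surrogates and values above U+10FFFF are rejected.

```c
#include <stddef.h>

/* True when b is a UTF-8 continuation byte (10xxxxxx). */
static int sketch_is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Length in bytes (1-4) of the well-formed sequence at s, or 0 if malformed.
 * n is the number of bytes remaining in the buffer. */
static size_t sketch_utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char b0;

    if (n == 0) return 0;
    b0 = s[0];

    if (b0 < 0x80) return 1;                      /* ASCII */

    if (b0 >= 0xC2 && b0 <= 0xDF)                 /* 2 bytes; C0/C1 are overlong */
        return (n >= 2 && sketch_is_cont(s[1])) ? 2 : 0;

    if (b0 >= 0xE0 && b0 <= 0xEF) {               /* 3 bytes */
        if (n < 3 || !sketch_is_cont(s[1]) || !sketch_is_cont(s[2])) return 0;
        if (b0 == 0xE0 && s[1] < 0xA0) return 0;  /* overlong encoding */
        if (b0 == 0xED && s[1] > 0x9F) return 0;  /* UTF-16 surrogate range */
        return 3;
    }

    if (b0 >= 0xF0 && b0 <= 0xF4) {               /* 4 bytes */
        if (n < 4 || !sketch_is_cont(s[1]) || !sketch_is_cont(s[2]) ||
            !sketch_is_cont(s[3])) return 0;
        if (b0 == 0xF0 && s[1] < 0x90) return 0;  /* overlong encoding */
        if (b0 == 0xF4 && s[1] > 0x8F) return 0;  /* above U+10FFFF */
        return 4;
    }

    return 0;  /* stray continuation byte, or invalid lead byte C0, C1, F5-FF */
}

/* Returns 1 if the len bytes at str are well-formed UTF-8, otherwise 0. */
int sketch_utf8_is_valid(const char *str, size_t len)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t i = 0;

    while (i < len) {
        size_t step = sketch_utf8_seq_len(s + i, len - i);
        if (step == 0) return 0;
        i += step;
    }
    return 1;
}

/* Returns the number of code points, or -1 if the input is malformed. */
long sketch_utf8_strlen(const char *str, size_t len)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t i = 0;
    long count = 0;

    while (i < len) {
        size_t step = sketch_utf8_seq_len(s + i, len - i);
        if (step == 0) return -1;
        i += step;
        count++;
    }
    return count;
}
```

A caller would run sketch_utf8_is_valid on untrusted input first and only then hand the string to length or substring routines. Note that this sketch counts code points, not user-perceived characters (grapheme clusters).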


Without judging your extension, I wonder if you have looked at the
intl extension for the PHP core parts. There are also some extensions
in PECL that deal with non-ASCII strings.

I have always promoted intl because it handles UTF-8 and other
encodings very well and covers everything needed to fully support
Unicode. Its data is kept up to date and its APIs are very stable. It
has also been available since PHP 5.3, which makes it a very good
choice to begin with.

Cheers,
--
Pierre

@pierrejoye | http://www.libgd.org

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
