.graphemes methods

Aaron Sherman Fri, 02 Jul 2004 13:50:21 -0700

On Tue, 2004-06-29 at 11:34, Austin Hastings wrote:

> [...] when you switch to LC_ALL= <pick your favorite
> language>, you just get really slow performance: Apparently the 'C'
> locale is such a totally special case that the performance of LC_ALL=C
> is one or more orders of magnitude better than LC_ALL=en_US.UTF-8, even
> when the data is 7bit ascii.


Well, of course. I can't imagine a way in which this would not be true.

After all, in LC_ALL="C" the number of characters in a string is equal
to the number of bytes in the string. In LC_ALL="en_US.UTF-8" the length
of a string is dependent on what exactly you mean by length, and a lot
of special cases arise. Special cases and context mean you have more
code to execute for the same logical task, which means you have more
processing to do.
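
To make "what exactly you mean by length" concrete, here is a minimal Python sketch (Python rather than Perl 6, purely for illustration). The helper name `lengths` is made up for this example, and the grapheme count is only an approximation: it skips combining marks instead of applying the full Unicode segmentation rules.

```python
import unicodedata

def lengths(text):
    """Return (bytes, code points, approximate graphemes) for text."""
    n_bytes = len(text.encode("utf-8"))
    n_codepoints = len(text)
    # Approximation (an assumption of this sketch): a grapheme is a
    # code point that is not a combining mark, plus any marks after it.
    n_graphemes = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return n_bytes, n_codepoints, n_graphemes

ascii_word = "resume"
accented = "re\u0301sume\u0301"  # each 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(lengths(ascii_word))  # (6, 6, 6)
print(lengths(accented))    # (10, 8, 6)
```

For the pure ASCII string all three answers agree, which is why the "C" locale can get away with a single byte count. Once combining marks appear, the three answers diverge and the code has to decide which one the caller meant.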

Unicode support is expensive, even if you're just doing ASCII-as-UTF-8.
That doesn't mean it's a bad thing to do; it's just expensive.

> I think that (1) this is unacceptable: the temptation to switch to the
> 'C' locale has been too great, both at this site and on a lot of the RH
> support forums; 

And yet, in English-speaking countries (and Hawaiian- and
Swahili-speaking countries, for that matter) and in situations where the
fidelity of certain types of string data (such as names) is not
considered critical, this is a fine default, e.g. for general shell
work.

> (2) Perl6 should equitably support all its target
> locales; (3) we should set out to make sure the performance is damn
> fast no matter what locale we're using.

Well, that's a nice theory, but you can prove that low-level encodings
(e.g. ASCII, EBCDIC), where one byte is always one character, will be
more efficient than high-level encodings (e.g. UTF-8), where it is not.
So the only way to accomplish what you suggest in (2) is to break (3) by
slowing down the faster handling (not what you wanted, I'm sure).
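
One way to see the cost difference is a rough Python sketch of fetching the n-th character. Both function names are hypothetical, and this is an illustration of the general argument, not of how any Perl or Parrot implementation actually works.

```python
def nth_char_single_byte(data: bytes, n: int) -> bytes:
    """Single-byte encoding: character n is simply byte n."""
    if not 0 <= n < len(data):
        raise IndexError(n)
    return data[n:n + 1]

def nth_char_utf8(data: bytes, n: int) -> bytes:
    """UTF-8: scan from the start, counting non-continuation bytes."""
    seen = -1
    start = None
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:      # ASCII byte or lead byte of a sequence
            seen += 1
            if seen == n:
                start = i           # character n begins here
            elif seen == n + 1:
                return data[start:i]
    if start is None:
        raise IndexError(n)
    return data[start:]             # character n ran to the end of the data

encoded = "a\u00e9z".encode("utf-8")     # b'a\xc3\xa9z'
print(nth_char_single_byte(b"abc", 1))   # b'b'
print(nth_char_utf8(encoded, 1))         # b'\xc3\xa9'
```

The single-byte lookup is one slice. The UTF-8 lookup has to walk the bytes before position n, even when every one of them turns out to be plain ASCII.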

Of course, you want to have as much performance out of string handling
as possible.

> This has no direct bearing on p6l, since performance is a p6i issue.
> But perhaps in the interests of performance as well as hackery we
> should explicitly provide some sort of variant regex behavior:
> 
>     /a./ :bytes
>     /a./ :graphemes

As pointed out by others, this is already there, though I'm not sure
that it would be specified that way. More likely:

        m :u0 /a./
        [etc]
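
As a loose analogy only (Python's `re` module, not Perl 6 syntax or semantics), the same pattern can be run at byte level and at code point level to show how the meaning of "." changes:

```python
import re

text = "a\u00e9"               # 'a' followed by precomposed e-acute (U+00E9)
data = text.encode("utf-8")    # b'a\xc3\xa9'

# Code point level: '.' consumes one whole code point.
codepoint_match = re.match(r"a.", text, re.DOTALL)

# Byte level: '.' consumes one byte, splitting the two-byte sequence.
byte_match = re.match(rb"a.", data, re.DOTALL)

print(codepoint_match.group())  # 'aé'
print(byte_match.group())       # b'a\xc3'
```

The byte-level match stops in the middle of a character, which is fast but only safe when the caller really does want bytes.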

-- 
Aaron Sherman <[EMAIL PROTECTED]>
Senior Systems Engineer and Perl Toolsmith
http://www.ajs.com/~ajs/resume.html

Re: The .bytes/.codepoints/.graphemes methods
