On Sun, Apr 16, 2006 at 04:41:05PM -0500, Patrick R. Michaud wrote:

> I also realized this past week that using 'unicode:' on
> strings with \x (codepoints 128-255) may *still* be a bit 
> too liberal -- the « french angles » will still cause 
> "no ICU library present" errors, but would seemingly work
> just fine if iso-8859-1 is attempted.  I'm not wanting
> to block systems w/o ICU from working on Perl 6,
> so falling back to iso-8859-1 in this case seems like the 
> best of a bad situation.  (OTOH, there are some potential 
> problems with it on output.)

I haven't been near ICU for about a year, but last time I had dealings with
it, it wasn't horribly portable. Furthermore, it had set itself up for trouble
by having at least an n*m model of the world (compilers * operating systems)
rather than an n+m one (treat compiler-related features independently of
operating-system-related features). I was not impressed.

Clearly, though, ICU is a good enough solution for now, so I'm not suggesting
"burn it", even if it floats like a duck.

> Lastly, I suspect (and it's just a suspicion) that string
> operations on ASCII and iso-8859-1 strings are likely to be
> faster than their utf-8/unicode counterparts.  If this is
> true, then the more strings that we can keep in ASCII,
> the better off we are.  (And the vast majority of string
> literals I seem to be generating in PIR contain only ASCII
> characters.)
> 
> One other option is to make string operations such as
> downcase a bit smarter and figure out that it's okay
> to use the iso-8859-1 or ASCII algorithms/tables when
> the strings involved don't have any codepoints above 255.
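
Something like this sketch of the fast path (Python for brevity; illustrative
only, not Parrot code, and the table name is made up):

```python
# Hypothetical 256-entry lowercase table standing in for the
# iso-8859-1 tables mentioned above: a plain array lookup, no ICU.
_LATIN1_LOWER = [ord(chr(i).lower()) for i in range(256)]

def downcase(s: str) -> str:
    # If every codepoint fits in Latin-1, the table-driven
    # algorithm is sufficient and cheap.
    if all(ord(c) <= 0xFF for c in s):
        return "".join(chr(_LATIN1_LOWER[ord(c)]) for c in s)
    # Otherwise fall through to the full Unicode machinery
    # (str.lower here stands in for the ICU path).
    return s.lower()
```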

IIRC Jarkko's conclusion, from having had too many dealings with it in Perl 5,
was to avoid UTF-8 like the plague. Variable length encodings are fine for
data exchange, but suck internally as soon as you want to manipulate them.
With hindsight, his view was that Perl 5 probably should have gone for
UTF-32 internally. (Oh, and don't repeat *the* Perl 5 Unicode fallacy of
assuming that 8 bit data the user gave you happens to be in ISO-8859-1 when
nothing has told you so.)
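
The manipulation cost is concrete: finding the Nth character of a UTF-8
string means scanning from the start, while fixed width is a single offset
calculation. A sketch (Python; assumes well-formed UTF-8 and a scan starting
on a character boundary):

```python
def utf8_char_at(data: bytes, n: int) -> str:
    """Return the n-th character of UTF-8 bytes: an O(n) scan,
    because each character may occupy 1 to 4 bytes."""
    i = 0
    for _ in range(n):
        b = data[i]
        # The lead byte tells us the sequence length.
        if b < 0x80:
            i += 1
        elif b < 0xE0:
            i += 2
        elif b < 0xF0:
            i += 3
        else:
            i += 4
    b = data[i]
    length = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    return data[i:i + length].decode("utf-8")
```

With fixed-width 32-bit storage the same lookup is just `data[4*n : 4*n+4]`.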

I think Dan was thinking that internally everything should be fixed width,
and for practical reasons pick the smallest of 8 bit, UCS-2 and UTF-32
internally. Convert variable width to fixed width (losslessly) the first time
you need to do anything to it, and leave it that way.
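
Picking the tersest width is just a scan for the largest codepoint. A sketch
of that selection (Python; the function name is hypothetical):

```python
def smallest_fixed_width(s: str) -> int:
    """Bytes per character for the tersest lossless fixed-width
    representation of s, per the 8-bit / UCS-2 / UTF-32 scheme."""
    m = max(map(ord, s), default=0)
    if m <= 0xFF:
        return 1   # Latin-1: one byte per character
    if m <= 0xFFFF:
        return 2   # UCS-2: two bytes, BMP codepoints only
    return 4       # UTF-32: four bytes, covers everything
```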

Specifically, even case conversion would be better done as fixed width, as
there's at least one character where 1 of uppercase/titlecase/lowercase has
a different width from the other 2. (That's before you get to special cases
such as Greek sigma)
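
Two concrete instances, shown with Python's built-in case mappings (which
I'm assuming here match the Unicode tables): Turkish dotless i, U+0131,
uppercases to plain ASCII "I", so its UTF-8 byte width changes under case
conversion; and German ß uppercases to two characters outright.

```python
# U+0131 is 2 bytes in UTF-8; its uppercase "I" is 1 byte.
assert len("ı".encode("utf-8")) == 2
assert "ı".upper() == "I"

# One character in, two characters out: length changes even
# counted in codepoints, not just bytes.
assert "ß".upper() == "SS"

# And the Greek sigma special case: lowercase form depends on
# whether the sigma is final in the word.
assert "ΣΟΣ".lower() == "σος"
```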

Presumably, therefore, Dan's view was that while constants to the assembler
might well be fed in as UTF-8, the generated bytecode should use the
tersest fixed width it can. I can see sense in this.

Nicholas Clark