Re: [perl #48108] [BUG] downcase opcode fails on unicode strings w/o icu

chromatic Wed, 05 Dec 2007 11:33:59 -0800

On Tuesday 04 December 2007 08:14:41 Patrick R. Michaud wrote:

> If ICU isn't present, Parrot's C<downcase> opcode always throws
> an exception.  It does this even if the string contains codepoints
> only in the ascii and/or iso-8859-1 range.
>
> For example:
>
>     $ cat x.pir
>     .sub main :main
>         $S0 = unicode:"hello world"
>         $S1 = downcase $S0
>         say $S1
>     .end
>
>     $ ./parrot x.pir
>     no ICU lib loaded
>     current instr.: 'main' pc 3 (x.pir:3)
>
> This may cause a problem for Perl 6 programs, since the source
> code is always read as Unicode, and particularly affects the
> C< « > and C< » > characters (codepoints U+00ab and U+00bb).
>
> So far the major place I've run into this is in PGE, and I have
> a workaround there [1], but it will certainly crop up in many
> other places as we get more Perl 6 programs going.
>
> Pm
>
> [1]  PGE only has to downcase a single character at a time,
>      so instead of doing "$S1 = downcase $S0" it can cheat with
>
>          $I0 = ord $S0
>          $S1 = chr $I0
>          $S1 = downcase $S1
>
>      This works because chr with codepoints < 256 produces
>      strings as either ascii or iso-8859-1, and downcase can
>      work with that.


As a workaround (writing Unicode downcasing by hand in the absence of ICU
is... tricky), can you convert the strings from Unicode to ISO-8859-1 with
the trans_charset op before downcasing them?

    $I0 = find_charset 'iso-8859-1'
    $S0 = unicode:"Hello world"
    $S0 = trans_charset $S0, $I0
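
Putting the two messages together, the workaround might look like the sketch
below as a complete program. It reuses only the ops already shown in this
thread (find_charset, trans_charset, downcase, say). It is untested here, and
it assumes that trans_charset itself succeeds without ICU and that every
codepoint in the string fits in ISO-8859-1.

```pir
.sub main :main
    # Start with a Unicode string, as in the original report.
    $S0 = unicode:"Hello world"

    # Look up the target charset and transcode the string to it.
    # Assumption: this step does not itself require ICU.
    $I0 = find_charset 'iso-8859-1'
    $S0 = trans_charset $S0, $I0

    # downcase is reported above to work on ascii and iso-8859-1 strings.
    $S1 = downcase $S0
    say $S1
.end
```

ISO-8859-1 only covers codepoints up to U+00FF, so this would not help for
strings that contain anything beyond that range.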

-- c
