On Sun, Apr 16, 2006 at 11:22:40AM -0700, Patrick R. Michaud wrote:

> This is a suggestion regarding double-quoted string literals
> in Parrot.  Currently double-quoted strings are always assumed
> to be ASCII unless prefixed by a different charset identifier
> such as 'unicode:' or 'iso-8859-1:'.  Unfortunately, this means
> that string literals like:
> 
>     $S1 = "He said, \xabHello\xbb"
>     $S2 = "3 \u2212 4 = \u207b 1"
> 
> are treated as ASCII strings even though they obviously contain
> codepoints outside of the ASCII range.  (The first results in a 
> 'malformed string' error when compiled, the second chops off the
> high-order bits of the \u sequence.)

IIRC having ASCII as the default was a deliberate design choice to avoid 
the confusion of "is it iso-8859-1 or is it utf-8" when encountering a
string literal with bytes outside the range 0-127.
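
To illustrate that ambiguity, here is a minimal sketch in Python rather than PIR (it only shows the decoding problem, not anything Parrot actually does). The variable names are made up for the example. The same bytes give different results depending on which charset the reader assumes:

```python
# Bytes of the first literal from the quoted message, as raw 8-bit data.
raw = b"He said, \xabHello\xbb"

# Read as ISO-8859-1, every byte maps to exactly one codepoint,
# so decoding always succeeds (0xAB and 0xBB become guillemets).
as_latin1 = raw.decode("iso-8859-1")

# Read as UTF-8, 0xAB is a continuation byte with no lead byte before it,
# so decoding fails.
try:
    as_utf8 = raw.decode("utf-8")
except UnicodeDecodeError:
    as_utf8 = None

# Read as ASCII, any byte above 127 is rejected outright.
try:
    as_ascii = raw.decode("ascii")
except UnicodeDecodeError:
    as_ascii = None
```

With an ASCII default there is no guess to get wrong: the literal is rejected unless the author names a charset explicitly.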

If so, then I assume that the behaviour of your second example is wrong - it
should also be a malformed string.
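
A rough sketch of the difference, again in Python and not Parrot's actual code. It assumes that "chops off the high-order bits" means keeping only the low 8 bits, which the report does not spell out, and both helper names are hypothetical:

```python
def truncate_to_byte(codepoint: int) -> int:
    """Assumed current behaviour: silently keep only the low 8 bits."""
    return codepoint & 0xFF


def check_ascii(text: str) -> str:
    """Suggested behaviour: reject any codepoint outside 0-127."""
    for ch in text:
        if ord(ch) > 127:
            raise ValueError(
                f"malformed string: U+{ord(ch):04X} is outside ASCII"
            )
    return text


# Under the truncation assumption, the two \u escapes from the second
# example turn into unrelated characters instead of raising an error.
low_2212 = truncate_to_byte(0x2212)
low_207b = truncate_to_byte(0x207B)

# The strict check refuses the same literal.
try:
    check_ascii("3 \u2212 4 = \u207b 1")
    rejected = False
except ValueError:
    rejected = True
```

Silent truncation changes the text without telling anyone, whereas the strict check fails in the same way the \x example already does.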

If PGE is always outputting UTF-8 literals, what stops it from always
prefixing every literal with "unicode:", even if it only uses Unicode
characters 0 to 127?

Nicholas Clark
