On Sun, Apr 16, 2006 at 11:22:40AM -0700, Patrick R. Michaud wrote:
> This is a suggestion regarding double-quoted string literals
> in Parrot. Currently double-quoted strings are always assumed
> to be ASCII unless prefixed by a different charset identifier
> such as 'unicode:' or 'iso-8859-1:'. Unfortunately, this means
> that string literals like:
>
>     $S1 = "He said, \xabHello\xbb"
>     $S2 = "3 \u2212 4 = \u207b 1"
>
> are treated as ASCII strings even though they obviously contain
> codepoints outside of the ASCII range. (The first results in a
> 'malformed string' error when compiled, the second chops off the
> high-order bits of the \u sequence.)

IIRC having ASCII as the default was a deliberate design choice to avoid the confusion of "is it iso-8859-1 or is it utf-8" when encountering a string literal with bytes outside the range 0-127. If so, then I assume that the behaviour of your second example is wrong - it should also be a malformed string.

If PGE is always outputting UTF-8 literals, what stops it from always prefixing every literal "unicode:", even if it only uses Unicode characters 0 to 127?

Nicholas Clark
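
For illustration only, here is a rough sketch of the "is it iso-8859-1 or is it utf-8" ambiguity and of the truncation in the second example. It is plain Python, not Parrot code, and it does not use any Parrot API. The helper name try_decode is hypothetical, and the assumption that "chops off the high-order bits" means keeping only the low 8 bits of each codepoint is a guess, not something stated in the thread.

```python
# Illustrative Python only; this does not model Parrot's actual string handling.

def try_decode(data, charset):
    """Return the decoded text, or None if the bytes are malformed in that charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None

# The bytes of the first literal, with \xab and \xbb as raw bytes.
raw = b"He said, \xabHello\xbb"

as_ascii = try_decode(raw, "ascii")        # bytes above 127 are not ASCII
as_latin1 = try_decode(raw, "iso-8859-1")  # 0xAB and 0xBB are guillemets here
as_utf8 = try_decode(raw, "utf-8")         # 0xAB cannot start a UTF-8 sequence

# The same guillemet encoded as UTF-8 is two bytes, and those two bytes are
# also valid (but different) text when read as iso-8859-1.
guillemet_utf8 = "\u00ab".encode("utf-8")
misread = guillemet_utf8.decode("iso-8859-1")

# Assumed reading of "chops off the high-order bits": keep the low 8 bits.
truncated = [ord(ch) & 0xFF for ch in "\u2212\u207b"]

print(as_ascii, as_latin1, as_utf8)
print(guillemet_utf8, repr(misread))
print([hex(n) for n in truncated])
```

Under these assumptions, the raw bytes are only unambiguous once a charset is named, which is the situation the ASCII default is meant to avoid, and the truncated \u values silently become unrelated 7-bit characters rather than producing an error.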