On Sun, Apr 16, 2006 at 11:36:10AM -0700, Nicholas Clark via RT wrote:
> IIRC having ASCII as the default was a deliberate design choice to avoid 
> the confusion of "is it iso-8859-1 or is it utf-8" when encountering a
> string literal with bytes outside the range 0-127.

Reasonable.  Essentially I'm thinking the rule could be that any
double-quoted literal with a \u sequence in it is utf-8,
anything with \x (and no \u) is iso-8859-1, and all else is
ASCII.
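
For instance, under that rule these bare literals would act as though
they carried the explicit charset prefixes PIR already accepts (I'm
assuming I have the prefix spellings right; "unicode:" is the only
one I've been emitting regularly):

     $S1 = ascii:"hello world"                # no \x or \u
     $S2 = iso-8859-1:"hello \xabworld\xbb"   # \x but no \u
     $S3 = unicode:"3 \u2212 4 = \u207b 1"    # contains \u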

I also proposed on IRC/#parrot that instead of automatically 
selecting the encoding, we have an "auto:" prefix for literals
(or pick a better name) that would do the selection for us:

     $S1 = auto:"hello world"             # ASCII
     $S2 = auto:"hello\nworld"            # ASCII
     $S3 = auto:"hello \xabworld\xbb"     # iso-8859-1 (or unicode)
     $S4 = auto:"3 \u2212 4 = \u207b 1"   # Unicode

Leo suggested doing it without "auto:" for the RFE.

> If so, then I assume that the behaviour of your second example is wrong - it
> should also be a malformed string.

On this I'm only reporting what my parrot is telling me.  :-)
If it should be a malformed string, we should have a test for
that (and I'll be glad to write it if this is the case).

> If PGE is always outputting UTF-8 literals, what stops it from always
> prefixing every literal "unicode:", even if it only uses Unicode characters
> 0 to 127?

The short answer is that some string operations on unicode strings
currently require ICU in order to work properly, even if the string 
values don't contain any codepoints above 255.  One such operation 
is "downcase", but there are others.

So, if PGE prefixes every literal as "unicode:" (and this is how
I originally had PGE) then systems w/o ICU inevitably fail with 
"no ICU library present" when certain string operations are attempted.  
Also, once introduced, unicode strings can easily spread 
throughout the system, since an operation (e.g., concat) involving 
a UTF-8 string and an ASCII string produces a UTF-8 result 
even if all of the codepoints of the string value are in the 
ASCII range.
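
So a single unicode literal is enough to taint later results; roughly:

     $S0 = unicode:"hello "
     $S1 = "world"                  # plain ASCII literal
     $S2 = concat $S0, $S1          # unicode result, all codepoints ASCII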

Thus far PGE has handled this by looking at the resulting
escaped literal and prefixing it with "unicode:" only if it 
will be needed -- i.e., if the escaped string has a '\x'
or a '\u' somewhere in it.  But now I'm starting to have to 
do this "check for unicode codepoint" in every PIR-emitting 
system I'm working with (PGE, APL, Perl 6, etc.), which is 
why I'm thinking that having PIR handle it directly would 
be a better choice.  (And if we decide not to let PIR handle
it, I'll create a library function for it.)
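
For the record, the library function would be something along the
lines of the sketch below -- the sub name is made up, and the
index-based check is just one way to spell it:

     .sub 'charset_for_literal'
         .param string lit
         $I0 = index lit, "\\u"
         if $I0 >= 0 goto is_uni
         $I0 = index lit, "\\x"
         if $I0 >= 0 goto is_latin1
         .return ("ascii:")            # no \x or \u anywhere
       is_latin1:
         .return ("iso-8859-1:")       # \x but no \u
       is_uni:
         .return ("unicode:")          # has a \u escape
     .end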

I also realized this past week that using 'unicode:' on
strings with \x (codepoints 128-255) may *still* be a bit 
too liberal -- the « french angles » will still cause 
"no ICU library present" errors, but would seemingly work
just fine if iso-8859-1 is attempted.  I'm not wanting
to block systems w/o ICU from working on Perl 6,
so falling back to iso-8859-1 in this case seems like the 
best of a bad situation.  (OTOH, there are some potential 
problems with it on output.)
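
To make that concrete, here's the contrast I have in mind on a parrot
built without ICU (again, exact spellings from memory):

     $S0 = iso-8859-1:"\xab french angles \xbb"
     $S0 = downcase $S0             # works without ICU
     $S1 = unicode:"\xab french angles \xbb"
     $S1 = downcase $S1             # "no ICU library present"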

Lastly, I suspect (and it's just a suspicion) that string
operations on ASCII and iso-8859-1 strings are likely to be
faster than their utf-8/unicode counterparts.  If this is
true, then the more strings that we can keep in ASCII,
the better off we are.  (And the vast majority of string
literals I seem to be generating in PIR contain only ASCII
characters.)

One other option is to make string operations such as
downcase a bit smarter and figure out that it's okay
to use the iso-8859-1 or ASCII algorithms/tables when
the strings involved don't have any codepoints above 255.

More comments and discussion welcome,

Pm
