On Sun, Apr 16, 2006 at 11:36:10AM -0700, Nicholas Clark via RT wrote: > IIRC having ASCII as the default was a deliberate design choice to avoid > the confusion of "is it iso-8859-1 or is it utf-8" when encountering a > string literal with bytes outside the range 0-127.
Reasonable. Essentially I'm thinking the rule could be that any double-quoted literal with a \u sequence in it is utf-8, anything with \x (and no \u) is iso-8859-1, and all else is ASCII. I also proposed on IRC/#parrot that instead of automatically selecting the encoding, we have an "auto:" prefix for literals (or pick a better name) that would do the selection for us: $S1 = auto:"hello world" # ASCII $S2 = auto:"hello\nworld" # ASCII $S3 = auto:"hello \xabworld\xbb" # iso-8859-1 (or unicode) $S4 = auto:"3 \u2212 4 = \u207b 1" # Unicode Leo suggested doing it without "auto:" for the RFE. > If so, then I assume that the behaviour of your second example is wrong - it > should also be a malformed string. On this I'm only reporting what my parrot is telling me. :-) If it should be a malformed string, we should have a test for that (and I'll be glad to write it if this is the case). > If PGE is always outputting UTF-8 literals, what stops it from always > prefixing every literal "unicode:", even if it only uses Unicode characters > 0 to 127? The short answer is that some string operations on unicode strings currently require ICU in order to work properly, even if the string values don't contain any codepoints above 255. One such operation is "downcase", but there are others. So, if PGE prefixes every literal as "unicode:" (and this is how I originally had PGE) then systems w/o ICU inevitably fail with "no ICU library present" when certain string operations are attempted. Also, once introduced unicode strings tend can easily spread throughout the system, since an operation (e.g., concat) involving a UTF-8 string and an ASCII string produces a UTF-8 result even if all of the codepoints of the string value are in the ASCII range. Thus far PGE has handled this by looking at the resulting escaped literal and prefixing it with "unicode:" only if it will be needed -- i.e., if the escaped string has a '\x' or a '\u' somewhere in it. But now I'm starting to have to do this "check for unicode codepoint" in every PIR-emitting system I'm working with (PGE, APL, Perl 6, etc.), which is why I'm thinking that having PIR handle it directly would be a better choice. (And if we decide not to let PIR handle it, I'll create a library function for it.) I also realized this past week that using 'unicode:' on strings with \x (codepoints 128-255) may *still* be a bit too liberal -- the « french angles » will still cause "no ICU library present" errors, but would seemingly work just fine if iso-8859-1 is attempted. I'm not wanting to block systems w/o ICU from working on Perl 6, so falling back to iso-8859-1 in this case seems like the best of a bad situation. (OTOH, there are some potential problems with it on output.) Lastly, I suspect (and it's just a suspicion) that string operations on ASCII and iso-8859-1 strings are likely to be faster than their utf-8/unicode counterparts. If this is true, then the more strings that we can keep in ASCII, the better off we are. (And the vast majority of strings literals I seem to be generating in PIR contain only ASCII characters.) One other option is to make string operations such as downcase a bit smarter and figure out that it's okay to use the iso-8859-1 or ASCII algorithms/tables when the strings involved don't have any codepoints above 255. More comments and discussion welcome, Pm