I don't want to argue per se (that doesn't do anyone any good), so if your mind is made up, that's cool... still, I think there's some value in exploring the options, so read on if you're so inclined.
On Wed, 2004-08-11 at 04:40, Dan Sugalski wrote:

> > Converting Unicode to non-Unicode character sets will be
> > lossless where possible, and will attempt to encode the name of
> > the character in ASCII characters into the target character set.
>
> Gack. No, I think this'd be a bad idea as the default behavior.

Well, OK, why not make an exception the default behavior then? Just reverse what I suggested from the default to the option. It's still mighty handy for a language (any Parrot-based language) to be able to render a meaningful string in any ASCII-capable encoding from any Unicode subset. I think the only problem would be in the realm of directionality of script, but I assume that all non-L-R scripts have some convention for injecting snippets of L-R, just as en-US injects R-L, easy as "Ù Ù Ù".

> What's right is up in the air -- I'm figuring we'll either throw an
> exception or substitute in a default character, but the full
> expansion's definitely way too much.

That's too bad, as:

    "This was converted from Ｕnicode"

becoming

    "This was converted from {FULLWIDTH LATIN CAPITAL LETTER U}nicode"

seems much more reasonable than choosing some poor ASCII character to act as the fallback. If someone does something stupid like converting a 5MB document of UTF-8-encoded Cyrillic into ASCII, then they're going to get a huge result, but that's no less useful than 3MB of text that looks like "**** ** **** ***-**. ***'* *****", I would think, and perhaps more useful for certain purposes (e.g. it could still be deciphered and/or re-assembled).

The other way to go would be some sort of standardized low-level notation to represent encoding and codepoint, such as:

    "This was converted from {U+FF35}nicode"

That's less readable, but arguably more reversible and/or precise. Certainly more easily detected automatically.
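To make the two fallback styles concrete, here's a small sketch in Python (the helper name "to_ascii" and its "style" switch are my own invention for illustration, not anything in Parrot):

```python
import unicodedata

def to_ascii(s, style="name"):
    # Hypothetical fallback conversion: pass ASCII characters through
    # unchanged; expand anything else as {CHARACTER NAME} or {U+XXXX}.
    out = []
    for ch in s:
        if ord(ch) < 128:
            out.append(ch)
        elif style == "name":
            # Fall back to the codepoint form if the character is unnamed.
            out.append("{%s}" % unicodedata.name(ch, "U+%04X" % ord(ch)))
        else:
            out.append("{U+%04X}" % ord(ch))
    return "".join(out)

print(to_ascii("This was converted from \uFF35nicode"))
# This was converted from {FULLWIDTH LATIN CAPITAL LETTER U}nicode
print(to_ascii("This was converted from \uFF35nicode", style="codepoint"))
# This was converted from {U+FF35}nicode
```

Either way the result stays pure ASCII and the original codepoints remain recoverable, which is the whole point over a one-character substitution.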
For example, the following Perl 5 code could reverse such a transformation:

    s{\{(.)\+([A-Fa-f\d]+)\}}{
        character(
            target_encoding  => $target_encoding,
            source_encoding  => abbrv_to_encoding($1),
            source_codepoint => hex($2),
        )
    }eg;

assuming, of course, a function "character" and a function "abbrv_to_encoding", which respectively attempt to generate a character in a target encoding based on a character in a source encoding, and return an encoding ID/name/object/whatever based on a one-character abbreviation.

It would be ideal if other tactics could be used, like the GB 2312-in-ASCII encoding described in RFC 1842. Of course, the above could be permuted that way:

    "This was converted from {G+~{<:Ky2;S{#,NpJ)l6HK!#~}}"

But that starts to get deeper into character set and encoding transformation than my head is capable of coping with at this stage (I'm really just learning about these topics). I fear I'm walking down a road that ends in my suggesting that every non-Unicode string have a MIME header, but rest assured that that's not my goal. I just wanted to suggest a useful alternative to throwing an exception on incompatible type conversion, especially for those client languages (e.g. m4) in which an exception will either have to be ignored or treated as fatal.

--
781-324-3772 · [EMAIL PROTECTED] · http://www.ajs.com/~ajs