I don't want to argue per se (that doesn't do anyone any good), so if
your mind is made up, that's cool... still, I think there's some value
in exploring the options, so read on if you're so inclined.

On Wed, 2004-08-11 at 04:40, Dan Sugalski wrote:

> >         Converting Unicode to non-Unicode character sets will be
> >         lossless where possible, and will attempt to encode the name of
> >         the character in ASCII characters into the target character set.
> 
> Gack. No, I think this'd be a bad idea as the default behavior. 

Well, OK, why not make the exception the default behavior, then? Just
reverse what I suggested: demote it from the default to an option. It's
still mighty handy for a language (any Parrot-based language) to be able
to render a meaningful string in any ASCII-capable encoding from any
Unicode subset.

I think the only problem would be in the realm of directionality of
script, but I assume that all non L-R scripts have some convention for
injecting snippets of L-R, just as en-US injects R-L, easy as "ا ب ت".
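
For what it's worth, Unicode has explicit controls for exactly this
kind of embedding. A tiny Perl 5 sketch (the codepoints are standard;
the variable names are mine):

        # Wrap a left-to-right snippet for embedding in right-to-left
        # text: U+202A is LEFT-TO-RIGHT EMBEDDING and U+202C is POP
        # DIRECTIONAL FORMATTING.
        my $embedded = "\x{202A}" . $ltr_snippet . "\x{202C}";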

> What's right is up in the air -- I'm figuring we'll either throw an 
> exception or substitute in a default character, but the full 
> expansion's definitely way too much.

That's too bad, as:

        "This was converted from ïnicode"

becoming

        "This was converted from {FULLWIDTH LATIN CAPITAL LETTER U}nicode"

seems much more reasonable than choosing some poor ASCII character to
act as the fallback.
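
In fact, the expansion itself is nearly a one-liner in Perl 5 today.
A sketch, assuming the core charnames module (the name
"expand_to_ascii" is just mine):

        use charnames ();       # for charnames::viacode()

        # Expand each non-ASCII character into {CHARACTER NAME},
        # falling back to {U+XXXX} for unnamed codepoints.
        sub expand_to_ascii {
                my ($string) = @_;
                $string =~ s{([^\x00-\x7F])}{
                        my $name = charnames::viacode(ord $1);
                        defined $name ? "{$name}"
                                      : sprintf "{U+%04X}", ord $1;
                }eg;
                return $string;
        }

That turns "This was converted from \x{FF35}nicode" into exactly the
{FULLWIDTH LATIN CAPITAL LETTER U} form above.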

If someone does something stupid like converting a 5MB UTF-8-encoded
Cyrillic document into ASCII, then they're going to get a huge result,
but that's no less useful than 3MB of text that looks like "**** ** ****
***-**. ***'* *****", I would think, and perhaps more useful for certain
purposes (e.g. it could still be deciphered and/or re-assembled).
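
The asterisk soup, by the way, is just what a fixed-fallback
substitution amounts to. A one-line sketch, assuming '*' as the
fallback character:

        # Lossy fallback: smash every non-ASCII character to '*'
        s/[^\x00-\x7F]/*/g;

No size explosion, but no way back, either.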

The other way to go would be some sort of standardized low-level
notation to represent encoding and codepoint such as:

        "This was converted from {U+FF35}nicode"

That's less readable, but arguably more reversible and/or precise, and
certainly easier to detect automatically. For example, the following
Perl 5 code could reverse such a transformation:

        # Reverse the {U+XXXX} escapes back into real characters
        s{\{(.)\+([A-Fa-f\d]+)\}}{
                character(target_encoding  => $target_encoding,
                          source_encoding  => abbrv_to_encoding($1),
                          source_codepoint => hex($2))
        }eg;

assuming, of course, a function "character" that attempts to generate a
character in a target encoding based on a character in a source
encoding, and a function "abbrv_to_encoding" that returns an encoding
ID/name/object/whatever based on a one-character abbreviation.
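
For completeness, the forward direction could be just as mechanical.
A sketch, assuming "U" is the tag for Unicode codepoints:

        # Escape each non-ASCII character as {U+XXXX}
        s{([^\x00-\x7F])}{ sprintf "{U+%04X}", ord $1 }eg;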

It would be ideal if other tactics could be used as well, like the
GB 2312-in-ASCII encoding described in RFC 1842. Of course, the above
notation could be extended that way:

        "This was converted from {G+~{<:Ky2;S{#,NpJ)l6HK!#~}}"

But that starts to get deeper into character set and encoding
transformation than my head is capable of coping with at this stage (I'm
really just learning about these topics). I fear I'm walking down a road
that ends in my suggesting that every non-Unicode string has a MIME
header, but rest assured that that's not my goal. I just wanted to
suggest a useful alternative to throwing an exception on incompatible
type conversion, especially for those client languages (e.g. m4) in
which an exception will either have to be ignored or treated as fatal.

-- 
781-324-3772
[EMAIL PROTECTED]
http://www.ajs.com/~ajs
