Re: String representation

Jarkko Hietaniemi Mon, 18 Dec 2000 10:12:42 -0800
On Mon, Dec 18, 2000 at 03:21:05PM +0000, Nick Ing-Simmons wrote:
> Simon Cozens <[EMAIL PROTECTED]> writes:
> >
> >So, before we start even thinking about what we need, it's time to look at the
> >vexed question of string representation. How do we do Unicode without getting
> >into the horrendous non-Latin1 cockups we're seeing on p5p right now? 
> 
> Well - my theorist's answer is that everything is Unicode - like Java.

That would be nice, yes.

> As I pointed out on p5p even EBCDIC machines can use that model - but 
> the downside is that ord('A') == 65, which breaks backward compatibility 
> with EBCDIC scripts. 

Maybe we need $ENV{PERL_ENCODING} to control ord() and chr(), too?
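
To make the ord()/chr() concern concrete, here is a minimal sketch in Python (used purely for illustration, not as a proposal for Perl). It assumes Python's built-in cp037 codec as a stand-in for an EBCDIC code page; native_ord and native_chr are hypothetical helpers showing what an encoding-controlled ord()/chr() might do.

```python
# Illustration only: cp037 stands in for an EBCDIC code page such as IBM1047.
UNICODE_A = ord("A")               # code point under the "everything is Unicode" model
EBCDIC_A = "A".encode("cp037")[0]  # byte value an EBCDIC script would expect


def native_ord(ch: str, encoding: str) -> int:
    """Hypothetical: ord() as seen by a script written for a single-byte encoding."""
    data = ch.encode(encoding)
    if len(data) != 1:
        raise ValueError(f"{ch!r} is not a single byte in {encoding}")
    return data[0]


def native_chr(code: int, encoding: str) -> str:
    """Hypothetical: chr() for a single-byte encoding, the inverse of native_ord."""
    return bytes([code]).decode(encoding)


if __name__ == "__main__":
    print(UNICODE_A, EBCDIC_A)             # the two values differ
    print(native_ord("A", "cp037"))        # EBCDIC view
    print(native_chr(EBCDIC_A, "cp037"))   # round-trips back to "A"
```

A script that hard-codes the EBCDIC value of 'A' keeps working only if ord() goes through something like native_ord; under a pure Unicode model it sees 65 instead.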

> Tagging a string with a repertoire and encoding is horrible - you are aware 

Indeed.  We have had a very rough ride trying to get just two
encodings to play well together; trying to support more simultaneously
would be pure combinatorial masochism.  I say we should strive for
converting everything to/from one agreed-upon internal encoding.  Yes,
this is somewhat counter to the idea of 'no preferred internal encoding'.
After pondering the issue I have come around to "Oh, yes, there
should be one preferred internal encoding"; otherwise we condemn
ourselves to much gnashing of teeth.  Off-hand, I think the only case
where the conversion to the One True Encoding shouldn't be done is
when it would lose information.  What's the OTE, then?  Well, UTF-16 or
UTF-32, I guess.  The redeeming features of UTF-8, that it is 1:1 for
ASCII and also compact for ASCII, frankly are getting rather thin in
my eyes.
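
A rough sketch of the "convert everything to/from one internal encoding" idea, again in Python for illustration only. Python's own str plays the role of the single internal form here; to_internal, from_internal and encoded_sizes are hypothetical names, and the sample strings are my own assumptions rather than anything measured on real data.

```python
def to_internal(raw: bytes, source_encoding: str) -> str:
    """Decode external bytes into the one internal representation."""
    return raw.decode(source_encoding)


def from_internal(text: str, target_encoding: str) -> bytes:
    """Encode on output; raises UnicodeEncodeError if the target would lose information."""
    return text.encode(target_encoding)


def encoded_sizes(text: str) -> dict:
    """Byte counts of the same text in the three candidate encodings (no BOM)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    # The same letter arriving from two different external encodings
    # compares equal once both are in the internal form.
    print(to_internal(b"\xc1", "cp037") == to_internal(b"A", "latin-1"))

    # UTF-8 is compact for ASCII, less so elsewhere.
    for sample in ("hello", "p\u00e4iv\u00e4\u00e4", "\u65e5\u672c\u8a9e"):
        print(sample, encoded_sizes(sample))
```

The size comparison shows why UTF-8's compactness argument is strongest for ASCII-only text: for the CJK sample, UTF-16 comes out smaller than UTF-8.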

> The only sane compromise I can imagine is close to what we have at the 
> moment with maybe a few extra special cases in the "flags" bits:
>    ASCII only           (0..7f)
>    Native-single-byte   (iso8859-x, IBM1047)
>    wchar_t 
>    UTF-8
>    UNICODE

Maybe.

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen