On Mon, Oct 29, 2001 at 11:20:47PM +0000, Tom Hughes wrote:
> > 2) But either can support converting directly if it wants.
> The danger is that everybody tries to be clever and support direct
> conversion to and from as many other character sets as possible, which
> leads to lots of duplication.

Yeah. But that's a convention thing, I think. I also think that most
people won't go to the bother of writing conversion functions that they
don't have to. What we need to worry about is both sides of a pair --
say, big5 and shiftjis -- each writing the same conversion. And it
shouldn't come up all that much, because Unicode is /supposed to be/
lossless for most things.
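To make the convention concrete, here's a rough sketch of the dispatch
I have in mind, in C. All of these names (struct charset, transcode,
and friends) are made up for illustration -- the point is just that
direct converters are optional and the Unicode pivot is the default
everybody gets:

    #include <stddef.h>

    typedef struct string STRING;
    typedef STRING *(*convert_fn)(STRING *src);

    /* Optional direct converter: target charset -> function. */
    struct direct_conversion {
        int        to_charset;   /* target charset number */
        convert_fn convert;
    };

    struct charset {
        int                       number;
        struct direct_conversion *direct;   /* NULL-terminated; may be NULL */
        STRING *(*to_unicode)(STRING *src);     /* mandatory */
        STRING *(*from_unicode)(STRING *src);   /* mandatory */
    };

    STRING *
    transcode(STRING *src, struct charset *from, struct charset *to)
    {
        struct direct_conversion *d;

        /* Use a direct conversion if the source charset bothered
         * to provide one for this target. */
        for (d = from->direct; d && d->convert; d++)
            if (d->to_charset == to->number)
                return d->convert(src);

        /* Otherwise pivot through Unicode -- the lossless default. */
        return to->from_unicode(from->to_unicode(src));
    }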
> I have already been thinking about this although it does get more
> complicated as you have to consider the encoding as well - if you
> have a single byte encoded ASCII string then transcoding to a single
> byte encoded Latin-1 string is a no-op, but that may not be true for
> other encodings if such a thing makes sense for those character types.

Hm. With all the encodings I can think of (which is a rather limited
set -- the UTFs), you can scan for units (i.e. ints of the proper size)
above 0x7f, and if you don't find any, the string is pure 7-bit and you
can just change the charset marker without doing any work (see the PS
for a sketch). In any case, it's up to the encoding to tell us whether
we've got a pure 7-bit string. If that's complicated for it, it can
just always return FALSE.

> I suspect that the encode and decode methods in the encoding vtable
> are enough for doing chr/ord aren't they?

Hmm... come to think of it, yes. chr will always create a utf32-encoded
string with the given charset number (or unicode for the two-arg
version); ord will return the codepoint within the current charset (see
the PPS). (This, BTW, means that only encodings that feel like it have
to provide either, but all encodings must be able to convert to utf32.)
Powers-that-be (I'm looking at you, Dan), is that good?

 -=- James Mastros
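PS: Here's roughly what the 0x7f scan looks like for a 32-bit-unit
encoding, in C. The function name and the flat buffer layout are
invented for illustration:

    #include <stdint.h>
    #include <stddef.h>

    /* Returns 1 if every code unit fits in 7 bits, so the string can
     * be relabeled (e.g. as ASCII) without re-encoding anything. */
    static int
    is_7bit_utf32(const uint32_t *units, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++)
            if (units[i] > 0x7f)
                return 0;   /* unit outside the 7-bit range */
        return 1;           /* pure 7-bit: just change the marker */
    }

An encoding that can't answer this cheaply just skips the scan and
always returns FALSE, as above.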
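PPS: And the chr/ord idea, built on nothing but the encoding vtable's
decode entry. STRING, ENCODING, string_make, and utf32_encoding are all
invented here -- this is a sketch of the shape, not real API:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct string {
        const struct encoding *enc;      /* encoding vtable */
        int                    charset;  /* charset number */
        void                  *buf;      /* code units */
        size_t                 len;      /* length in code units */
    } STRING;

    typedef struct encoding {
        /* decode the codepoint at index idx */
        uint32_t (*decode)(const STRING *s, size_t idx);
    } ENCODING;

    extern STRING *string_make(const ENCODING *enc, int charset,
                               size_t len);
    extern const ENCODING *utf32_encoding;

    /* chr: always builds a one-codepoint, utf32-encoded string in the
     * given charset, so no individual encoding has to provide it. */
    STRING *
    string_chr(uint32_t codepoint, int charset)
    {
        STRING *s = string_make(utf32_encoding, charset, 1);
        ((uint32_t *)s->buf)[0] = codepoint;
        return s;
    }

    /* ord: just decode -- the codepoint comes back in the string's
     * own charset. */
    uint32_t
    string_ord(const STRING *s, size_t idx)
    {
        return s->enc->decode(s, idx);
    }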