At 07:42 PM 1/2/2002 +0000, Tom Hughes wrote:
>In message <20020102054642$[EMAIL PROTECTED]>
>          "David & Lisa Jacobs" <[EMAIL PROTECTED]> wrote:
>
> > Here is a short list of TODOs that I came up with for STRINGs.  First, do
> > these look good to people?  And second, what is the preferred method for
> > keeping track of these (patch to the TODO file, entries in bug tracking
> > system, mailing list, etc.
> >
> > * Add set ops that are encoding aware (e.g., set S0, "something",
> > "unicode", "utf-8")?
>
>You can already have Unicode constants by prefixing the string
>with a U character. I seem to recall Dan saying that he didn't want
>to allow constants in arbitrary encodings but instead would prefer
>just to have native and unicode.

I'm of two minds on arbitrary encoding constants. Practically speaking,
though, we ought to have some way of doing them, though that'll make the
bytecode segments a bit more complex.

> > * Add transcoding ops (this might be a specific case of the previous e.g.,
> > set S0, S1, "unicode", "utf-16")
>
>I'm not sure whether this is needed. I think the idea is that in
>general transcoding will happen at I/O time, presumably by pushing
>a transcoding module on the I/O stack.

I can see the need to explicitly transcode strings. We should have a
transcode op for this, I think.

> > * Move like encoded string comparison into encodings (i.e., the STRING
> > comparison function gets the strings into the same encoding and then calls
> > out to the encodings comparison function - This will allow each encoding to
> > optimize its comparison.
>
>The problem with this is that string comparison depends on both the
>encoding and the character set so in general you can't do this. If
>the character set was the same for both strings then you could do so
>though.
>
>What I did think about was having a flag on each encoding that
>specified whether or not comparisons for that encoding could be
>done using memcmp() when the character sets were the same. That
>is true for things like the single byte encoding, but probably
>not for the unicode encodings due to canonicalisation issues.

It's also not true for Unicode because there are several different ways to
sort Unicode strings, and most of them don't have anything to do with the
order of the characters in the character set. (Granted you don't need this
for eq/ne, though there are normalization issues there, but you do for cmp
style comparisons)

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
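
As a rough illustration of the per-encoding flag Tom describes above, here is
a minimal C sketch. Every name in it (Encoding, Str, memcmp_safe, charset_id,
str_compare_fast) is hypothetical and not taken from the Parrot sources. It
only shows a fast path that uses memcmp() when both strings share an encoding
and a character set and the encoding is marked as byte-comparable. Anything
else, Unicode included, would have to fall through to a slower,
collation-aware comparison, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding descriptor; not Parrot's actual structure. */
typedef struct {
    const char *name;
    int memcmp_safe;  /* nonzero if byte order matches character order */
} Encoding;

/* Hypothetical string header, reduced to what the comparison needs. */
typedef struct {
    const unsigned char *buf;
    size_t len;                /* length in bytes */
    const Encoding *encoding;
    int charset_id;            /* assumed character set identifier */
} Str;

/*
 * Try the memcmp() fast path.
 * Returns 1 and stores -1, 0 or 1 in *out when the fast path applies.
 * Returns 0 (leaving *out untouched) when the caller must use a slower,
 * encoding- and character-set-aware comparison instead.
 */
int str_compare_fast(const Str *a, const Str *b, int *out)
{
    assert(a != NULL && b != NULL && out != NULL);

    if (a->encoding != b->encoding ||
        a->charset_id != b->charset_id ||
        !a->encoding->memcmp_safe) {
        return 0;
    }

    size_t n = a->len < b->len ? a->len : b->len;
    int c = memcmp(a->buf, b->buf, n);

    /* On a common prefix, the shorter string sorts first. */
    if (c == 0)
        c = (a->len > b->len) - (a->len < b->len);

    *out = (c > 0) - (c < 0);
    return 1;
}
```

A caller would check the return value first and fall back to the general
comparison whenever it is 0, which under this sketch is always the case for
an encoding whose memcmp_safe flag is clear.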