Re: TODOs for STRINGs

Dan Sugalski Fri, 04 Jan 2002 08:26:06 -0800

At 07:42 PM 1/2/2002 +0000, Tom Hughes wrote:
>In message <20020102054642$[EMAIL PROTECTED]>
>           "David & Lisa Jacobs" <[EMAIL PROTECTED]> wrote:
>
> > Here is a short list of TODOs that I came up with for STRINGs.  First, do
> > these look good to people?  And second, what is the preferred method for
> > keeping track of these (patch to the TODO file, entries in bug tracking
> > system, mailing list, etc.)?
> >
> > * Add set ops that are encoding aware (e.g., set S0, "something",
> > "unicode", "utf-8")?
>
>You can already have Unicode constants by prefixing the string
>with a U character. I seem to recall Dan saying that he didn't want
>to allow constants in arbitrary encodings but instead would prefer
>just to have native and unicode.


I'm of two minds on arbitrary encoding constants. Practically speaking,
though, we ought to have some way of doing them, even if that'll make the
bytecode segments a bit more complex.

> > * Add transcoding ops (this might be a specific case of the previous, e.g.,
> > set S0, S1, "unicode", "utf-16")
>
>I'm not sure whether this is needed. I think the idea is that in
>general transcoding will happen at I/O time, presumably by pushing
>a transcoding module on the I/O stack.

I can see the need to explicitly transcode strings. We should have a 
transcode op for this, I think.
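
To make the idea concrete, here is a rough sketch of what an explicit transcode amounts to, written in Python rather than as a Parrot op. The function name transcode and its argument order are assumptions for illustration only, not a proposed op signature: decode the bytes from the source encoding, then re-encode them in the target encoding.

```python
def transcode(data: bytes, src_encoding: str, dst_encoding: str) -> bytes:
    """Hypothetical helper: decode from src_encoding, re-encode as dst_encoding."""
    text = data.decode(src_encoding)   # bytes -> abstract characters
    return text.encode(dst_encoding)   # abstract characters -> bytes


# The same five characters occupy a different number of bytes per encoding.
utf8_bytes = "na\u00efve".encode("utf-8")
utf16_bytes = transcode(utf8_bytes, "utf-8", "utf-16-le")
round_trip = transcode(utf16_bytes, "utf-16-le", "utf-8")
```

The round trip back to UTF-8 should give the original bytes, since both encodings can represent every character in the string.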

> > * Move like-encoded string comparison into encodings (i.e., the STRING
> > comparison function gets the strings into the same encoding and then calls
> > out to the encoding's comparison function). This will allow each encoding to
> > optimize its comparison.
>
>The problem with this is that string comparison depends on both the
>encoding and the character set so in general you can't do this. If
>the character set was the same for both strings then you could do so
>though.
>
>What I did think about was having a flag on each encoding that
>specified whether or not comparisons for that encoding could be
>done using memcmp() when the character sets were the same. That
>is true for things like the single byte encoding, but probably
>not for the unicode encodings due to canonicalisation issues.

It's also not true for Unicode because there are several different ways to
sort Unicode strings, and most of them don't have anything to do with the
order of the characters in the character set. (Granted, you don't need this
for eq/ne, though there are normalization issues there, but you do for
cmp-style comparisons.)
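
Both problems can be shown with a small Python sketch (an illustration only, not Parrot code). The helper names bytes_equal, normalized_equal and rough_key are made up for this example, and rough_key is a crude stand-in for a real collation algorithm, not an implementation of one.

```python
import unicodedata


def bytes_equal(a: str, b: str, encoding: str = "utf-8") -> bool:
    """memcmp()-style equality: compare the encoded bytes directly."""
    return a.encode(encoding) == b.encode(encoding)


def normalized_equal(a: str, b: str) -> bool:
    """Equality after putting both strings into the same form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def rough_key(s: str) -> str:
    """Crude collation stand-in (assumption): drop accents, fold case."""
    split = unicodedata.normalize("NFD", s)
    base = "".join(c for c in split if not unicodedata.combining(c))
    return base.casefold()


composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

same_bytes = bytes_equal(composed, decomposed)            # byte comparison
same_after_nfc = normalized_equal(composed, decomposed)   # normalized comparison

words = ["zebra", "Apple", "\u00e9clair", "apple"]
code_point_order = sorted(words)             # order of the character set
rough_order = sorted(words, key=rough_key)   # closer to dictionary order
```

The byte comparison treats the two spellings of the accented "e" as different while the normalized one treats them as equal, and the code point sort puts the accented word after "zebra" where the rough key puts it between "apple" and "zebra".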

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
