Re: [Q1] (Re: The strings design document)

Dan Sugalski Tue, 27 Apr 2004 10:29:32 -0700

At 9:40 AM -0700 4/27/04, Jeff Clites wrote:

On Apr 23, 2004, at 2:43 PM, Dan Sugalski wrote:

CHARACTER SET - Contains meta-information about code points. This
                includes both the meaning of individual code points
                (65 is capital A, 776 is a combining diaeresis) as
                well as a set of categorizations of code
                points (alpha, numeric, whitespace, punctuation, and
                so on), and a sorting order.

I'm assuming here that you are referring to things like Shift-JIS and ISO-8859-1 as character sets, right?

Sort of. Shift-JIS is actually both a character set and an encoding, which makes life a bit confusing if not downright annoying.

Questions (based on that assumption):
[*Note: assume everywhere below that the strings in question are not explicitly language-tagged (or, are tagged with "Dunno"--however it's supposed to work).]

1) ISO-8859-1 is used to represent text in several different languages, including German and Swedish. German and Swedish differ in their sort order, even for things they have in common. (For example, ö (o-with-diaeresis) is considered a separate letter in Swedish, but is just an accented "o" in German.) So (assuming my strings aren't explicitly language-tagged, or are tagged with "Dunno"), what sort order does ISO-8859-1 define? I'm not sure whether the national standards themselves actually define a sort order, so are we going to define one for every "character set"? In addition, many languages can be represented in several different "character sets", so that seems to mean that the sort order for "öut" v. "out" will vary, depending on the "character set" used for those strings?


That's possible, yes.

Each character set has a default sort ordering (amongst other things), which will be used in the absence of overriding data.
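As an illustration only (not Parrot code), here is a minimal Python sketch of a per-character-set default ordering that a caller can override. The names `DEFAULT_KEYS`, `german_key`, `swedish_key` and `sort_strings` are hypothetical, and picking the German-style ordering as the ISO-8859-1 default is an arbitrary assumption made for the example; the thread does not say what the default would be.

```python
# Hypothetical sketch; none of these names come from the design document.

def german_key(s):
    # German dictionary ordering treats o-with-diaeresis (U+00F6) as plain "o".
    return s.lower().replace("\u00f6", "o")

# Swedish alphabet: a-z followed by a-ring, a-diaeresis, o-diaeresis.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(s):
    # Swedish treats these three as separate letters sorting after "z".
    # Raises ValueError for characters outside the alphabet (sketch only).
    return [SWEDISH_ALPHABET.index(c) for c in s.lower()]

# Assumed default ordering per character set (arbitrary choice here).
DEFAULT_KEYS = {"iso-8859-1": german_key}

def sort_strings(strings, charset, key=None):
    # Use the character set's default unless overriding data is supplied.
    return sorted(strings, key=key or DEFAULT_KEYS[charset])

words = ["zon", "\u00f6l", "ost"]
default_order = sort_strings(words, "iso-8859-1")
swedish_order = sort_strings(words, "iso-8859-1", key=swedish_key)
```

With the assumed default the o-diaeresis word sorts first; with the Swedish override it sorts last, which is the discrepancy raised in question 1.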

2) In light of the above, how do you sort an array of strings, assuming they're not all in the same "character set"?

You don't. Cross-set comparisons aren't valid--either the strings get promoted to a common set or an exception is thrown. Throwing an exception will be the default.
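A rough Python sketch of that rule, under stated assumptions: `TaggedString`, `CharsetMismatch` and `compare` are hypothetical names, and "promotion" is reduced to relabelling the character set because Python strings are already Unicode. A real implementation would transcode the data.

```python
class CharsetMismatch(Exception):
    """Raised when two strings from different character sets are compared."""

class TaggedString:
    def __init__(self, text, charset):
        self.text = text
        self.charset = charset

def compare(a, b, promote_to=None):
    """Return -1, 0 or 1. Throws on a cross-set comparison by default."""
    if a.charset != b.charset:
        if promote_to is None:
            raise CharsetMismatch(
                f"cannot compare {a.charset} with {b.charset}")
        # Stand-in for real promotion: work on relabelled copies.
        a = TaggedString(a.text, promote_to)
        b = TaggedString(b.text, promote_to)
    return (a.text > b.text) - (a.text < b.text)

latin = TaggedString("out", "iso-8859-1")
sjis = TaggedString("out", "shift-jis")

try:
    compare(latin, sjis)
    raised = False
except CharsetMismatch:
    raised = True

promoted_result = compare(latin, sjis, promote_to="unicode")
```

Making the exception the default keeps the cost and the semantic change of promotion visible to the caller, who has to opt in with `promote_to`.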

3) If the answer to (2) is "you must upgrade them all to UTF-8", then that means that the sort order for an array might totally change when you add one new member, right? If the answer is, "for a given pair, when you compare them during sorting, only upgrade if their character sets don't match", then you open the door to non-convergent sorting (ie, the sort might never finish).

Yep, that is a potential problem. The likely case, though, is that adding a string of a different type (character set or language) makes sorting impossible and pitches an exception instead.
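To make the concern concrete, here is a small Python sketch of the "only upgrade when the pair's character sets differ" policy. Everything in it is hypothetical: set "X" is an invented character set whose ordering treats o-with-diaeresis as "o", and mixed pairs fall back to plain code point order. It only shows that the resulting ordering can be cyclic; whether a particular sort then fails to terminate or merely returns an arbitrary order depends on the sorting algorithm.

```python
def x_key(s):
    # Hypothetical set "X": o-with-diaeresis (U+00F6) sorts as plain "o".
    return s.replace("\u00f6", "o")

def pairwise_less(a, b):
    (text_a, set_a), (text_b, set_b) = a, b
    if set_a == set_b == "X":
        # Same set: use that set's own ordering.
        return x_key(text_a) < x_key(text_b)
    # Different sets: "upgrade" both and compare by code point.
    return text_a < text_b

a = ("\u00f6l", "X")
b = ("ost", "X")
c = ("p", "unicode")

# a < b (ordering of X), b < c and c < a (code point order): a cycle.
cycle = pairwise_less(a, b) and pairwise_less(b, c) and pairwise_less(c, a)
```

No consistent sorted order exists for `a`, `b` and `c` under this comparison, which is why throwing an exception on mixed input is the safer default.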

My worry here is that if the semantics of the Latin Capital Letter A ("A"), for example (or pick any other character), are allowed to differ between different "character sets", then we'll have problems for any binary string operation.

I've not really gotten into binary string operations. In general, cross-type operations will either throw exceptions or force an upgrade to a compatible character set. Upgrades will (or at least should) be sticky, so if you throw, say, a Unicode string into an array full of Latin-1 strings, by the time you're done sorting everything'll be promoted to Unicode and worst case you'll have some ringing as the conversion propagates through.

I may, though, be completely deluded about that one.
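One possible reading of "sticky", sketched in Python with hypothetical names (`TaggedString`, `sticky_less`): a cross-set comparison promotes the operands in place, so the promotion outlives the comparison. As in any Python sketch, the promotion is only a relabelling here, since Python strings are already Unicode.

```python
class TaggedString:
    def __init__(self, text, charset):
        self.text = text
        self.charset = charset

def sticky_less(a, b):
    if a.charset != b.charset:
        # Promote both operands in place; the change persists afterwards.
        a.charset = b.charset = "unicode"
    return a.text < b.text

latin = TaggedString("abc", "latin-1")
other_latin = TaggedString("abd", "latin-1")
uni = TaggedString("abe", "unicode")

same_set = sticky_less(latin, other_latin)   # no promotion needed
mixed = sticky_less(latin, uni)              # latin is promoted and stays so
```

After the mixed comparison `latin` is tagged Unicode, so its next comparison against `other_latin` promotes that string too. That is the propagation described above; which elements end up promoted during a real sort depends on which pairs the algorithm happens to compare.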
--
                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
