On Apr 27, 2004, at 10:25 AM, Dan Sugalski wrote:

At 9:40 AM -0700 4/27/04, Jeff Clites wrote:
On Apr 23, 2004, at 2:43 PM, Dan Sugalski wrote:

CHARACTER SET - Contains meta-information about code points. This
                includes the meaning of individual code points
                (65 is capital A, 776 is a combining diaeresis), a
                set of categorizations of code points (alpha,
                numeric, whitespace, punctuation, and so on), and a
                sorting order.

I'm assuming here that you are referring to things like Shift-JIS and ISO-8859-1 as character sets, right?

Sort of. Shift-JIS is actually both a character set and an encoding, which makes life a bit confusing if not downright annoying.

I think you're basically forcing this concept onto national standards which lack it. I don't think that most of the national standards actually define the semantics of the characters they encode (categorizations, case mapping, sort order). And although they assign byte sequences to represent their characters, I'm not sure they actually present this in terms of assigning integers to them--code points as distinct from byte sequences.


So it sounds like we are going to make up a set of semantics, individually, for each character set which doesn't explicitly define its own (which, I think, is most of them). So, we have two choices: (1) do this arbitrarily and in ways which make different character sets/encodings actively conflict, or (2) come up with an assignment of semantics which makes them all fit together nicely, so that (for instance) the letter "A" comes out as having lowercase version "a", isAlpha, isNonNumeric, isHexDigit, isNotWhitespace, isNotPunctuation, etc., for all of the character sets. Well, option (2) is what the Unicode Consortium spent years doing--coming up with a comprehensive list of the semantics and categorizations of every character represented in every major character set/standard.
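
To make that concrete, here's a quick Python sketch (standard library only; the particular encodings are just examples) of what option (2) buys you: the byte 0x41, decoded under several different character sets, yields the same character with one consistent set of semantics.

    import unicodedata

    for encoding in ("ascii", "latin-1", "shift_jis", "utf-8"):
        ch = b"\x41".decode(encoding)   # always the letter "A"
        assert ch == "A"
        assert ch.lower() == "a"        # one case mapping
        assert ch.isalpha()             # one categorization
        print(encoding, unicodedata.category(ch))  # 'Lu' every time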

The bottom line is that I don't think that anyone ever intended that the letter "A" have different semantics in each and every character set/encoding. In fact, they're all trying to provide potentially different ways to represent _the_same_ character. (I'm using "A" here because it's easy to represent in an email--other character choices might be more illustrative.) And the main task the Unicode Consortium carried out was to reconcile all of these. Forget about the encodings they've defined (UTF-8/16/32, etc.)--they're incidental. The important thing they did was not to define yet another character set--they created the logical union of all of the others. They figured out where there was overlap, where they agreed, where there were inconsistencies, and dealt with them.

You've got to tear yourself away from this byte-centric view. It's the wrong mindset. Strings represent text. Text is made of characters. Characters are abstract things--the Platonic forms of letters, numbers, punctuation, etc. All of these different character sets/encodings are trying to digitally represent the same things--not to pick out wholly separate notions. The letter A is the letter A is the letter A. When I type some text into my text editor and save it, I get a popup menu of choices for what encoding to use. My choice of UTF-16 v. Shift-JIS v. Latin-1 is inconsequential--those are just different ways to represent _the_same_ text. All that matters is that when something, later, reads that file, it has a way to know which choice I made, so that it can decode those bytes and get to the text I saved.
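
A tiny sketch of the point (Python again; I'm using plain "ABC" so that every encoding in the list can represent it): the on-disk bytes differ per encoding, but the text round-trips identically.

    text = "ABC"
    for enc in ("utf-16", "shift_jis", "latin-1"):
        stored = text.encode(enc)          # the stored bytes differ...
        assert stored.decode(enc) == text  # ...but the text is the same
        print(enc, stored)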

So no matter how we choose to represent things in memory, the semantics can't depend on what on-disk representation I chose--it's supposed to be _the_same_ text.

And frankly, it wouldn't take long to write a text editor which lets you sort and do case mapping but doesn't let you save to a file. In a case like that (no IO), the notion of a character set or encoding need never come into play. But I've still got text that I'm programmatically manipulating. Encoding only comes into play during IO (or the preparation for IO).
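
For instance (a minimal Python sketch): sorting and case mapping operate on text alone, and an encoding appears only if and when the result is written out.

    words = ["pear", "Apple", "banana"]
    print(sorted(words, key=str.casefold))  # ['Apple', 'banana', 'pear']
    print([w.upper() for w in words])       # no charset or encoding in sight
    data = "\n".join(words).encode("utf-8") # encoding enters only at the IO boundary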

2) In light of the above, how do you sort an array of strings, assuming they're not all in the same "character set"?

You don't. Cross-set comparisons aren't valid--either the strings get promoted to a common set or an exception is thrown. Throwing an exception will be the default.
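
If I follow, the rule would look something like this (a Python sketch; the charset tags, the promotion table, and the autopromote flag are all invented for illustration--this isn't anything Parrot defines). Note the default matches what you describe: throw unless promotion is explicitly allowed.

    PROMOTABLE = {frozenset(("latin-1", "unicode")): "unicode"}

    def compare(a, a_set, b, b_set, autopromote=False):
        if a_set != b_set:
            common = PROMOTABLE.get(frozenset((a_set, b_set)))
            if common is None or not autopromote:
                raise TypeError(f"can't compare {a_set} with {b_set}")
            a_set = b_set = common  # promote both operands to the common set
        return (a > b) - (a < b)    # now a single-charset comparison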


3) If the answer to (2) is "you must upgrade them all to UTF-8", then that means that the sort order for an array might totally change when you add one new member, right? If the answer is "for a given pair, when you compare them during sorting, only upgrade if their character sets don't match", then you open the door to non-convergent sorting (i.e., the sort might never finish).

Yep, that is a potential problem. The likely case, though, is that adding a string of a different type (character set or language) makes sorting impossible and pitches an exception instead.

Just throwing exceptions all of the time doesn't seem to be the most useful thing to do. We can do semantically better.


My worry here is that if the semantics of the Latin Capital Letter A ("A"), for example (or pick any other character), are allowed to differ between different "character sets", then we'll have problems for any binary string operation.

I've not really gotten into binary string operations. In general, cross-type operations will either throw exceptions or force an upgrade to a compatible character set. Upgrades will (or at least should) be sticky, so if you throw, say, a Unicode string into an array full of Latin-1 strings, by the time you're done sorting everything'll be promoted to Unicode, and worst case you'll have some ringing as the conversion propagates through.


I may, though, be completely deluded about that one.

Well, if you upgrade everything as you go, it will probably converge, but your sort order will likely depend on your initial order and your sort algorithm (that is, quicksort v. bubble sort v. heap sort), which is another way of saying it's indeterminate. This is because not every array element will end up being matched against all others, so only some of them will end up getting "upgraded". Certainly, the algorithmic efficiency will be decreased.
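
Here's a contrived Python sketch of exactly that failure mode. TaggedStr, the weights, and cmp_sticky are all made up for illustration, but they follow the rules above: cross-set comparison promotes both operands, promotion is sticky, and the two sets collate differently. Bubble sort and insertion sort then disagree about the final order--and insertion sort leaves one string's charset untouched, so which strings get "upgraded" depends on the algorithm too.

    class TaggedStr:
        """A one-character string tagged with a charset (contrived)."""
        def __init__(self, ch, charset):
            self.ch, self.charset = ch, charset
        def weight(self):
            # Pretend Latin-1 collates by raw code point and the
            # upgraded set collates case-insensitively.
            if self.charset == "latin-1":
                return ord(self.ch)
            return ord(self.ch.lower())
        def __repr__(self):
            return f"{self.ch}/{self.charset}"

    def cmp_sticky(a, b):
        # Cross-set comparison forces a sticky upgrade of both operands.
        if a.charset != b.charset:
            a.charset = b.charset = "unicode"
        return a.weight() - b.weight()

    def bubble(xs):
        for i in range(len(xs)):
            for j in range(len(xs) - 1 - i):
                if cmp_sticky(xs[j], xs[j + 1]) > 0:
                    xs[j], xs[j + 1] = xs[j + 1], xs[j]
        return xs

    def insertion(xs):
        for i in range(1, len(xs)):
            j = i
            while j > 0 and cmp_sticky(xs[j - 1], xs[j]) > 0:
                xs[j - 1], xs[j] = xs[j], xs[j - 1]
                j -= 1
        return xs

    def fresh():
        return [TaggedStr("B", "latin-1"),
                TaggedStr("_", "latin-1"),
                TaggedStr("a", "unicode")]

    print(bubble(fresh()))     # [_/unicode, B/unicode, a/unicode]
    print(insertion(fresh()))  # [B/latin-1, _/unicode, a/unicode]

Same three strings, two different orders. Worse, the bubble-sort result isn't even consistent with the final weights ("B" ends up weighing more than "a"), because some comparisons happened before the upgrade and some after.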


If upgrades are sticky (which makes sense, in order to minimize duplicated computation), then (due to the "character set" discussion above), the semantics of my strings will change upon sorting them (since their character sets will change).

See how that all doesn't make much sense?

Jeff


