On 4/30/2010 1:56 AM, José Carlos Santos wrote:
> Since no sophisticated solution appeared (or occurred to me), I shall do that. But I think it is a flaw.

You don't need a sophisticated solution when the simple one is the only correct one. In the old days we had to worry about which character set we were using, because different sets gave you different characters at identical codepoints. In the old, dark days of not-a-lot-of-memory and "there are other languages?" land, resolving chr(128) could give you the euro symbol, or it might be the first byte of a multi-byte Japanese character, and - and this is the important part - the file you were reading wouldn't in any way tell you which interpretation you were supposed to use.
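To make that concrete, here's a quick Python sketch (not TeX, just to show the bytes; the values below are the standard cp1252 and Shift-JIS mappings):

    # The exact same two bytes, decoded under two different codepage assumptions.
    raw = bytes([0x82, 0xA0])

    print(raw.decode("cp1252"))     # a low quote mark plus a no-break space
    print(raw.decode("shift_jis"))  # a single hiragana character: 'あ'

    # And the chr(128) case mentioned above: under cp1252 that one byte is the euro sign.
    print(bytes([0x80]).decode("cp1252"))  # '€'

    # Both of the first two decodes succeed without any error; nothing in the
    # data itself says which interpretation was intended.

Both readings are perfectly valid as far as the bytes are concerned; only outside knowledge tells you which one the author meant.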

With more modern computing technology available, and a realisation that there are such things as other countries and writing systems, this became ridiculous, and we came up with a new way of doing things: Unicode, mirrored by the ISO/IEC 10646 standard. This convention specifies a huge character map with every distinct character in its own spot, and specifies various ways to efficiently encode the parts of that map you actually use, so that on average storing Unicode data takes up only a negligibly small amount more disk space than storing it using the crooked concept of codepages. More importantly, data stored in Unicode can TELL you that it's stored in Unicode. With that, the world finally realised how much better things are when the data actually indicates how to decode it, instead of having people go "okay, but... what the heck codepage is this text file actually in?".
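One concrete mechanism for that self-identification - just one of several, sketched here in Python, and the file name is only an example:

    import codecs

    # Write a file with a UTF-8 byte-order mark ("utf-8-sig" prepends the BOM).
    with open("note.txt", "w", encoding="utf-8-sig") as f:
        f.write("héllo")

    # The first three bytes now announce the encoding to any reader that looks.
    with open("note.txt", "rb") as f:
        data = f.read()
    print(data.startswith(codecs.BOM_UTF8))  # True
    print(codecs.BOM_UTF8)                   # b'\xef\xbb\xbf'

A codepage-encoded file has no equivalent of this; the best a reader can do is guess.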

The whole reason XeTeX exists is that there was no real Unicode awareness in the various flavours of TeX until Jonathan Kew started building one. The lack of an encoding declaration in XeTeX is not a flaw; it is a victory for people who actually want to write things properly. This is what we should have had in the first place, had there not been all those early technological restrictions. It finally lets us all write in every language imaginable without having to worry about whether the letter we want is in the codepage we're using and, if it isn't, how to construct what should be a perfectly simple character out of lots of TeX code combining letters and symbols in ways that only work for the one font we happen to be using. By going with Unicode, XeTeX made, and still makes, things intuitively easy. You write your text, XeTeX compiles what you wrote, and you are never bothered with figuring out whether the character you want is in the codepage you're using.

Of course, note that this is very different from needing to verify that the character you want is in the font you are using. Codepages tell you which characters even exist as far as the computer is concerned. Need a lambda symbol when you're writing something in cp1252? Tough, it doesn't exist. Not just "in the font you are using": it simply doesn't exist until you change the codepage for your entire data context to something else.
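The difference shows up immediately if you try it (again a small Python sketch, purely illustrative):

    # A lambda simply has no slot in cp1252, so encoding to it fails outright...
    try:
        "λ".encode("cp1252")
    except UnicodeEncodeError as e:
        print("no lambda in cp1252:", e)

    # ...while any Unicode encoding handles it without a second thought.
    print("λ".encode("utf-8"))  # b'\xce\xbb'

The character isn't "missing from the font"; as far as a cp1252 data context is concerned, it cannot be represented at all.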

Codepages are a thing from a dark past, when typesetting was severely impaired by fonts simply not being big enough to contain all the letters people might need, and by there not being a well-defined codepoint mapping between glyphs (what you see with your eyes) and characters (what the thing you're seeing actually represents).

At this point in time (finally, one might add) only old operating systems still really care about codepages - the rest have moved on to embrace a world where it doesn't matter what language you write in, because letters from one language are no longer mutually exclusive with letters from another. In the TeX world, too, great efforts are being made to ditch the antique concept of codepages, with XeTeX and LuaTeX constantly improving.

If you want to typeset things nicely, and you actually care about the language you're using - you're using French, so you really should care - don't use cp1252; ANSI is the *AMERICAN* standard for an 8-bit character set. Codepages were invented to overcome the problem of only having 256 spots for letters. Unicode solved that problem. Why make XeTeX use a solution for a problem that doesn't exist anymore?

- Mike


