Hi Eric,

>> would anyone out there happen to know how did arabic DOS, on the old
>> days, deal with:
>>
>> 1) The control characters needed to handle the script - ZWJ (Zero-width
>> joiner), ZWNJ (zero-width non-joiner), RLM (right-to-left mark), LRM
>> (left-to-right mark) and control characters needed to handle bilingual
>> text (LTR and RTL) in a same sentence: RLE/LRE (right-to-left and
>> left-to-right embedding), RLO/LRO (right-to-left and left-to-right
>> override) and PDF (POP directional Formatting).

> All of those sound like control characters which would have
> to be understood by DISPLAY or similar and which will need
> space in the codepage, possibly in lesser used control char
> areas (ASCII 0 to 31 somewhere). (...)

They /are/ part of codepages, as a matter of fact. I've found ZWJ and
ZWNJ in ISO-8859-6, and all the other control characters mentioned in (1)
in the range A0h-A6h of both arabic codepage 862 and hebrew codepage 856.
They have no visual representation, unlike the control characters in the
range 00h-1Fh and at 7Fh. Therefore, there is nothing for DISPLAY or MODE
to do. There must have been proper Arabic/Hebrew text editors (and other
applications) out there which knew how to take advantage of those
control characters.

>> 2) Codepage 720 and many others which only present the isolated shapes
>> of the characters. DOS, seemingly, had somehow to rely on subfonts or
>> any feature which would cause DOS to trade the characters' isolated
>> shapes for their initial, medial or final shapes on-the-fly as the text
>> was typed.

> Maybe it just looked ugly and used non-contextual shapes? ;-)

Hmmmm... Very unlikely. If you ever saw text in the Arabic script written
in the correct direction (right-to-left) but with every letter in its
isolated shape, you would agree that it is chaotic to the point of being
unusable. There must have been some trick somewhere.
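
How Arabic DOS actually performed this substitution is not established
in this thread. Purely as an illustration of what "trading isolated
shapes for initial, medial or final shapes on the fly" involves, here is
a minimal Python sketch. It assumes Unicode's Arabic Presentation
Forms-B block as the target glyph set (not any DOS codepage), and the
names `build_forms` and `shape` are hypothetical. It ignores combining
marks, ZWJ/ZWNJ, lam-alef ligatures and right-to-left reordering, all of
which a real implementation would have to handle.

```python
import unicodedata

KINDS = ("isolated", "initial", "medial", "final")

def build_forms():
    """Map each base Arabic letter to the presentation forms it has."""
    forms = {}
    for cp in range(0xFE70, 0xFF00):  # Arabic Presentation Forms-B
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter entries such as "<initial> 0628"; skip
        # ligatures and spacing marks, which decompose to two code points.
        if len(parts) == 2 and parts[0].strip("<>") in KINDS:
            base = chr(int(parts[1], 16))
            forms.setdefault(base, {})[parts[0].strip("<>")] = chr(cp)
    return forms

FORMS = build_forms()

def shape(text):
    """Replace each letter (logical order) by its contextual shape."""
    out = []
    for i, ch in enumerate(text):
        own = FORMS.get(ch)
        if own is None:          # not a shapeable Arabic letter
            out.append(ch)
            continue
        prev = FORMS.get(text[i - 1], {}) if i > 0 else {}
        nxt = FORMS.get(text[i + 1], {}) if i + 1 < len(text) else {}
        # A letter connects forward only if it has an initial form.
        joins_prev = "initial" in prev and "final" in own
        joins_next = "initial" in own and "final" in nxt
        if joins_prev and joins_next:
            kind = "medial"
        elif joins_prev:
            kind = "final"
        elif joins_next:
            kind = "initial"
        else:
            kind = "isolated"
        out.append(own.get(kind, ch))
    return "".join(out)
```

For example, `shape("\u0628\u064a\u062a")` (beh, yeh, teh) yields the
initial, medial and final forms U+FE91, U+FEF4, U+FE96, while in
`shape("\u0628\u0627\u0628")` the last beh stays isolated because alef
does not connect to the letter that follows it.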

>> 3) Combining chars. All arabic codepages, including cp864, include at
>> least two codepoints which present them.

> You mean Unicode would represent them either as pre-combined
> or as some character plus a separate accent "character"? Not
> something that DOS is likely to have cared about, probably
> it only used pre-composed characters and had the characters
> without accent as separate entities, just like Latin vowels
> and Latin accented vowels (umlauts etc) having separate full
> shape font items in CP850 and similar. Note that CP850 does
> not even have "double dot above" for composition, it only
> has that as part of pre-composed umlaut character shapes...

I think it is very unlikely that the Unicode Consortium will ever
provide precomposed "accented" Arabic (or Hebrew, or Syriac, or Divehi)
letters... Arabic (and Hebrew, and Syriac) letters are not "accented".
The combining chars used in these scripts play a wholly different role,
and they are even omitted in most scenarios (but mandatory in others).
There is also the case of the Divehi script, which is also written
right-to-left, even looks like the Arabic script to the untrained eye,
and makes much heavier use of combining chars because they are
mandatory for every single letter in every word.
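
The difference between Latin accents and these marks can be seen in
present-day Unicode data. The sketch below is only an illustration in
Python; it says nothing about how any DOS codepage stored the marks,
and the helper names `is_combining_mark` and `nfc_len` are hypothetical.
It composes a Latin umlaut into one precomposed code point, whereas an
Arabic, Hebrew or Divehi (Thaana) letter followed by a vowel mark stays
as two code points after the same normalization.

```python
import unicodedata

def is_combining_mark(ch):
    """True if the character's general category is a mark (Mn, Mc, Me)."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def nfc_len(text):
    """Number of code points left after canonical composition (NFC)."""
    return len(unicodedata.normalize("NFC", text))

samples = {
    "Latin u + diaeresis": "u\u0308",
    "Arabic beh + fatha": "\u0628\u064e",
    "Hebrew bet + patah": "\u05d1\u05b7",
    "Thaana haa + abafili": "\u0780\u07a6",
}

for label, text in samples.items():
    marks = sum(is_combining_mark(ch) for ch in text)
    print(f"{label}: {len(text)} code points, {marks} mark(s), "
          f"{nfc_len(text)} after NFC")
```

Only the Latin pair shrinks to a single code point; the other three
have no precomposed equivalent to compose into.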

The Vietnamese case is an interesting parallel. Before Unicode was
available, Vietnamese computers dealt with codepages which provided all
their accented letters in precomposed form, since they seemingly didn't
handle combining chars on DOS either. Now we find all those precomposed
accented Latin Vietnamese letters in Unicode, though for compatibility
with legacy applications only, because nowadays the Vietnamese type
their text only by making (heavy) use of the 5 combining chars that
they need: acute, grave, tilde, dot below and horn. Perhaps, had it ever
been possible to encode all precomposed Arabic "accented" letters in
8-bit codepages, we would have them in Unicode today, but only for that
same reason: backward compatibility.

By the way, as far as cp850 is concerned, it does have stand-alone
cedilla, acute accent, diaeresis and macron, probably meant to be used
only as "combining printing chars", since this is how we used them in
the old days when we wanted to print Portuguese text on printers which
did not provide hardcoded codepages.

>> Hebrew DOS is a simpler case yet topic #3 also applies to the script
>> and, with the exception of control characters ZWJ and ZWNJ, topic #1
>> also does.

> So it is interesting to hear how Hebrew codepages "tick" :-)

Well... Almost. It "ticks" as much as Arabic codepages do, provided that
users don't need combining chars. :-)

Henrique