Hi Eric,
>> would anyone out there happen to know how Arabic DOS, in the old
>> days, dealt with:
>>
>> 1) The control characters needed to handle the script - ZWJ (zero-width
>> joiner), ZWNJ (zero-width non-joiner), RLM (right-to-left mark), LRM
>> (left-to-right mark) - and the control characters needed to handle
>> bilingual text (LTR and RTL) in the same sentence: RLE/LRE (right-to-left
>> and left-to-right embedding), RLO/LRO (right-to-left and left-to-right
>> override) and PDF (pop directional formatting).
> All of those sound like control characters which would have
> to be understood by DISPLAY or similar and which will need
> space in the codepage, possibly in lesser used control char
> areas (ASCII 0 to 31 somewhere). (...)
They /are/ part of codepages, as a matter of fact. I've found ZWJ and
ZWNJ in ISO-8859-6, and all the other control characters mentioned in (1)
in the range A0h-A6h of both Arabic codepage 862 and Hebrew codepage 856.
They have no visual representation, unlike the control characters found
in the range 00h-1Fh and at 7Fh. Therefore, there's nothing to be done by
DISPLAY or MODE. There must have been proper Arabic/Hebrew text editors
(and other applications) out there which knew how to take advantage of
those control characters.
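
For reference, here is a small Python sketch of why these characters need no glyph: in today's Unicode terms all nine of them are "format" characters (general category Cf). It uses only the standard unicodedata module. The CONTROLS table is just my own shorthand for the characters named in (1), and it says nothing about which byte any DOS codepage assigned to them.

```python
import unicodedata

# The controls named in (1), by their Unicode code points.
CONTROLS = {
    "ZWNJ": "\u200c", "ZWJ": "\u200d",
    "LRM": "\u200e", "RLM": "\u200f",
    "LRE": "\u202a", "RLE": "\u202b", "PDF": "\u202c",
    "LRO": "\u202d", "RLO": "\u202e",
}

# Print code point, general category, bidi class and official name.
for abbr, ch in CONTROLS.items():
    print(f"{abbr:4} U+{ord(ch):04X} "
          f"{unicodedata.category(ch)} "
          f"{unicodedata.bidirectional(ch):3} "
          f"{unicodedata.name(ch)}")
```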
>> 2) Codepage 720 and many others which only present the isolated shapes
>> of the characters. DOS, seemingly, had somehow to rely on subfonts or
>> any feature which would cause DOS to trade the characters' isolated
>> shapes for their initial, medial or final shapes on-the-fly as the text
>> was typed.
> Maybe it just looked ugly and used non-contextual shapes? ;-)
Hmmmm... Very unlikely to have happened this way. If you ever saw text
written in the Arabic script in the correct direction (right-to-left)
but with every letter in its isolated shape, you would agree that it is
chaotic to the point of being unusable that way. There must have been
some trick somewhere.
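
Whatever the trick was, it had to do something like the following. This is only a sketch of contextual shaping in Python, not a claim about how any Arabic DOS actually did it: it picks the isolated, initial, medial or final form from the neighbouring letters, using the Arabic presentation forms that Unicode keeps for compatibility. The LETTERS table and the functions form and shape are my own illustrative names, and only four letters are covered.

```python
import unicodedata

# letter -> (Unicode name stem, does it connect to the FOLLOWING letter?)
LETTERS = {
    "\u0627": ("ALEF", False),  # connects only to the preceding letter
    "\u0628": ("BEH", True),
    "\u0644": ("LAM", True),
    "\u0645": ("MEEM", True),
}

def form(ch, kind):
    """Look up the presentation form (ISOLATED, INITIAL, MEDIAL, FINAL)."""
    return unicodedata.lookup(f"ARABIC LETTER {LETTERS[ch][0]} {kind} FORM")

def shape(text):
    """Replace each known letter, in logical order, by its contextual form."""
    out = []
    for i, ch in enumerate(text):
        if ch not in LETTERS:
            out.append(ch)  # leave anything outside the table untouched
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        joins_prev = prev in LETTERS and LETTERS[prev][1]
        joins_next = LETTERS[ch][1] and nxt in LETTERS
        if joins_prev and joins_next:
            kind = "MEDIAL"
        elif joins_prev:
            kind = "FINAL"
        elif joins_next:
            kind = "INITIAL"
        else:
            kind = "ISOLATED"
        out.append(form(ch, kind))
    return "".join(out)

word = "\u0628\u0627\u0628"  # beh, alef, beh
print([unicodedata.name(c) for c in shape(word)])
```

The sample word comes out as an initial beh, a final alef and an isolated beh, because alef never connects to the letter that follows it. A real shaper would also need the full letter table and ligatures such as lam-alef, which this sketch ignores.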
>> 3) Combining chars. All arabic codepages, including cp864, include at
>> least two codepoints which present them.
> You mean Unicode would represent them either as pre-combined
> or as some character plus a separate accent "character"? Not
> something that DOS is likely to have cared about, probably
> it only used pre-composed characters and had the characters
> without accent as separate entities, just like Latin vowels
> and Latin accented vowels (umlauts etc) having separate full
> shape font items in CP850 and similar. Note that CP850 does
> not even have "double dot above" for composition, it only
> has that as part of pre-composed umlaut character shapes...
I think it is very unlikely that the Unicode Consortium will ever
provide precomposed "accented" Arabic (or Hebrew, or Syriac, or Divehi)
letters... Arabic (and Hebrew, and Syriac) letters are not "accented".
The combining chars used in these scripts play a wholly different role,
and they are even omitted in most scenarios (but mandatory in others).
There is also the case of the Divehi script, which is also written
right-to-left, even looks like the Arabic script to the untrained eye,
and makes much heavier use of combining chars because they are always
mandatory for every single letter in every word.
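
To make the difference concrete, here is a small Python check (standard unicodedata module only; the variable names are mine). It shows that an Arabic letter followed by a vowel mark stays as two code points under NFC normalization, since there is no precomposed character to fold them into, whereas a Latin letter plus diaeresis collapses into one.

```python
import unicodedata

beh_fatha = "\u0628\u064e"   # ARABIC LETTER BEH + ARABIC FATHA
a_diaeresis = "a\u0308"      # LATIN SMALL LETTER A + COMBINING DIAERESIS

# NFC composes wherever a precomposed character exists.
for label, s in (("beh+fatha", beh_fatha), ("a+diaeresis", a_diaeresis)):
    nfc = unicodedata.normalize("NFC", s)
    print(label, [f"U+{ord(c):04X}" for c in nfc])

# The fatha is a non-spacing combining mark (general category Mn).
print(unicodedata.category("\u064e"), unicodedata.combining("\u064e"))
```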

The Vietnamese case is an interesting parallel. Before Unicode was
available, Vietnamese computers used codepages which provided all their
accented letters in precomposed form, since they seemingly didn't handle
combining chars on DOS either. Now we find all those precomposed
accented Latin Vietnamese letters in Unicode - though for compatibility
with legacy applications only, because nowadays the Vietnamese type
their text by making (heavy) use of the 5 combining chars that they
need: acute, grave, hook above, tilde and dot below. Perhaps, if it had
ever been possible to encode all precomposed Arabic "accented" letters
in 8-bit codepages, we would have them in Unicode today, but for the
same single reason - backward compatibility.
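
The relationship between the two spellings can be seen from Python's unicodedata module: a precomposed Vietnamese letter and its base-plus-combining-chars spelling are canonically equivalent, so normalization converts one into the other. The sample letter is my own choice.

```python
import unicodedata

precomposed = "\u1ec7"  # LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
decomposed = unicodedata.normalize("NFD", precomposed)

# List the base letter and the combining chars it breaks down into.
for c in decomposed:
    print(f"U+{ord(c):04X} {unicodedata.name(c)}")

# Recomposing gives back the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed)
```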

By the way, as for cp850, it does have stand-alone cedilla, acute
accent, diaeresis and macron, probably meant to be used only as
"combining printing chars", since this is how we used them in the old
days when we wanted to print Portuguese text on printers which did not
provide hardcoded codepages.
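
These four can be checked with Python's built-in cp850 codec. The overstrike helper below is only my guess at the printer technique (letter, backspace, accent): the name overstrike and the use of BS (08h) are assumptions for illustration, and real printers differed in how they handled it.

```python
import unicodedata

STANDALONE = "\u00b8\u00b4\u00a8\u00af"  # cedilla, acute accent, diaeresis, macron

# Show the single cp850 byte for each stand-alone accent.
for ch in STANDALONE:
    byte = ch.encode("cp850")  # would raise UnicodeEncodeError if cp850 lacked it
    print(f"{byte[0]:02X}h {unicodedata.name(ch)}")

def overstrike(letter, accent):
    """Hypothetical: send the letter, back up one column, send the accent."""
    return letter.encode("cp850") + b"\x08" + accent.encode("cp850")

print(overstrike("c", "\u00b8").hex(" "))
```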
>> Hebrew DOS is a simpler case yet topic #3 also applies to the script
>> and, with the exception of control characters ZWJ and ZWNJ, topic #1
>> also does.
> So it is interesting to hear how Hebrew codepages "tick" :-)
Well... Almost. It "ticks" as much as Arabic codepages do, provided that
users don't need combining chars. :-)

Henrique

_______________________________________________
Freedos-user mailing list
Freedos-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freedos-user
