Hi Rugxulo,

>>> Unicode (now at 6.0) is pretty damn huge. I don't know if...
>>
>> While Unicode is huge, DOS keyboard layouts tend to be limited to
>> Latin and Cyrillic and some other symbols, which is a tiny subset.
>
> Well, determining which "subset" (for us) is the main problem.

You could start by using RECODE (DJGPP port if you like) to convert all DOS keyboard layouts that you can find from the codepage for which they are made into Unicode, then make a list of all distinct characters that you found... I would be surprised if it were more than 1000 of them, so even the "12000 useful chars" is a very high estimate.

>> Of course there are "input methods" where you can type multiple
>> keys and/or press complex combinations of keys to enter e.g. CJK
>> glyphs (Chinese Japanese Korean) but that is a quite different

Therefore, as somebody else already said, topic for another thread.

> Right-to-left might be hard to do (I guess?)

Not really, but I would guess it often means MIXING directions when you want to mention ASCII words on a right-to-left system. Yet again something for another thread or simply for Blocek ;-)

> I think I read on Wikipedia the other day that Unicode was originally
> only 16-bit, e.g. they thought it would cover "most popular languages
> currently in use", but it was later expanded.

Yes. And as said, while 20 bits can be encoded as 2 surrogates of 16 bits each in UTF-16, UTF-8 as originally defined can encode up to 31 bits (RFC 3629 later capped it at U+10FFFF, i.e. 4 bytes), and apart from UTF-16 there is of course also the possibility of UTF-32.

>>> 1). Chinese (hard)

Well, that just needs multiple infrastructure things, such as a special DISPLAY driver (graphics mode, e.g. 16x16 font, or 2*8x16 text if you can manage to do everything with 512 half-char shapes?), a special keyboard input method driver and a kernel with DBCS, plus of course DBCS awareness in apps which want to use CJK...

>>> 4). Arabic (easy??)
>>
>> Unicode lists maybe 300 chars for that, at most.
>
> Really? Wikipedia lists 28 char alphabet (single case), IIRC.

I was just checking the ranges of char numbers, not how well they are actually populated. Maybe accents added?

[Hindi Devanagari Bengali...]

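(Referring back to the RECODE idea at the top of this mail: a minimal Python sketch of the counting step. It is only an approximation under stated assumptions. It takes the union of the high halves, bytes 0x80 to 0xFF, of whole DOS codepages rather than the keys of actual keyboard layouts, so it gives an upper bound. The codepage list is simply what CPython ships codecs for, and the function names are made up for illustration.)

```python
# Sketch: count the distinct non-ASCII characters covered by some DOS codepages.
# ASSUMPTION: this list is just the DOS codecs bundled with CPython,
# not a survey of the layouts actually shipped with any DOS.
DOS_CODEPAGES = [
    "cp437", "cp737", "cp775", "cp850", "cp852", "cp855", "cp857", "cp858",
    "cp860", "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869",
]


def high_half_chars(codepage):
    """Return the set of characters that bytes 0x80..0xFF map to."""
    chars = set()
    for byte in range(0x80, 0x100):
        # Some codepages leave bytes undefined; with errors="ignore"
        # those decode to an empty string and add nothing.
        chars.update(bytes([byte]).decode(codepage, errors="ignore"))
    return chars


def distinct_chars(codepages):
    """Union of the non-ASCII characters over all given codepages."""
    result = set()
    for codepage in codepages:
        result |= high_half_chars(codepage)
    return result


if __name__ == "__main__":
    found = distinct_chars(DOS_CODEPAGES)
    print(len(found), "distinct non-ASCII characters")
```

Restricting the union to the characters that a keyboard layout can really produce would only make the number smaller.
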
Well, sounds like a case for an ISCII codepage font :-)

>> The well-known Cyrillic codepages squeeze ASCII and Cyrillic
>> (probably not all theoretically possible accents) in 256 chars.
>
> Probably like others only includes the "important" stuff.

Not necessarily. Like Latin, Cyrillic does not have THAT many accents in the language family. The problem is that DOS codepages often try to have too many symbols or even box drawing chars. Which in turn means that there is no DOS codepage with ALL Latin accented chars in it, even though I get the impression that 2 * 60 accented chars would cover the big majority of all Latin-like writings.

>>> 9). Japanese (hard)
>
> I didn't even look this one up, but I vaguely remember reading once
> that they use two or three scripts (ugh): hiragana, kanji, katakana

In general the CJK language family seems to have simplified and more ornamental / older versions of their big set of word / syllable-like glyphs. Plus indeed one or two ways of writing more alphabetically, e.g. for foreign words. And the latter is small - I remember that small text-only LCD matrix displays (e.g. with 5x7 font) only use 7-8 bits per char :-)

> BTW, wasn't your major in something like computational linguistics?

You remember my old email address at coli.uni-sb.de correctly ;-)

>> charsets like ASCII or Latin need only 1-2 bytes while you can
>> still encode up to 31 bits: U+07FF still fits 2 bytes and all
>> 16 bit chars need only 3 bytes, the rest is very rare...
>
> I think the real (proposed) advantage is that it doesn't waste space
> if your main language(s) are Western. Also the byte stream is
> recoverable if interrupted...

Yes, which means if you send UTF-8 to a display which expects 1 byte per char (e.g.
Latin) or Latin to a display which wants UTF-8, the mess will be local to the area around the non-ASCII parts :-) Also, while 1-2 bytes of UTF-8 for hex 0 to 7ff might focus on western languages, 3 bytes for up to 16 bits of Unicode, which covers the common CJK glyphs and almost all other writing systems, is okay. Of course CJK people might still prefer the then-smaller UTF-16?

> Right, but most Unicode-aware software isn't combining friendly

Dunno. I get the impression that "your mileage may vary", in particular if you use rare (combinations of) accents. Also, not all software uses accents in a well-defined way either.

> Well, I was just thinking how to save space.

Even if you precompose chars with their accents, it will compress quite well as a font file ;-)

> But do most people even view or edit multiple languages

I remember that the Bitstream Cyberbit TTF font also had a non-CJK edition which is only a few 100 kB AFAIR while still covering many languages. Maybe no Cherokee or such.

> I forgot that the DPMI standard supports 286 and 386, but writing a
> TSR for DPMI is pretty much hard to (not quite) impossible (and ugly).

I somehow doubt that. You could do something small and evil such as hooking basic int 10 functions like function 0e, TTY. No need to do big complex stuff with multiple interrupts, lots of I/O and API activity etc. Just receive text and render graphics. Of course it will not work with apps which write to b800:xyz, so trapping and redirecting that would be the bonus exercise, but I think I even did that in real mode once. Not as a real trap, but by keeping the 128 kB of graphics RAM from a000 to bfff on and periodically checking b800:xyz for changes, which would then be rendered with a font as graphics. Very long ago ;-)

Eric :-)
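
(To make the UTF-8 and UTF-16 size remarks in this mail concrete, here is a small Python sketch. The function names are hypothetical and only for illustration. It relies on Python's built-in codecs, which implement current UTF-8 as in RFC 3629, so the lengths stop at 4 bytes for U+10FFFF rather than going up to the 31 bits of the original design. The two "misread" helpers only simulate a display that assumes the wrong encoding.)

```python
def utf8_len(code_point):
    """Number of bytes UTF-8 needs for one code point (not a surrogate)."""
    return len(chr(code_point).encode("utf-8"))


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its two UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogates")
    offset = code_point - 0x10000  # 20 bits are left after the offset
    high = 0xD800 + (offset >> 10)   # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # lower 10 bits
    return high, low


def misread_as_latin1(text):
    """Simulate sending UTF-8 to a display expecting 1 byte per char."""
    return text.encode("utf-8").decode("latin-1")


def misread_as_utf8(text):
    """Simulate sending Latin-1 to a display expecting UTF-8."""
    return text.encode("latin-1").decode("utf-8", errors="replace")


if __name__ == "__main__":
    for cp in (0x41, 0x7FF, 0x800, 0xFFFF, 0x10000):
        print(f"U+{cp:04X}: {utf8_len(cp)} byte(s) in UTF-8")
    print(repr(misread_as_latin1("Grüße, DOS")))
    print(repr(misread_as_utf8("Grüße, DOS")))
```

In both misread cases the ASCII letters come through untouched and only the accented characters turn into garbage or replacement marks, which is the "mess stays local" property mentioned above.
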