At 21 Jan 2005 19:31:13 -0800,
Thomas Bushnell BSG wrote:
>
> Marcus Brinkmann <[EMAIL PROTECTED]> writes:
>
> > UTF-8 is an insanely complex standard, if you start to look down its
> > depths.
>
> UTF-8 is a complex standard.  It is not insanely so.  It is complex
> because it is representing a very complex problem.

Oh, sure.  The insanity starts if you talk about using "UTF-8" for things like filenames without being very exact about what you mean by that.  The implications of putting a system as complex as UTF-8 into POSIX-like operating systems as they exist today are not well understood, and the resulting loose ends, conflicts, etc. are not resolved as of today.  So, the phrase "do the right thing with UTF-8" is subject to substantial interpretation.

My summary was intended to show that, given today's understanding of the above situation, I believe we do the "right thing with UTF-8".  More specifically (and please also see the quote below), we only support specific scripts at ISO 10646-1 implementation Level 1.

I don't think we disagree, and I am not really ranting, so there is not much left to say, I guess.  But just to be clear: I am just as gung ho as any geek about seeing Tibetan quotations in a Russian mail about some math problems that's in my inbox along with the Korean spam - and everything out of the box on the text console.

The essence of what I wrote is just that UTF-8 is not the hammer for every nail (you will always find people who feel their script is misrepresented in Unicode), nor is it really clear what practical UTF-8 support means nowadays.  To a substantial extent, it is still experimental and a work in progress.  People are working on it, of course, and if POSIX demands that file name lookups be done by comparing the Normalization Form C of each string, we should and will implement this in libdiskfs etc.  We should walk this march in lock-step with the rest of the world, and let them do the work for us of figuring out what needs to be done.  No more and no less, I think.

The UTF-8 and Unicode FAQ for Unix/Linux can be found here:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

One paragraph is particularly interesting:

"Full Unicode functionality with all bells and whistles (e.g.
high-quality typesetting of the Arabic and Indic scripts) can only be expected from sophisticated multi-lingual word-processing packages. What Linux supports today on a broad base is far simpler and mainly aimed at replacing the old 8- and 16-bit character sets. Linux terminal emulators and command line tools usually only support a Level 1 implementation of ISO 10646-1 (no combining characters), and only scripts such as Latin, Greek, Cyrillic, Armenian, Georgian, CJK, and many scientific symbols are supported that need no further processing support. At this level, UCS support is very comparable to ISO 8859 support and the only significant difference is that we have now thousands of different characters available, that characters can be represented by multibyte sequences, and that ideographic Chinese/Japanese/Korean characters require two terminal character positions (double-width)."

We don't have support for ideographic CJK characters.  I didn't know how to implement that and thought it would be better left to somebody who actually writes such things (I still try to write my ideographic CJK characters with a calligraphy brush).  But apart from that, this is about the level at which we support things, and I did that pretty purposefully.

Take it easy ;)

Thanks,
Marcus

_______________________________________________
Bug-hurd mailing list
Bug-hurd@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-hurd