On Tue, 8 Nov 2011 23:04:25 -0600 (CST) Robert Bonomi <bon...@mail.r-bonomi.com> wrote:
>
> "Conrad J. Sabatier" <conr...@cox.net> wrote:
> >
> > <grin>
> >
> > Yes, and this is one area where the labels are more than a little
> > misleading as well.  My natural inclination is to think of UTF-8 as
> > being a single-byte representation for each character in the set,
> > whereas UTF-16, as the name implies, would be the "wide", 2-byte
> > version.
>
> "Not exactly."
>
> > Nonetheless, as I posted earlier in this thread, according to the
> > info in gucharmap, the representations of the umlauted "u" are just
> > the opposite of this:
>
> "Not exactly."  Again.
>
> > UTF-8: 0xC3 0xBC
> > UTF-16: 0x00FC
> >
> > Go figure, huh? :-)
>
> In UTF-16, everything _is_ a 16-bit entity.  Notice that 0x00FC has
> -four- nybbles after the '0x'.  Every character boundary is on a
> multiple of 16 bits.

Ah yes!  I hadn't noticed that.

What's really weird, as I mentioned in a later private email to
Polytropon last night, is that the copy-and-paste in gucharmap suddenly
decided to start copying the UTF-8 code instead of the UTF-16.  I have
no idea why that changed.

> In UTF-8, the 'base' charset -- the 'C0' and 'C1' groups are
> represented by a single byte.  'extended' characters are represented
> by two bytes.  Thus, 'characters' have a *variable*length*
> representation -- one or two bytes.  A character, whether it is
> represented by one or two bytes, can begin on -any- byte boundary
> within a data stream, depending on 'what came before it'.  UTF-8
> 2-byte representations are designed such that one can jump to any
> _byte_ offset within the file, and determine -- by looking *only* at
> the value of that byte -- whether it is (a) a single-byte character, (b)
> the first byte of a two-byte sequence, or (c) the second byte of a
> two-byte sequence.
>
> With UTF-16 you can position directly to any -character-, by jumping
> to a _byte_ offset that is twice the index of the character you want.
> Given a byte offset, you always know the 'equivalent' _character_
> offset.
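To convince myself of the above, here is a small Python sketch (not
anything gucharmap itself does) that prints both encodings of the
umlauted "u" and classifies UTF-8 bytes by looking only at the byte
value.  The helper names classify_utf8_byte and utf16_byte_offset are
made up for illustration.  Two caveats as far as I understand the
encodings: in UTF-8 only U+0000 through U+007F fit in one byte, and
sequences can run to four bytes, not just two; and the "twice the
index" arithmetic for UTF-16 assumes the text contains no surrogate
pairs (nothing above U+FFFF).

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte of a UTF-8 stream by looking only at that byte."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: second or later byte of a sequence
    if b in (0xC0, 0xC1) or b > 0xF4:
        return "invalid"       # these values never occur in valid UTF-8
    return "lead"              # first byte of a 2-, 3- or 4-byte sequence


def utf16_byte_offset(char_index: int) -> int:
    """Byte offset of a character in UTF-16 (assumes no surrogate pairs)."""
    return 2 * char_index


u_umlaut = "\u00fc"                      # LATIN SMALL LETTER U WITH DIAERESIS
utf8 = u_umlaut.encode("utf-8")
utf16 = u_umlaut.encode("utf-16-be")     # big-endian, no byte-order mark

print(utf8.hex(" "))                     # c3 bc
print(utf16.hex(" "))                    # 00 fc
print([classify_utf8_byte(b) for b in utf8])   # ['lead', 'continuation']

# Direct positioning in UTF-16: fetch the character at index 2.
data = "Gr\u00fc\u00dfe".encode("utf-16-be")
offset = utf16_byte_offset(2)
unit = data[offset:offset + 2].decode("utf-16-be")
print(unit == u_umlaut)                  # True
```

So the two-byte UTF-8 form and the single 16-bit UTF-16 unit are the
same character, just packed differently.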
>
> With UTF-8, you have to read the character stream, counting
> 'characters' as you go, to get to the desired point.  You can seek to
> an arbitrary _byte_ offset, but you do not know how many 'characters'
> into the file that offset is.

I see.  Yes, that could certainly complicate things.

> UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8), and
> simplicity of addressing/representation (UTF-16).
>
> > This seems rather unfortunate to me.  You would think that, by now,
> > some "standard" character set might have emerged that would allow
> > one to use, at the very least, the "Western" characters (as opposed
> > to the "Eastern" or "Oriental" or "Asian", if you will) with a
> > reasonable expectation that others will see what was intended.
>
> Heh.
>
> How many 'character' codes are you willing to devote to national
> 'currency symbols', just for starters?  Probable minimum of two per
> currency -- one for the minimum coinage unit (cent, pence, pfennig,
> etc.) and one for the denomination unit (dollar, pound, mark, kroner,
> etc.)
>
> Now, one (obviously) has to have the basic 'Roman' alphabet.
>
> Then there are all the diacritical markings (accent, accent grave, dot
> umlaut, ring, bar, 'hat', inverted hat, etc.) for vowels.  And
> cedilla, tilde, etc., for select consonants.  Plus language specific
> symbols like ess-zett, 'thorn', etc.
>
> How about phonetic symbols, like 'schwa'?
>
> And Greek for all sorts of scientific use?
>
> What about Cyrillic characters, for many Eastern European languages?
>
> Now, consider punctuation marks:
>    the 'typewriter' basics,
>    How many of 'minus-sign, hyphen, em-dash, en-dash, soft-hyphen'
>      are needed?
>    How many of 'accent, accent grave, apostrophe,
>      opening/closing single-quote' are needed?
>    opening/closing double-quotes, and/or a 'position neutral'
>      double-quote?
>
> "Other symbols", like --
>    digits,
>    common fractions,
>    'Trademark', 'Registered trademark', 'copyright'
>    'paragraph', 'section',
>    superscripts -- exponents, footnotes, etc.
>    subscripts -- chemical formulae, etc.
>    "Simple line-drawing graphics"
>
> Diphthongs??  Ligatures??
>
> Start counting things up.
>
> An 8-bit 'address space' gets used up _really_ quick.
>
> <wry grin>

I certainly get the point. :-)  Thanks for that very thorough
elucidation. :-)

Now I just have to figure out what the heck's going on here, why
suddenly I'm seeing the exact opposite of what I was seeing yesterday.
Thought I had everything straightened out for a while there. :-(

Oh, this is madness! :-)

-- 
Conrad J. Sabatier
conr...@cox.net
_______________________________________________
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscr...@freebsd.org"