On Sat, 19 Mar 2016 11:24 pm, BartC wrote about combining characters:

> So a string that looks like:
>
> "ññññññññññññññññññññññññññññññññññññññññññññññññññ"
>
> can have 2**50 different representations?

Yes.

> And occupy somewhere between 50 and 200 bytes? Or is that 400?

The minimum storage would use a legacy encoding (like MacRoman, or
Latin-1) with the composed ñ character. That gives 50 x 1-byte
characters, or 50 bytes. The maximum storage would be if all 50
characters were decomposed into two code points each (giving 100 code
points), and then stored as UTF-32, giving 400 bytes all up.
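To make that concrete, here's a quick sketch you can run at the
interactive interpreter (nothing fancy, just the std lib's unicodedata
module): the two spellings of ñ compare unequal until you normalise
them, and the byte counts come out to 50 versus 400 as described above.

    import unicodedata

    composed = "\u00F1"      # one code point: LATIN SMALL LETTER N WITH TILDE
    decomposed = "n\u0303"   # two code points: "n" + COMBINING TILDE

    # Both display as ñ, but as sequences of code points they differ.
    print(composed == decomposed)                                  # False
    print(unicodedata.normalize("NFC", decomposed) == composed)    # True

    # Each of the 50 characters can independently take either form:
    print(2**50)                                                   # 1125899906842624

    # Storage extremes for the 50-character string:
    print(len((composed * 50).encode("latin-1")))      # 50 bytes (1 byte each)
    print(len((decomposed * 50).encode("utf-32-be")))  # 400 bytes (100 code points x 4)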
> OK...

You say that as if 400 bytes was a lot. Besides, this is hardly any
different from (say) a pure ASCII version of the "permille" (per
thousand) symbol. In Unicode I can write ‰ (two bytes in UTF-16), but in
ASCII I am forced to write O/oo (four bytes), or worse, "per thousand"
(12 bytes). Imagine a string of "‰"*50, written out in ASCII, for a
total of 600 bytes...

Yes, this is silly. Really, if you've got 50 ñ in a string, they take up
the space they take up, and memory is cheap.

The days of thinking that 128 characters is all you need (7-bit ASCII)
are long, long gone, just like the days when it was appropriate for ints
to be 16 bits. When I first started programming, the default "integer"
type in Pascal, Forth and other languages was 16 bits, which meant that
the largest number you could represent in a calculation was 32767. My
four-function calculator had an 8 digit display and could calculate up
to 99999999, while Pascal choked beyond 32767 (or 65535 if you used
unsigned numbers).

Now, I routinely and without hesitation generate thousand-plus bit
numbers like 2**10000, and my computer calculates and prints the result
faster than I can enter the calculation in the first place.

Worrying about the fact that characters use more than 8 bits is
oh-so-1990s.

--
Steven