On Sun, Jun 9, 2013 at 4:01 AM, Νικόλαος Κούρας <nikos.gr...@gmail.com> wrote:
> Hold on!
>
> In the beginning there was ASCII with 0-127 values and then there was
> Unicode with 0-127 of ASCII's + i dont know how much many more?
>
> Now ASCII needs 1 byte to store a single character while Unicode needs 2
> bytes to store a character and that is because it has > 256 characters to
> store > 2^8 bits?
>
> Is this correct?
No. Let me start from the beginning.

Computers don't work with characters, or strings, natively. They work with numbers. To be specific, they work with bits; and it's only by convention that we can work with anything larger. For instance, there's a VERY common convention around the PC world that a set of bits can be interpreted as a signed integer; if the highest bit is set, it's negative. There are also standards for floating-point (IEEE 754), and so on.

ASCII is a character set. It defines a mapping of numbers to characters - for instance, @ is 64, SOH is 1, $ is 36, etcetera, etcetera. There are 128 such mappings. Since they all fit inside a 7-bit number, there's a trivial way to represent ASCII characters in a PC's 8-bit byte: you just leave the high bit clear and use the other seven.

There have been various schemes for using the eighth bit - serial ports with parity, WordStar (I think) marking the ends of words, and most notably, Extended ASCII schemes that give you another whole set of 128 characters. And that was the beginning of Code Pages, because nobody could agree on what those extra 128 should be. Norwegians used Norwegian, the Greeks were taught their Greek, Arabians created themselves an Arabian codepage with the speed of summer lightning, and Hebrews allocated from 255 down to 128, which is absolutely frightening. But I digress.

There were a variety of multi-byte schemes devised at various times, but we'll ignore all of them and jump straight to Unicode. With Unicode, there's (theoretically) no need to use any other system ever again, because whatever character you want, it'll exist in Unicode. In theory, of course; there are debates over that.

Now, Unicode currently has defined an "address space" of 21 bits, and in a throwback to the first programming I ever did, it's a segmented system: seventeen planes of 65,536 characters each, for a maximum code point of U+10FFFF.
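Jumping ahead a little: Python's ord() and chr() functions make that number-to-character mapping concrete, for ASCII and for the higher planes alike. A quick sketch:

```python
# ASCII is a mapping of numbers to characters; ord() and chr() expose it.
print(ord('@'))   # 64
print(ord('$'))   # 36
print(chr(64))    # @

# Every ASCII value fits in seven bits, leaving the high bit clear:
assert all(ord(c) < 128 for c in 'Hello, world!')

# Beyond ASCII, code points keep counting up through the Unicode planes:
print(hex(ord('α')))   # 0x3b1   (GREEK SMALL LETTER ALPHA, plane 0)
print(hex(ord('𝄞')))   # 0x1d11e (MUSICAL SYMBOL G CLEF, plane 1)
```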
(Fortunately the planes are identified by low numbers, not high numbers, and there's no stupidity of overlapping planes the way the 8086 did with memory!) The highest planes are special (plane 14 has a few special-purpose characters, planes 15 and 16 are for private use), and most of the middle ones have no characters assigned to them, so for the most part, you'll see characters from the first three planes.

So what do we now have? A mapping of characters to "code points", which are numbers. (I'm leaving aside the issues of combining characters and such for the moment.) But computers don't work with numbers, they work with bits. Somehow we have to store those bits in memory. There are a good few ways to do that.

One is to note that every Unicode character can be represented inside 32 bits, so we can use the standard integer scheme safely. (Since they fit inside 31 bits, we don't even need to care if it's signed or unsigned.) That's called UTF-32 or UCS-4, and it's a great way to handle the full Unicode range in a manner that makes a Texan look agoraphobic. Wide builds of Python up to 3.2 did this.

Or you can try to store them in 16-bit numbers, but then you have to worry about the ones that don't fit in 16 bits, because it's really hard to squeeze 21 bits of information into 16 bits of storage. UTF-16 is one way to do this; special numbers (surrogate pairs) mean "grab another number". It has its issues, but is (in my opinion, unfortunately) fairly prevalent. Narrow builds of Python up to 3.2 did this.

Finally, you can use a more complicated scheme that uses anywhere from 1 to 4 bytes for each character, by carefully encoding information into the top bits of each byte - if the high bit is set, the byte is part of a multi-byte character. That's how UTF-8 works, and it's probably the most prevalent disk/network encoding.

All of the UTF-X systems are called "UCS Transformation Formats" (UCS meaning Universal Character Set, roughly "Unicode"). They are mappings from Unicode numbers to bytes.
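You can see the trade-offs directly by encoding a few characters with each format. A small sketch (the '-le' codec spellings skip the byte order mark Python would otherwise prepend):

```python
# Compare how many bytes each UTF needs for a given character.
for ch in ('A', 'α', '𝄞'):   # U+0041, U+03B1, U+1D11E
    print(f"U+{ord(ch):04X}: "
          f"UTF-8 {len(ch.encode('utf-8'))} bytes, "
          f"UTF-16 {len(ch.encode('utf-16-le'))} bytes, "
          f"UTF-32 {len(ch.encode('utf-32-le'))} bytes")

# UTF-32 is always 4 bytes per character; UTF-16 needs a surrogate
# pair (4 bytes) for the plane-1 G clef; UTF-8 ranges from 1 byte
# (ASCII) up to 4, depending on the code point.
```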
Between Unicode and UTF-X, you have a mapping from character to byte sequence.

> Now UTF-8, latin-iso, greek-iso e.t.c are WAYS of storing characters into
> the hard drive?

The ISO standard 8859 specifies a number of ASCII-compatible encodings, referred to as ISO-8859-1 through ISO-8859-16. You've been working with ISO-8859-1, also called Latin-1, and ISO-8859-7, which has your Greek characters in it. These are all ways of translating characters into numbers; and since they all fit within 8 bits, they're most commonly represented on PCs with single bytes.

> So taken from above example (the closest i could think of), the way i
> understand them is:
>
> A 'string' can be of (unicode's or ascii's) type and that type needs a way
> (thats a charset) to store this string into the hdd as a sequence of bytes?

A Python 3 'string' is always a series of Unicode characters. How they're represented in memory doesn't matter, but as of Python 3.3 that's a fairly compact and efficient system (the flexible string representation of PEP 393) that can omit unnecessary zero bits. To store that string on your hard disk, send it across a network, or transmit it to another process, you need to encode it as bytes, somehow. The UCS Transformation Formats are specifically designed for this, and most of the time, UTF-8 is going to be the best option. It's compact, it's well known, and usually, it'll do everything you want. The only thing it won't do is let you quickly locate the Nth character, which is why it makes a poor in-memory format.

Fortunately, Python lets us hide away pretty much all those details, just as it lets us hide away the details of what makes up a list, a dictionary, or an integer. You can safely assume that the string "foo" is a string of three characters, which you can work with as characters. The chr() and ord() functions let you switch between characters and numbers, and str.encode() and bytes.decode() let you switch between characters and byte sequences.
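For instance, here's a sketch with a short Greek string - the same characters, encoded two different ways, and round-tripped back:

```python
text = 'Νίκος'   # five Unicode characters

utf8_bytes = text.encode('utf-8')        # two bytes per Greek letter
greek_bytes = text.encode('iso-8859-7')  # one byte per character

print(len(text), len(utf8_bytes), len(greek_bytes))  # 5 10 5

# Decoding with the matching codec round-trips exactly:
assert utf8_bytes.decode('utf-8') == text
assert greek_bytes.decode('iso-8859-7') == text

# Decoding with the wrong code page silently gives you mojibake:
print(greek_bytes.decode('iso-8859-1'))  # prints gibberish, not Greek
```

That last line is the classic code-page failure: the bytes are valid in both encodings, so nothing raises an error - you just get the wrong characters.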
Once you get your head around the differences between those three - characters, numbers, and byte sequences - it all works fairly neatly.

Chris Angelico
--
http://mail.python.org/mailman/listinfo/python-list