On 10.06.2013 09:10, nagia.rets...@gmail.com wrote:
On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:
py> c = 'α'
py> ord(c)
945
The number 945 is the ordinal value of the character 'α' in the Unicode
charset, correct?
Yes, the Unicode character set is just a big list of characters. The
entry at index 945 in that list (counting from 0) happens to be 'α'.
The command in the Python interactive session to show me how many bytes
this character will take upon encoding to UTF-8 is:
s = 'α'
s.encode('utf-8')
b'\xce\xb1'
I see that the encoding of this char takes 2 bytes. But why two exactly?
That's how the encoding is designed. Haven't you read the Wikipedia
article, which was already mentioned several times?
How do I calculate how many bits are needed to store this char into bytes?
You need to understand how UTF-8 works. Read the Wikipedia article.
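As a rough sketch of the rule the article describes: the number of bytes
UTF-8 uses depends only on which range the code point falls in. The helper
name utf8_length is made up for this example and is not part of Python.

```python
def utf8_length(ch):
    """Return how many bytes UTF-8 needs for a single character."""
    cp = ord(ch)
    if cp < 0x80:        # fits in 7 bits: 1 byte (plain ASCII)
        return 1
    if cp < 0x800:       # fits in 11 bits: 2 bytes
        return 2
    if cp < 0x10000:     # fits in 16 bits: 3 bytes
        return 3
    return 4             # everything above: 4 bytes

for ch in ('a', 'α', '€'):
    # Cross-check the range rule against what the codec actually produces.
    print(ch, ord(ch), utf8_length(ch), len(ch.encode('utf-8')))
```

945 lies between 0x80 (128) and 0x7ff (2047), which is why 'α' needs
exactly two bytes.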
Trying to do the same here but it gave me no bytes back.
s = 'a'
s.encode('utf-8')
b'a'
The encode method returns a bytes object. Its length will tell you how
many bytes there are:
>>> len(b'a')
1
>>> len(b'\xce\xb1')
2
When it displays a bytes object, the Python interpreter shows each byte
that is a printable ASCII character as that character, and every other
byte as a \x escape:
>>> ord(b'a')
97
>>> hex(97)
'0x61'
>>> b'\x61' == b'a'
True
The Python designers have decided to display this as b'a' instead of b'\x61'.
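A small illustration of that display rule, using a two-character string
chosen only for this example. It shows the same bytes object three ways:

```python
data = 'aα'.encode('utf-8')

print(data)          # printable ASCII bytes as characters, the rest as \x escapes
print(list(data))    # the underlying integer value of each byte
print(data.hex())    # the same bytes written purely in hexadecimal
```

The bytes themselves never change, only the way they are shown.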
py> c.encode('utf-8')
b'\xce\xb1'
2 bytes here. Why 2?
Same as your first question.
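To make that concrete, here is a sketch that builds the two bytes by hand.
It assumes the two-byte UTF-8 layout (110xxxxx 10xxxxxx), which applies to
code points from 0x80 to 0x7ff. The variable names are only illustrative.

```python
cp = ord('α')                            # 945, i.e. 0b01110110001 in 11 bits

# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx
first = 0b11000000 | (cp >> 6)           # marker 110 + top 5 bits of the code point
second = 0b10000000 | (cp & 0b111111)    # marker 10 + low 6 bits of the code point

manual = bytes([first, second])
print(hex(first), hex(second))
print(manual == 'α'.encode('utf-8'))
```

The 11 payload bits of 945 do not fit into one byte once the marker bits
are added, so the encoding needs two.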
py> c.encode('utf-16be')
b'\x03\xb1'
2 bytes here also. But why 3 different bytes? The ordinal value of
char 'α' is the same in Unicode. The encoding system just takes the
ordinal value and encodes it, but since it uses 2 bytes, shouldn't these
2 bytes be the same?
'utf-16be' is a different encoding scheme, thus it uses other rules to
determine how each character is translated into a byte sequence.
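A short sketch of the difference for this one character. It assumes a code
point below 0x10000 that is not a surrogate, in which case UTF-16BE simply
stores the code point as one 16-bit big-endian number:

```python
c = 'α'
cp = ord(c)                       # 945 == 0x03b1

# UTF-16BE: the code point itself, written as two bytes, high byte first.
utf16be = cp.to_bytes(2, 'big')
print(utf16be == c.encode('utf-16be'))

# Same ordinal value, different rules, different byte sequences.
print(c.encode('utf-8'))
print(c.encode('utf-16le'))
```

So b'\x03\xb1' is just 0x03b1 written out directly, while UTF-8 spreads the
same bits across two bytes that also carry marker bits.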
py> c.encode('iso-8859-7')
b'\xe1'
And also, does '\x' mean that the value is being represented in hex?
And when I do bin(65) I see '0b1000001'.
I would expect to see 8 bits of 1s and 0s. What is the 'b' trying to say?
'\x' is an escape sequence and means that the following two characters
should be interpreted as a number in hexadecimal notation (see also the
table of allowed escape sequences:
http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
).
'0b' tells you that the number is printed in binary notation.
Leading zeros are usually discarded when a number is printed:
>>> bin(70)
'0b1000110'
>>> 0b100110 == 0b00100110
True
>>> 0b100110 == 0b0000000000100110
True
It's the same with decimal notation. You wouldn't say 00123 is different
from 123, would you?
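If you do want to see all 8 bits, you can ask for zero padding explicitly.
A minimal sketch using the built-in format function, with 65 picked only
as an example value:

```python
n = 65
print(bin(n))               # leading zeros dropped, '0b' prefix included
print(format(n, '08b'))     # padded with zeros to 8 digits, no prefix
print(format(n, '#010b'))   # padded to 8 digits, with the '0b' prefix
```

The width in '#010b' is 10 because it counts the two prefix characters too.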
Bye, Andreas