On 10.06.2013 09:10, nagia.rets...@gmail.com wrote:
On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:

py> c = 'α'
py> ord(c)
945

The number 945 is the ordinal value of the character 'α' in the Unicode
charset, correct?

Yes, the Unicode character set is essentially a big numbered list of characters. The character at position 945 in that list (counting from 0) happens to be 'α'.
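As a quick check of that correspondence, here is a small sketch showing that ord() and chr() are inverses of each other in Python 3 (the variable names are only for illustration):

```python
c = 'α'
code_point = ord(c)            # position of the character in Unicode

print(code_point)              # 945
print(hex(code_point))         # 0x3b1, conventionally written U+03B1
print(chr(code_point) == c)    # True: chr() maps the number back
```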

The command in the Python interactive session to show me how many bytes
this character will take upon encoding to utf-8 is:

s = 'α'
s.encode('utf-8')
b'\xce\xb1'

I see that the encoding of this char takes 2 bytes. But why two exactly?

That's how the encoding is designed. Haven't you read the Wikipedia article, which has already been mentioned several times?

How do I calculate how many bits are needed to store this char into bytes?

You need to understand how UTF-8 works. Read the Wikipedia article.
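For illustration only, here is a rough sketch of the rule UTF-8 uses: the number of bytes depends on which range the code point falls into, and a 2-byte sequence has the bit pattern 110xxxxx 10xxxxxx. The helper names utf8_length and encode_two_byte are made up for this example; they are not part of Python.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one code point (surrogates ignored)."""
    if code_point < 0x80:        # fits in 7 bits
        return 1
    if code_point < 0x800:       # fits in 11 bits
        return 2
    if code_point < 0x10000:     # fits in 16 bits
        return 3
    return 4                     # up to U+10FFFF

def encode_two_byte(code_point):
    """Build the 2-byte form 110xxxxx 10xxxxxx by hand."""
    assert 0x80 <= code_point < 0x800
    first = 0b11000000 | (code_point >> 6)          # marker 110 + top 5 bits
    second = 0b10000000 | (code_point & 0b111111)   # marker 10 + low 6 bits
    return bytes([first, second])

alpha = ord('α')                     # 945
print(alpha.bit_length())            # 10, too many bits for a single byte
print(utf8_length(alpha))            # 2
print(encode_two_byte(alpha))        # b'\xce\xb1'
print(encode_two_byte(alpha) == 'α'.encode('utf-8'))   # True
```

So 'α' takes two bytes because its 10 significant bits do not fit into the 7 payload bits of a single-byte sequence, but do fit into the 11 payload bits of a two-byte sequence.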

Trying to do the same here, but it gave me no bytes back.

s = 'a'
s.encode('utf-8')
b'a'

The encode method returns a bytes object. Its length will tell you how many bytes there are:

>>> len(b'a')
1
>>> len(b'\xce\xb1')
2

When displaying a bytes object, the Python interpreter shows every byte value below 128 as its ASCII character if that character is printable; all other values are shown as escape sequences:

>>> ord(b'a')
97
>>> hex(97)
'0x61'
>>> b'\x61' == b'a'
True

The Python designers decided to display this byte as b'a' rather than b'\x61'.
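A small sketch of that display rule, using an arbitrary mix of byte values chosen for this example:

```python
# 0x61 and 0x7e are printable ASCII; 0x0a is a newline;
# 0xce, 0xb1 and 0x7f have no printable ASCII representation.
data = bytes([0x61, 0xce, 0xb1, 0x0a, 0x7e, 0x7f])

print(data)          # b'a\xce\xb1\n~\x7f'
print(list(data))    # [97, 206, 177, 10, 126, 127]
print(len(data))     # 6: one byte per value, however it is displayed
```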

py> c.encode('utf-8')
b'\xce\xb1'

2 bytes here. why 2?

Same as your first question.

py> c.encode('utf-16be')
b'\x03\xb1'

2 bytes here also, but why 3 different bytes? The ordinal value of
char 'α' is the same in Unicode. The encoding system just takes the
ordinal value and encodes it, but since it uses 2 bytes, shouldn't these
2 bytes be the same?

'utf-16be' is a different encoding scheme, thus it uses other rules to determine how each character is translated into a byte sequence.
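To make the difference concrete, here is a sketch of what 'utf-16be' does for a character like 'α': it simply stores the code point as a 16-bit number, most significant byte first. This shortcut only holds for code points below 0x10000 (outside the surrogate range); higher ones need two 16-bit units.

```python
code_point = ord('α')                     # 945 == 0x03B1
manual = code_point.to_bytes(2, 'big')    # two bytes, big-endian

print(manual)                             # b'\x03\xb1'
print(manual == 'α'.encode('utf-16be'))   # True
print('α'.encode('utf-16le'))             # b'\xb1\x03', same bytes swapped
print('α'.encode('utf-8'))                # b'\xce\xb1', different rules
```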

py> c.encode('iso-8859-7')
b'\xe1'

And also, does '\x' mean that the value is being represented in hex?
And when I run bin(65) I see '0b1000001'.

I would expect to see 8 bits of 1s and 0s. What is the 'b' trying to say?

'\x' is an escape sequence and means that the following two characters should be interpreted as a number in hexadecimal notation (see also the table of allowed escape sequences: http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals ).
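A short sketch of how the '\x' escape behaves in practice; the raw literal at the end is only there for contrast:

```python
print(b'\xce\xb1'[0])     # 206: indexing bytes gives the integer value
print(0xce == 206)        # True: ce is that value in hexadecimal
print('\x41' == 'A')      # True: the escape works in str literals too
print(len(b'\xce'))       # 1: the four typed characters form one byte
print(len(rb'\xce'))      # 4: in a raw literal the escape is not interpreted
```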

'0b' tells you that the number is printed in binary notation.
Leading zeros are usually discarded when a number is printed:
>>> bin(70)
'0b1000110'
>>> 0b100110 == 0b00100110
True
>>> 0b100110 == 0b0000000000100110
True

It's the same with decimal notation. You wouldn't say 00123 is different from 123, would you?
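If you do want to see all 8 bits, one option is to ask format() for zero padding. A minimal sketch:

```python
print(bin(70))                 # '0b1000110', leading zero dropped
print(format(70, '08b'))       # '01000110', padded to 8 digits
print(format(70, '#010b'))     # '0b01000110', padded, with the 0b prefix

# The two UTF-8 bytes of 'α', each shown as 8 bits:
print([format(b, '08b') for b in 'α'.encode('utf-8')])
# ['11001110', '10110001']
```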

Bye, Andreas
--
http://mail.python.org/mailman/listinfo/python-list