Re: Changing filenames from Greeklish => Greek (subprocess complain)

Andreas Perstinger Mon, 10 Jun 2013 01:19:44 -0700

On 10.06.2013 09:10, nagia.rets...@gmail.com wrote:

On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:

py> c = 'α'
py> ord(c)
945


The number 945 is the ordinal value of the character 'α' in the Unicode charset,
correct?

Yes, the Unicode character set is just a big list of characters. The 946th character in that list (counting the first one as number 0) happens to be 'α'.

The command in the python interactive session to show me how many bytes
this character will take upon encoding to utf-8 is:

s = 'α'
s.encode('utf-8')

b'\xce\xb1'

I see that the encoding of this char takes 2 bytes. But why two exactly?

That's how the encoding is designed. Haven't you read the Wikipedia article which was already mentioned several times?

How do I calculate how many bits are needed to store this char into bytes?


You need to understand how UTF-8 works. Read the Wikipedia article.

Trying to do the same here but it gave me no bytes back.
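As a rough sketch of what the article describes: UTF-8 uses 1 byte for code points below 0x80, 2 bytes below 0x800, 3 bytes below 0x10000 and 4 bytes above that. The helper names `utf8_length` and `encode_two_byte` are made up for this illustration and are not part of Python; the sketch also ignores surrogates, which cannot be encoded at all.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs for one code point (surrogates ignored)."""
    if code_point < 0x80:        # fits in 7 bits
        return 1
    if code_point < 0x800:       # fits in 11 bits
        return 2
    if code_point < 0x10000:     # fits in 16 bits
        return 3
    return 4                     # up to 21 bits


def encode_two_byte(code_point):
    """Build the 2-byte form 110xxxxx 10xxxxxx by hand (0x80..0x7FF only)."""
    if not 0x80 <= code_point < 0x800:
        raise ValueError("not a 2-byte code point")
    first = 0b11000000 | (code_point >> 6)          # marker 110 + top 5 bits
    second = 0b10000000 | (code_point & 0b111111)   # marker 10 + low 6 bits
    return bytes([first, second])


alpha = ord('α')                      # 945, binary 1110110001
print(utf8_length(alpha))             # 2
print(encode_two_byte(alpha))         # b'\xce\xb1'
print(encode_two_byte(alpha) == 'α'.encode('utf-8'))  # True
```

945 needs 10 bits, which is more than the 7 bits a single UTF-8 byte can carry, so it falls into the 2-byte range.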

s = 'a'
s.encode('utf-8')

b'a'

The encode method returns a bytes object. Its length will tell you how many bytes there are:


>>> len(b'a')
1
>>> len(b'\xce\xb1')
2

The Python interpreter displays a byte as an ASCII character if its value falls in the printable ASCII range (32 to 126); all other values are shown as '\x' escapes:


>>> ord(b'a')
97
>>> hex(97)
'0x61'
>>> b'\x61' == b'a'
True

The Python designers have decided to use b'a' instead of b'\x61'.
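A small sketch of that display rule; the variable names are only for illustration. The same kind of object is shown in two different ways depending on the byte values it holds:

```python
printable = bytes([0x61])      # 97, a printable ASCII value
non_printable = bytes([0x01])  # a control character
high = bytes([0xce, 0xb1])     # values above 127

print(repr(printable))       # b'a'
print(repr(non_printable))   # b'\x01'
print(repr(high))            # b'\xce\xb1'
print(list(high))            # [206, 177]
```

Converting to a list shows the underlying integers, whatever the display form is.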

py> c.encode('utf-8')
b'\xce\xb1'


2 bytes here. Why 2?


Same as your first question.

py> c.encode('utf-16be')
b'\x03\xb1'


2 bytes here also, but why 3 different bytes? The ordinal value of
char 'a' is the same in Unicode. The encoding system just takes the
ordinal value and encodes it, but since it uses 2 bytes, shouldn't these 2 bytes
be the same?

'utf-16be' is a different encoding scheme, thus it uses other rules to determine how each character is translated into a byte sequence.
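To make the difference concrete, here is a minimal sketch comparing the two schemes for 'α'. It assumes a code point below 0x10000 that is not a surrogate, where UTF-16BE is simply the 16-bit number written with the most significant byte first:

```python
code_point = ord('α')                      # 945 == 0x03B1

# UTF-16BE: the 16-bit value itself, most significant byte first.
utf16be = code_point.to_bytes(2, 'big')
print(utf16be)                             # b'\x03\xb1'
print(utf16be == 'α'.encode('utf-16be'))   # True

# UTF-8 spreads the same bits over the pattern 110xxxxx 10xxxxxx,
# so the resulting bytes differ although the code point is the same.
print('α'.encode('utf-8'))                 # b'\xce\xb1'
```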

py> c.encode('iso-8859-7')
b'\xe1'


Also, does '\x' mean that the value is being represented in hex?
And when I run bin(6) I see '0b1000001'

I would expect to see 8 bits of 1s and 0s. What is the 'b' trying to say?

'\x' is an escape sequence and means that the following two characters should be interpreted as a number in hexadecimal notation (see also the table of allowed escape sequences: http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals).
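A few quick checks of how the '\x' escape behaves, as an illustrative sketch (the variable names are made up):

```python
same = b'\x61' == b'a'       # 0x61 is 97, the code for 'a'
two = len(b'\xce\xb1')       # each \xNN escape stands for one byte
letter = '\x41'              # the escape works in str literals too
raw_len = len(r'\x41')       # a raw string keeps the backslash as-is

print(same)      # True
print(two)       # 2
print(letter)    # A
print(raw_len)   # 4
```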


'0b' tells you that the number is printed in binary notation.
Leading zeros are usually discarded when a number is printed:
>>> bin(70)
'0b1000110'
>>> 0b100110 == 0b00100110
True
>>> 0b100110 == 0b0000000000100110
True

It's the same with decimal notation. You wouldn't say 00123 is different from 123, would you?
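If you do want to see all 8 bits, you can ask for zero padding explicitly. A short sketch using the built-in format():

```python
n = 70

plain = bin(n)                   # leading zeros dropped
padded = format(n, '08b')        # zero-padded to 8 digits, no prefix
prefixed = format(n, '#010b')    # width 10 includes the '0b' prefix

print(plain)                 # 0b1000110
print(padded)                # 01000110
print(prefixed)              # 0b01000110
print(int(padded, 2))        # 70
```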


Bye, Andreas
