encoding problem
The snippet below raises a UnicodeDecodeError.

    #!/usr/bin/env python
    #--*-- coding: utf-8 --*--
    s = 'äöü'
    u = unicode(s)

It seems that the system uses the default encoding, ASCII, to decode the UTF-8 encoded string literal, and thus raises the error.

The question is why the Python interpreter uses the default encoding instead of "utf-8", which I explicitly declared in the source.

--
http://mail.python.org/mailman/listinfo/python-list
Re: encoding problem
On Dec 19, 9:34 pm, Marc 'BlackJack' Rintsch wrote:
> On Fri, 19 Dec 2008 04:05:12 -0800, digisat...@gmail.com wrote:
> > The below snippet code generates UnicodeDecodeError.
> > #!/usr/bin/env python
> > #--*-- coding: utf-8 --*--
> > s = 'äöü'
> > u = unicode(s)
> >
> > It seems that the system use the default encoding- ASCII to decode the
> > utf8 encoded string literal, and thus generates the error.
> >
> > The question is why the Python interpreter use the default encoding
> > instead of "utf-8", which I explicitly declared in the source.
>
> Because the declaration is only for decoding unicode literals in that
> very source file.
>
> Ciao,
>         Marc 'BlackJack' Rintsch

Thanks for the answer.

I believe the declaration applies not only to unicode literals but to all literals in the source, and even to comments. We can try running a source file that has no encoding declaration and contains only one comment line with non-ASCII characters. That raises a SyntaxError, and the error message points to the PEP 263 URL.

I read PEP 263 and quote it below:

    Python's tokenizer/compiler combo will need to be updated to work as follows:

    1. read the file
    2. decode it into Unicode assuming a fixed per-file encoding
    3. convert it into a UTF-8 byte string
    4. tokenize the UTF-8 content
    5. compile it, creating Unicode objects from the given Unicode data
       and creating string objects from the Unicode literal data by first
       reencoding the UTF-8 data into 8-bit string data using the given
       file encoding

The process described above indicates that step 2 uses the declared encoding to decode the whole source, including all literals, while step 5 involves re-encoding with the declared encoding. That is the reason why we have to declare an encoding explicitly as soon as we have non-ASCII characters in the source.

Bruno explained perfectly why we need to specify an encoding when decoding a byte string. Thank you very much.
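The point about decoding a byte string can be illustrated with a minimal sketch. It is written for Python 3, where `bytes` stands in for the Python 2 8-bit `str`; the helper name `decode_or_report` is hypothetical and only there for illustration. The byte string keeps no record of the encoding declared in the source file, so decoding it needs the encoding spelled out.

```python
# The bytes that a UTF-8 encoded source file holds for the literal 'äöü'.
# In Python 2 the str object s would contain exactly these six bytes.
raw = b"\xc3\xa4\xc3\xb6\xc3\xbc"


def decode_or_report(data, encoding):
    """Decode data with encoding; return the text, or the error name on failure.

    Hypothetical helper, used only to show both outcomes side by side.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# Decoding as ASCII mirrors what unicode(s) does with the default encoding.
ascii_result = decode_or_report(raw, "ascii")

# Decoding with the encoding the bytes were actually written in succeeds.
utf8_result = decode_or_report(raw, "utf-8")

print(ascii_result)                    # UnicodeDecodeError
print(utf8_result == "\xe4\xf6\xfc")   # True
```

In the original Python 2 snippet the corresponding fix would be `unicode(s, 'utf-8')` or `s.decode('utf-8')`.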
expandtabs acts unexpectedly
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> ' test\ttest'.expandtabs(4)
' test   test'
>>> 'test \ttest'.expandtabs(4)
'test    test'

1st example: I expected 4 spaces between the two 'test' words, but 3 spaces were returned.
2nd example: I expected 5 spaces between the two 'test' words, but 4 spaces were returned.

Is this a bug or something else? Please advise.
Re: expandtabs acts unexpectedly
On Aug 19, 4:16 pm, Peter Brett wrote:
> "digisat...@gmail.com" writes:
> > Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
> > [GCC 4.3.3] on linux2
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> ' test\ttest'.expandtabs(4)
> > ' test   test'
> > >>> 'test \ttest'.expandtabs(4)
> > 'test    test'
> >
> > 1st example: expect returning 4 spaces between 'test', 3 spaces
> > returned
> > 2nd example: expect returning 5 spaces between 'test', 4 spaces
> > returned
> >
> > Is it a bug or something, please advice.
>
> Consider where the 4-space tabstops are relative to those strings:
>
>  test   test
> test    test
> ^   ^   ^
>
> So no, it's not a bug.
>
> If you just want to replace the tab characters by spaces, use:
>
> >>> " test\ttest".replace("\t", "    ")
> ' test    test'
> >>> "test \ttest".replace("\t", "    ")
> 'test     test'
>
> HTH,
>
> Peter
>
> --
> Peter Brett
> Remote Sensing Research Group
> Surrey Space Centre

You corrected my understanding of tab stops. Great explanation. Thank you so much.
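The tab stop rule Peter describes can be sketched as a short function. This is an illustrative reimplementation, not the interpreter's own code; the name `expand_tabs` is hypothetical, and the sketch assumes the input contains no newline characters (the built-in `str.expandtabs` also restarts the column count at each line break).

```python
def expand_tabs(text, tabsize=4):
    """Replace each tab with enough spaces to reach the next tab stop.

    Tab stops sit at columns 0, tabsize, 2 * tabsize, and so on, so the
    padding depends on the current column and is not a fixed width.
    Assumes text holds a single line (no newline handling).
    """
    out = []
    column = 0
    for ch in text:
        if ch == "\t":
            # Distance from the current column to the next multiple of tabsize.
            pad = tabsize - (column % tabsize)
            out.append(" " * pad)
            column += pad
        else:
            out.append(ch)
            column += 1
    return "".join(out)


for sample in (" test\ttest", "test \ttest"):
    expanded = expand_tabs(sample)
    # The sketch agrees with the built-in method on these single-line inputs.
    assert expanded == sample.expandtabs(4)
    print(repr(expanded))
```

In both samples the tab sits at column 5, so it expands to 3 spaces to reach the stop at column 8. That gives 3 spaces in the first string and 1 + 3 = 4 spaces in the second.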