[ python-Bugs-1743795 ] Some incorrect national characters (Polish) in unicodedata
Bugs item #1743795, was opened at 2007-06-26 20:45
Message generated for change (Comment added) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1743795&group_id=5470

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: admindomeny (admindomeny)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Some incorrect national characters (Polish) in unicodedata

Initial Comment:
Hello,

This problem concerns pythonwin (I haven't checked whether unix/commandline python is affected), Python 2.5.1. Examples are on the attached screenshot. E.g.

    print u'\N{LATIN SMALL LETTER A WITH CIRCUMFLEX}'

prints the wrong character (a latin small a with some caret above it, it seems), and

    print unicodedata.name( / latin small letter a with circumflex, typed in Windows using Polish "programmer's keyboard" / )

produces 'SUPERSCRIPT ONE', which is obviously incorrect.

--

>Comment By: M.-A. Lemburg (lemburg)
Date: 2007-06-27 10:28

Message:
Logged In: YES
user_id=38388
Originator: NO

This sounds more like a problem with entry of Unicode characters in pythonwin than with the unicodedata module.

Please create a test.py file with the character using e.g. UTF-8 as source code encoding and run that through the Python interpreter directly to see if the problem persists.

--

___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[ python-Bugs-1636950 ] Newline skipped in "for line in file"
Bugs item #1636950, was opened at 2007-01-16 17:56
Message generated for change (Comment added) made by runedevik
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1636950&group_id=5470

Category: Python Library
Group: Python 2.5
Status: Closed
Resolution: Invalid
Priority: 5
Private: No
Submitted By: Andy Monthei (amonthei)
Assigned to: Nobody/Anonymous (nobody)
Summary: Newline skipped in "for line in file"

Initial Comment:
When processing huge fixed block files, about 7000 bytes wide and several hundred thousand lines long, some pairs of lines get read as one long line with no line break when using "for line in file:". The problem is even worse when using the fileinput module: reading in five or six huge files consisting of 4.8 million records causes several hundred pairs of lines to be read as single lines. When a newline is skipped, it is usually followed by several more in the next few hundred lines. I have not noticed any other characters being skipped, only the line break.

O.S. Windows (5, 1, 2600, 2, 'Service Pack 2')
Python 2.5

--

Comment By: Rune Devik (runedevik)
Date: 2007-06-27 12:00

Message:
Logged In: YES
user_id=1212666
Originator: NO

Hi

I have the same problem with a huge file (8GB) containing long lines. Sometimes two lines are merged into one, and when rerunning the test script that reads the file it's always the same lines that are merged. The merging also seems to happen more frequently towards the end of the file. I tried to reproduce it with a smaller data set (the 10 lines before the two lines that get merged, the two lines that get merged, and the 10 lines after that), but I was not able to reproduce it on this smaller data set.

However, if you open this huge file in "rb" mode instead of "r" mode, everything works as it should and no lines are merged at all! If I copy the file over to linux and rerun the test script, no lines are merged (regardless of whether the mode is "r" or "rb"), so this is windows specific and might have something to do with the adding of \r\n if only \n is found when you open the file in "r" mode, maybe?

I have also reproduced it on both python 2.3.5 and 2.5c1, on both windows XP and windows 2003. More stats on the input file in both "r" mode and "rb" mode below:

Input file size: 8 695 828 KB

fp = open(file, "r"):
- total number of lines read: 668909
- length of the longest line: 13179792
- length of the shortest line: 89
- 56 lines contain the content of two lines
- Always just two lines that are merged into one!
- Always the same lines that are merged when rerunning the test on the same file.

open(file, "rb"):
- total number of lines read: 668965
- length of the longest line: 13179793
- length of the shortest line: 90
- no lines merged

Regards,
Rune Devik

--

Comment By: Brett Cannon (bcannon)
Date: 2007-01-21 01:46

Message:
Logged In: YES
user_id=357491
Originator: NO

Well, with Andy saying he can't reproduce the problem I am going to close as invalid. Andy, if you ever happen to be able to upload data that triggers it, then please re-open this bug.

--

Comment By: Andy Monthei (amonthei)
Date: 2007-01-20 23:53

Message:
Logged In: YES
user_id=1693612
Originator: YES

I have had no luck creating random data to reproduce the problem, which leads me to the conclusion that it was the data itself. Using a hex editor I find no problem with the line breaks.

The data that triggers this bug is transferred several times before it gets to me. It originates on a Unix box, then goes to an IBM mainframe, then to my Windows machine, with many updates along the way. It may be an EBCDIC/ASCII conversion or possibly something to do with the mainframe to PC transfer.

Whatever it is, it's in the data itself. The only thing that bothers me is that Java somehow is not affected by this bad data.

--

Comment By: Andy Monthei (amonthei)
Date: 2007-01-18 16:34

Message:
Logged In: YES
user_id=1693612
Originator: YES

I am using open() for reading the file, no other features. I have also had fileinput.input(fileList) compound the problem.

Each file that this has happened to is a fixed block file of either 6990 or 7700 bytes wide, but I think this is insignificant. When looking at the file in a hex editor everything looks fine, and a small Java program using a buffered reader will give me the correct line count when Python does not.

Using something like fp.read(8192) might, I'm sure, temporarily solve my problem, but I will keep working on getting a file I can upload.
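The "r" versus "rb" comparison reported in this thread can be sketched as a small diagnostic. This is only an illustration written for current Python 3, not the script either reporter used; the helper name `line_stats` and the `latin-1` decoding are assumptions chosen so that arbitrary bytes can be read in text mode.

```python
def line_stats(path, mode):
    """Return (line count, longest line, shortest line) for one open() mode."""
    count, longest, shortest = 0, 0, None
    # Assumption: latin-1 maps every byte to a character, so text mode
    # never fails on undecodable input. Binary mode takes no encoding.
    kwargs = {} if "b" in mode else {"encoding": "latin-1"}
    with open(path, mode, **kwargs) as fp:
        for line in fp:
            count += 1
            longest = max(longest, len(line))
            shortest = len(line) if shortest is None else min(shortest, len(line))
    return count, longest, shortest


def compare_modes(path):
    """Collect the statistics for text mode and binary mode side by side."""
    return {"r": line_stats(path, "r"), "rb": line_stats(path, "rb")}
```

If the two modes report different line counts for the same file, the difference comes from how text mode treats the bytes, which is the symptom described above. A length difference of exactly one per line, as in the reported figures, is what newline translation of `\r\n` to `\n` would produce.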
[ python-Bugs-1668295 ] Strange unicode behaviour
Bugs item #1668295, was opened at 2007-02-25 06:10
Message generated for change (Comment added) made by massysett
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1668295&group_id=5470

Category: None
Group: None
Status: Closed
Resolution: Invalid
Priority: 5
Private: No
Submitted By: Santiago Gala (sgala)
Assigned to: Nobody/Anonymous (nobody)
Summary: Strange unicode behaviour

Initial Comment:
I know that python is very funny WRT unicode processing, but this defies all my knowledge.

I use the es_ES.UTF-8 encoding on linux. The script:

    python -c "print unicode('á %s' % 'éí','utf8') "

works, i.e., prints á éí in the next line.

However, if I redirect it to less or to a file, like

    python -c "print unicode('á %s' % 'éí','utf8') " >test
    Traceback (most recent call last):
      File "<string>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

Why is the behaviour different when stdout is redirected? How can I get it to do "the right thing" in both cases?

--

Comment By: Omari Norman (massysett)
Date: 2007-06-27 07:29

Message:
Logged In: YES
user_id=1764292
Originator: NO

The fix given by gbrandl, which is to use

    sys.stdout = codecs.EncodedFile(sys.stdout, 'utf-8')

does not work. EncodedFile expects to receive encoded strings, so if you try to use it with Unicode strings, you get errors.

I also tried

    sys.stdout = codecs.open(sys.stdout, 'w', 'utf-8')

but that gives me "TypeError: coercing to Unicode: need string or buffer, file found."

Since this was (absurdly) closed as invalid, are there any good fixes that actually work?

--

Comment By: Santiago Gala (sgala)
Date: 2007-03-03 08:22

Message:
Logged In: YES
user_id=178886
Originator: YES

>This is not magic. "print" looks for an "encoding" attribute on the file
>it is printing to. This is the terminal encoding for sys.stdout and None
>for other files.

I'll correct you:

"print" looks for an "encoding" attribute on the file it is printing to. This is the terminal encoding for sys.stdout *if sys.stdout is a terminal* and None when sys.stdout is not a terminal.

After all, the bug reported is that *the same program* behaved differently when used standalone than when piped to less:

    $ python -c "import sys; print sys.stdout.encoding"
    UTF-8
    $ python -c "import sys; print sys.stdout.encoding" | cat
    None

If you say that this is intended, not a bug, that an external process is altering the behavior of a python program, I'd just leave it written to warn other naive people like myself, who think that an external program should not influence python behavior (with *the same environment*):

    $ locale
    LANG=es_ES.UTF-8
    LC_CTYPE="es_ES.UTF-8"
    LC_NUMERIC="es_ES.UTF-8"
    LC_TIME="es_ES.UTF-8"
    LC_COLLATE="es_ES.UTF-8"
    LC_MONETARY="es_ES.UTF-8"
    LC_MESSAGES="es_ES.UTF-8"
    LC_PAPER="es_ES.UTF-8"
    LC_NAME="es_ES.UTF-8"
    LC_ADDRESS="es_ES.UTF-8"
    LC_TELEPHONE="es_ES.UTF-8"
    LC_MEASUREMENT="es_ES.UTF-8"
    LC_IDENTIFICATION="es_ES.UTF-8"
    LC_ALL=es_ES.UTF-8

But I take it as a design flaw, and against all pythonic principles, probably coming from the fact that a lot of python developers/users are windows people that don't care about stdout at all. IMO, the behavior should be either:

- use always None for sys.stdout
- use always LC_CTYPE or LANG for sys.stdout

I prefer the second one, as when I pipe stdout, after all, I expect it to be honoring my locale settings. Don't forget that the same person that types "|" after a call to python can type LC_ALL=blah before, while s/he sometimes can't modify the script because it is out of their permission set.

The implementation logic would be simpler too, I guess. And more consistent with jython (it uses the second "always LC_CTYPE" solution). Not sure about iron-python or pypy.

--

Comment By: Georg Brandl (gbrandl)
Date: 2007-02-25 18:27

Message:
Logged In: YES
user_id=849994
Originator: NO

> >>> sys.getfilesystemencoding()
> 'UTF-8'
>
> so python is really dumb if print does not know my filesystemencoding, but
> knows my terminal encoding.

the file system encoding is the encoding of file names, not of file content.

> I thought breaking the least surprising behaviour was not considered
> pythonic, and now you tell me that having a program running on console but
> issuing an exception when redirected is intended. I would prefer an
> exception in both cases. Or, even better, using
> sys.getfilesystemencoding(), or allowing me to set defaultencoding()

I agree that using the terminal encoding is perhaps a bit too DWIMish, but you can always get con
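The "always use LC_CTYPE or LANG" policy proposed in this thread can be sketched as a fallback lookup. This is a minimal illustration, not how the interpreter chooses an encoding; `effective_encoding` is a hypothetical helper, and current Python 3 generally sets `sys.stdout.encoding` even when output is piped, so the fallback mainly mirrors the Python 2 situation described above.

```python
import locale


def effective_encoding(stream):
    """Return the stream's own encoding, else the locale's preferred one."""
    encoding = getattr(stream, "encoding", None)
    if encoding:
        return encoding
    # Hypothetical policy from the discussion: when the stream reports no
    # encoding (for example a pipe under Python 2), honour the locale.
    return locale.getpreferredencoding()
```

A script following this policy would encode its Unicode output with `effective_encoding(sys.stdout)` before writing, so that redirecting the output does not change which codec is used.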
[ python-Bugs-1668295 ] Strange unicode behaviour
Bugs item #1668295, was opened at 2007-02-25 11:10
Message generated for change (Comment added) made by gbrandl
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1668295&group_id=5470

Category: None
Group: None
Status: Closed
Resolution: Invalid
Priority: 5
Private: No
Submitted By: Santiago Gala (sgala)
Assigned to: Nobody/Anonymous (nobody)
Summary: Strange unicode behaviour

Initial Comment:
I know that python is very funny WRT unicode processing, but this defies all my knowledge.

I use the es_ES.UTF-8 encoding on linux. The script:

    python -c "print unicode('á %s' % 'éí','utf8') "

works, i.e., prints á éí in the next line.

However, if I redirect it to less or to a file, like

    python -c "print unicode('á %s' % 'éí','utf8') " >test
    Traceback (most recent call last):
      File "<string>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

Why is the behaviour different when stdout is redirected? How can I get it to do "the right thing" in both cases?

--

>Comment By: Georg Brandl (gbrandl)
Date: 2007-06-27 12:04

Message:
Logged In: YES
user_id=849994
Originator: NO

Yes, I'm sorry my fix is bad, you should rather use

    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

--

Comment By: Omari Norman (massysett)
Date: 2007-06-27 11:29

Message:
Logged In: YES
user_id=1764292
Originator: NO

The fix given by gbrandl, which is to use

    sys.stdout = codecs.EncodedFile(sys.stdout, 'utf-8')

does not work. EncodedFile expects to receive encoded strings, so if you try to use it with Unicode strings, you get errors.

I also tried

    sys.stdout = codecs.open(sys.stdout, 'w', 'utf-8')

but that gives me "TypeError: coercing to Unicode: need string or buffer, file found."

Since this was (absurdly) closed as invalid, are there any good fixes that actually work?

--

Comment By: Santiago Gala (sgala)
Date: 2007-03-03 13:22

Message:
Logged In: YES
user_id=178886
Originator: YES

>This is not magic. "print" looks for an "encoding" attribute on the file
>it is printing to. This is the terminal encoding for sys.stdout and None
>for other files.

I'll correct you:

"print" looks for an "encoding" attribute on the file it is printing to. This is the terminal encoding for sys.stdout *if sys.stdout is a terminal* and None when sys.stdout is not a terminal.

After all, the bug reported is that *the same program* behaved differently when used standalone than when piped to less:

    $ python -c "import sys; print sys.stdout.encoding"
    UTF-8
    $ python -c "import sys; print sys.stdout.encoding" | cat
    None

If you say that this is intended, not a bug, that an external process is altering the behavior of a python program, I'd just leave it written to warn other naive people like myself, who think that an external program should not influence python behavior (with *the same environment*):

    $ locale
    LANG=es_ES.UTF-8
    LC_CTYPE="es_ES.UTF-8"
    LC_NUMERIC="es_ES.UTF-8"
    LC_TIME="es_ES.UTF-8"
    LC_COLLATE="es_ES.UTF-8"
    LC_MONETARY="es_ES.UTF-8"
    LC_MESSAGES="es_ES.UTF-8"
    LC_PAPER="es_ES.UTF-8"
    LC_NAME="es_ES.UTF-8"
    LC_ADDRESS="es_ES.UTF-8"
    LC_TELEPHONE="es_ES.UTF-8"
    LC_MEASUREMENT="es_ES.UTF-8"
    LC_IDENTIFICATION="es_ES.UTF-8"
    LC_ALL=es_ES.UTF-8

But I take it as a design flaw, and against all pythonic principles, probably coming from the fact that a lot of python developers/users are windows people that don't care about stdout at all. IMO, the behavior should be either:

- use always None for sys.stdout
- use always LC_CTYPE or LANG for sys.stdout

I prefer the second one, as when I pipe stdout, after all, I expect it to be honoring my locale settings. Don't forget that the same person that types "|" after a call to python can type LC_ALL=blah before, while s/he sometimes can't modify the script because it is out of their permission set.

The implementation logic would be simpler too, I guess. And more consistent with jython (it uses the second "always LC_CTYPE" solution). Not sure about iron-python or pypy.

--

Comment By: Georg Brandl (gbrandl)
Date: 2007-02-25 23:27

Message:
Logged In: YES
user_id=849994
Originator: NO

> >>> sys.getfilesystemencoding()
> 'UTF-8'
>
> so python is really dumb if print does not know my filesystemencoding, but
> knows my terminal encoding.

the file system encoding is the encoding of file names, not of file content.

> I thought breaking the least surprising behaviour was not considered
> pythonic, and now you tell me that having a program running on console bu
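The `codecs.getwriter` approach suggested in this thread can be tried without touching the real `sys.stdout`. The sketch below wraps an in-memory byte stream instead; the names `utf8_writer` and `sink` are illustrative assumptions, and the byte stream merely stands in for a redirected stdout.

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Wrap a byte stream so Unicode text written to it is UTF-8 encoded."""
    return codecs.getwriter("utf-8")(byte_stream)


sink = io.BytesIO()              # stands in for a redirected stdout
out = utf8_writer(sink)
out.write(u"\xe1 \xe9\xed\n")    # the same text as in the report
encoded = sink.getvalue()        # the UTF-8 bytes that reached the stream
```

The wrapper encodes on every write, so it does not depend on the stream having an `encoding` attribute. Under Python 3 the standard streams are already text streams, so the equivalent wrapping would presumably target `sys.stdout.buffer` rather than `sys.stdout` itself.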
[ python-Bugs-1743795 ] Some incorrect national characters (Polish) in unicodedata
Bugs item #1743795, was opened at 2007-06-26 18:45
Message generated for change (Comment added) made by admindomeny
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1743795&group_id=5470

Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: admindomeny (admindomeny)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Some incorrect national characters (Polish) in unicodedata

Initial Comment:
Hello,

This problem concerns pythonwin (I haven't checked whether unix/commandline python is affected), Python 2.5.1. Examples are on the attached screenshot. E.g.

    print u'\N{LATIN SMALL LETTER A WITH CIRCUMFLEX}'

prints the wrong character (a latin small a with some caret above it, it seems), and

    print unicodedata.name( / latin small letter a with circumflex, typed in Windows using Polish "programmer's keyboard" / )

produces 'SUPERSCRIPT ONE', which is obviously incorrect.

--

>Comment By: admindomeny (admindomeny)
Date: 2007-06-27 17:25

Message:
Logged In: YES
user_id=1829093
Originator: YES

You were correct: the attached test file for Polish national characters shows the correct character encodings when run in Pythonwin, provided the file was edited correctly as Unicode with Polish characters.

The problem of entering characters in Pythonwin remains, however (OS: Win XP SP2, Polish edition):

I have tried changing fonts to what are Unicode fonts as far as I know (Times New Roman, Arial, etc.), including CE fonts as well. It doesn't work.

I made sure that the Polish Programmer's Keyboard is turned on, which gives me the correct encoding in almost all Windows applications, including Unicode editors like UniRed.

Still, the Pythonwin shell in particular thinks that AltGr+a (the standard way of entering 'LATIN SMALL LETTER A WITH OGONEK') is actually 'SUPERSCRIPT ONE', for example.

So, to summarize:

1. IDLE edits the text in Unicode correctly provided there's a #-*- coding: utf-8 -*- header in the first line.
2. Pythonwin executes that file correctly.
3. Pythonwin enters national characters INCORRECTLY (at least as far as Polish is concerned, but I suspect it's also the case with other languages).

File Added: test.py

--

Comment By: M.-A. Lemburg (lemburg)
Date: 2007-06-27 08:28

Message:
Logged In: YES
user_id=38388
Originator: NO

This sounds more like a problem with entry of Unicode characters in pythonwin than with the unicodedata module.

Please create a test.py file with the character using e.g. UTF-8 as source code encoding and run that through the Python interpreter directly to see if the problem persists.

--
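One possible explanation for the 'SUPERSCRIPT ONE' result, offered here only as an assumption that the thread does not confirm, is a code page mix-up: the byte that the Windows-1250 code page uses for a-ogonek is the same byte that Latin-1 uses for the superscript one. The sketch below checks that coincidence with `unicodedata`; it says nothing about what Pythonwin actually does internally.

```python
import unicodedata

a_ogonek = u"\u0105"               # LATIN SMALL LETTER A WITH OGONEK
raw = a_ogonek.encode("cp1250")    # the single byte used by Windows-1250
misread = raw.decode("latin-1")    # the same byte interpreted as Latin-1

# Character names before and after the misinterpretation.
names = (unicodedata.name(a_ogonek), unicodedata.name(misread))
```

If the shell received the keystroke as a Windows-1250 byte and widened it to Unicode without decoding, the observed name would follow from this mapping.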
[ python-Bugs-1743795 ] Some incorrect national characters (Polish) in unicodedata
Bugs item #1743795, was opened at 2007-06-26 20:45
Message generated for change (Comment added) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1743795&group_id=5470

Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: admindomeny (admindomeny)
>Assigned to: Mark Hammond (mhammond)
Summary: Some incorrect national characters (Polish) in unicodedata

Initial Comment:
Hello,

This problem concerns pythonwin (I haven't checked whether unix/commandline python is affected), Python 2.5.1. Examples are on the attached screenshot. E.g.

    print u'\N{LATIN SMALL LETTER A WITH CIRCUMFLEX}'

prints the wrong character (a latin small a with some caret above it, it seems), and

    print unicodedata.name( / latin small letter a with circumflex, typed in Windows using Polish "programmer's keyboard" / )

produces 'SUPERSCRIPT ONE', which is obviously incorrect.

--

>Comment By: M.-A. Lemburg (lemburg)
Date: 2007-06-27 21:38

Message:
Logged In: YES
user_id=38388
Originator: NO

Assigning to Mark Hammond, who wrote Pythonwin.

--

Comment By: admindomeny (admindomeny)
Date: 2007-06-27 19:25

Message:
Logged In: YES
user_id=1829093
Originator: YES

You were correct: the attached test file for Polish national characters shows the correct character encodings when run in Pythonwin, provided the file was edited correctly as Unicode with Polish characters.

The problem of entering characters in Pythonwin remains, however (OS: Win XP SP2, Polish edition):

I have tried changing fonts to what are Unicode fonts as far as I know (Times New Roman, Arial, etc.), including CE fonts as well. It doesn't work.

I made sure that the Polish Programmer's Keyboard is turned on, which gives me the correct encoding in almost all Windows applications, including Unicode editors like UniRed.

Still, the Pythonwin shell in particular thinks that AltGr+a (the standard way of entering 'LATIN SMALL LETTER A WITH OGONEK') is actually 'SUPERSCRIPT ONE', for example.

So, to summarize:

1. IDLE edits the text in Unicode correctly provided there's a #-*- coding: utf-8 -*- header in the first line.
2. Pythonwin executes that file correctly.
3. Pythonwin enters national characters INCORRECTLY (at least as far as Polish is concerned, but I suspect it's also the case with other languages).

File Added: test.py

--

Comment By: M.-A. Lemburg (lemburg)
Date: 2007-06-27 10:28

Message:
Logged In: YES
user_id=38388
Originator: NO

This sounds more like a problem with entry of Unicode characters in pythonwin than with the unicodedata module.

Please create a test.py file with the character using e.g. UTF-8 as source code encoding and run that through the Python interpreter directly to see if the problem persists.

--
[ python-Feature Requests-1720992 ] automatic imports
Feature Requests item #1720992, was opened at 2007-05-17 18:36
Message generated for change (Comment added) made by aflag
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1720992&group_id=5470

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Juan Manuel Borges Caño (juanmabc3)
Assigned to: Nobody/Anonymous (nobody)
Summary: automatic imports

Initial Comment:
I don't need to declare a variable, but I need to declare a module, i.e. import module. Can this be done automatically, so that time.strftime triggers import time automatically? It would be in the spirit of the python language; it saves typing and keeps the imports synchronized with changes to the source code.

--

Comment By: Rafael Cunha de Almeida (aflag)
Date: 2007-06-28 00:35

Message:
Logged In: YES
user_id=856271
Originator: NO

I don't think this is a very good idea. If you'll be using the module name all the time, then you might as well simply import it. Besides, I think it could lead to confusion. You may have an object named fo in your script, and if you mistakenly type foo a module will get imported and you won't understand anything. Especially if the module is something in your PYTHONPATH that you don't even know about.

--
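What is requested here can be approximated in user code today, which also makes the objection raised in the reply concrete. The sketch below is a hypothetical helper, not a proposed language change: `AutoImporter` and `auto` are invented names, and only `importlib.import_module` is a real standard-library call.

```python
import importlib


class AutoImporter(object):
    """Hypothetical helper: import a module the first time it is accessed."""

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, i.e. on first use.
        module = importlib.import_module(name)
        setattr(self, name, module)  # cache so later lookups skip the import
        return module


auto = AutoImporter()
root = auto.math.sqrt(4.0)  # "math" is imported on demand here
```

Any attribute name is treated as a module name, so a misspelled name either raises `ImportError` or silently imports whatever module happens to match, which is the confusion the reply warns about.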
[ python-Bugs-1744580 ] cvs.get_dialect() return a class object
Bugs item #1744580, was opened at 2007-06-28 05:36
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1744580&group_id=5470

Category: Python Library
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Christian Kristukat (ckkart)
Assigned to: Nobody/Anonymous (nobody)
Summary: cvs.get_dialect() return a class object

Initial Comment:
With python2.5 (and 2.6), csv.get_dialect('excel') returns a Dialect class object, in contrast to python 2.4 where an instance of csv.excel is returned. The former has only read-only attributes.

% python2.4
Python 2.4.1 (#3, Jul 28 2005, 22:08:40)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1671)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> d = csv.get_dialect("excel")
>>> d

% python
Python 2.6a0 (trunk:54264M, Mar 10 2007, 15:19:48)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> d = csv.get_dialect("excel")
>>> d
<_csv.Dialect object at 0x137fac0>

--
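When the object returned by `csv.get_dialect()` cannot be modified, one workaround is to derive a new dialect class instead of changing the registered one. The sketch below is written for current Python 3 and is not taken from the report; `SemicolonExcel` is a hypothetical name.

```python
import csv
import io


class SemicolonExcel(csv.excel):
    """Hypothetical dialect: Excel conventions with ';' as the delimiter."""
    delimiter = ";"


registered = csv.get_dialect("excel")  # look up the registered dialect
out = io.StringIO()
csv.writer(out, dialect=SemicolonExcel).writerow(["a", "b"])
line = out.getvalue()                  # the row as written with the subclass
```

Subclassing leaves the registered "excel" dialect untouched, so other code that looks it up by name still sees the default delimiter.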