Bugs item #1636950, was opened at 2007-01-16 10:56
Message generated for change (Comment added) made by amonthei
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1636950&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Andy Monthei (amonthei)
Assigned to: Nobody/Anonymous (nobody)
Summary: Newline skipped in "for line in file"

Initial Comment:
When processing huge fixed block files about 7000 bytes wide and several
hundred thousand lines long, some pairs of lines get read as one long line
with no line break when using "for line in file:". The problem is even worse
with the fileinput module: reading in five or six huge files consisting of
4.8 million records causes several hundred pairs of lines to be read as
single lines. When a newline is skipped, several more are usually skipped
within the next few hundred lines.

I have not noticed any other characters being skipped, only the line break.

O.S. Windows (5, 1, 2600, 2, 'Service Pack 2')
Python 2.5

----------------------------------------------------------------------

>Comment By: Andy Monthei (amonthei)
Date: 2007-01-20 16:53

Message:
Logged In: YES
user_id=1693612
Originator: YES

I have had no luck creating random data to reproduce the problem, which leads
me to the conclusion that it was the data itself. Using a hex editor I find
no problem with the line breaks.

The data that triggers this bug is transferred several times before it gets
to me. It originates on a Unix box, then goes to an IBM mainframe, then to my
Windows machine, with many updates along the way. It may be an EBCDIC/ASCII
conversion or possibly something to do with the mainframe to PC transfer.
Whatever it is, it's in the data itself.

The only thing that bothers me is that Java somehow is not affected by this
bad data.
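
One way to check the line breaks more systematically than by eye in a hex
editor is to tally every terminator sequence in the raw bytes. The following
is only a sketch under stated assumptions: it targets a current Python 3
interpreter rather than the Python 2.5 of this report, and tally_terminators
is a hypothetical helper name, not part of any library. It does not claim to
identify the cause of this bug. A non-zero count of bare CR or bare LF would
merely show that the file mixes terminators, which matters because readers
differ in what they accept: Java's BufferedReader.readLine, for example,
treats CR, LF, and CRLF all as line ends.

```python
def tally_terminators(path, chunk_size=8192):
    """Count CRLF, bare LF, and bare CR sequences in a file read as raw bytes.

    Hypothetical diagnostic helper; the name and return shape are
    illustrative assumptions.
    """
    counts = {"crlf": 0, "lf": 0, "cr": 0}
    pending_cr = False  # True when the previous chunk ended with a CR
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            if pending_cr:
                # Re-attach the CR held back from the previous chunk.
                chunk = b"\r" + chunk
                pending_cr = False
            if chunk.endswith(b"\r"):
                # Hold the trailing CR back until we see what follows it,
                # so a CRLF split across two chunks is still counted once.
                pending_cr = True
                chunk = chunk[:-1]
            crlf = chunk.count(b"\r\n")
            counts["crlf"] += crlf
            counts["lf"] += chunk.count(b"\n") - crlf
            counts["cr"] += chunk.count(b"\r") - crlf
    if pending_cr:
        # The file ended with a CR that was never followed by an LF.
        counts["cr"] += 1
    return counts
```

Calling tally_terminators("records.dat") (the file name is a placeholder)
returns a dict with the three counts. For a clean fixed block file with 0d0a
line breaks, "crlf" should equal the record count and the other two should
be zero.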

----------------------------------------------------------------------

Comment By: Andy Monthei (amonthei)
Date: 2007-01-18 09:34

Message:
Logged In: YES
user_id=1693612
Originator: YES

I am using open() for reading the file, no other features. I have also had
fileinput.input(fileList) compound the problem.

Each file this has happened to is a fixed block file either 6990 or 7700
bytes wide, but I think this is insignificant. When looking at the file in a
hex editor everything looks fine, and a small Java program using a buffered
reader gives me the correct line count when Python does not.

I'm sure something like fp.read(8192) might temporarily solve my problem,
but I will keep working on getting a file I can upload.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2007-01-18 03:23

Message:
Logged In: YES
user_id=89016
Originator: NO

Are you using any of the unicode reading features (i.e. codecs.EncodedFile
etc.) or are you using plain open() for reading the file?

----------------------------------------------------------------------

Comment By: Mark Roberts (mark-roberts)
Date: 2007-01-18 01:12

Message:
Logged In: YES
user_id=1591633
Originator: NO

I don't know if this helps: I spent the last little while creating and
reading random files that all (seemingly) matched the description you gave
us. None of these files failed to read properly (i.e., they had the right
number of rows, with what seemed to be the right line length, and definitely
no doubled lines).

Perusing the file source code, I found a detailed discussion of fgets vs
fgetc for finding the next line in the file. Have you tried reading the file
with fp.read(8192) or similar?

Hopefully you're able to reproduce the bug with scrubbed data (because I
couldn't construct random data to do so).

Good luck.
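
The fp.read(8192) suggestion can be turned into a cross-check: count newline
bytes in fixed-size chunks, count lines by iteration, and list any line wider
than one record. This is a minimal sketch, assuming a current Python 3
interpreter and binary mode, which is not the Python 2.5 text-mode setup in
the report. The function names and the record_width parameter are
illustrative, not taken from the tracker item.

```python
def count_lines_by_iteration(path):
    """Count lines the way "for line in file" sees them (binary mode)."""
    with open(path, "rb") as fp:
        return sum(1 for _ in fp)


def count_newlines_by_chunks(path, chunk_size=8192):
    """Count LF bytes by reading fixed-size chunks, bypassing line iteration."""
    total = 0
    with open(path, "rb") as fp:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total


def find_overlong_lines(path, record_width):
    """Yield (line_number, length) for lines wider than one record plus CRLF."""
    limit = record_width + 2  # assumes each record ends with a 2-byte 0d0a
    with open(path, "rb") as fp:
        for number, line in enumerate(fp, 1):
            if len(line) > limit:
                yield number, len(line)
```

If the two counts disagree by more than one (a final line without a trailing
newline accounts for a difference of exactly one), the iteration itself is
dropping or merging lines. If they agree but find_overlong_lines(path, 6990)
still reports entries, the merged records are already present in the bytes
on disk.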

----------------------------------------------------------------------

Comment By: Mark Roberts (mark-roberts)
Date: 2007-01-17 23:24

Message:
Logged In: YES
user_id=1591633
Originator: NO

What are the minimum and maximum widths of the lines? This problem is of
particular interest to me.

----------------------------------------------------------------------

Comment By: Andy Monthei (amonthei)
Date: 2007-01-17 15:58

Message:
Logged In: YES
user_id=1693612
Originator: YES

I cannot upload the files that trigger this because of the data that is in
them, but I am working on getting around that.

In my data, line 617391 of a fixed block file 6990 bytes wide gets read in
together with the line after it. The line break where the bug happens is
0d0a (same as the others), so I am wondering if it is a buffer issue where
the line break falls at the edge of a buffer; however, no other characters
are ever missed. The total file is 888420 lines and this happens in four
spots.

I will hopefully have a file to send soon.

----------------------------------------------------------------------

Comment By: Brett Cannon (bcannon)
Date: 2007-01-16 16:33

Message:
Logged In: YES
user_id=357491
Originator: NO

Do you happen to have a sample you could upload that triggers the bug?

----------------------------------------------------------------------

_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com