Bugs item #1636950, was opened at 2007-01-16 08:56
Message generated for change (Comment added) made by bcannon
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1636950&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.5
>Status: Closed
>Resolution: Invalid
Priority: 5
Private: No
Submitted By: Andy Monthei (amonthei)
Assigned to: Nobody/Anonymous (nobody)
Summary: Newline skipped in "for line in file"

Initial Comment:
When processing huge fixed-block files, about 7000 bytes wide and several 
hundred thousand lines long, some pairs of lines get read as one long line with 
no line break when using "for line in file:".  The problem is even worse when 
using the fileinput module: reading in five or six huge files consisting of 
4.8 million records causes several hundred pairs of lines to be read as single 
lines. When a newline is skipped, it is usually followed by several more in the 
next few hundred lines. I have not noticed any other characters being skipped, 
only the line break.

O.S. Windows (5, 1, 2600, 2, 'Service Pack 2')
Python 2.5

----------------------------------------------------------------------

>Comment By: Brett Cannon (bcannon)
Date: 2007-01-20 16:46

Message:
Logged In: YES 
user_id=357491
Originator: NO

Well, with Andy saying he can't reproduce the problem I am going to close
as invalid.

Andy, if you ever happen to be able to upload data that triggers it, then
please re-open this bug.

----------------------------------------------------------------------

Comment By: Andy Monthei (amonthei)
Date: 2007-01-20 14:53

Message:
Logged In: YES 
user_id=1693612
Originator: YES

I have had no luck creating random data to reproduce the problem, which
leads me to conclude that it was the data itself.  Using a
hex editor I find no problem with the line breaks.

The data that triggers this bug is transferred several times before it gets
to me. It originates on a Unix box, then goes to an IBM mainframe, then to
my Windows machine, with many updates along the way. It may be an
EBCDIC/ASCII conversion or possibly something to do with the mainframe to
PC transfer. Whatever it is, it's in the data itself.

The only thing that bothers me is that Java somehow is not affected by
this bad data.
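
A minimal sketch of one way to look for damage "in the data itself", written
for current Python 3 rather than the 2.5 in this report. It assumes that stray
control bytes are one possible symptom of a bad EBCDIC/ASCII conversion or
transfer; that assumption, the function name, and the set of bytes treated as
ordinary are illustrative and not taken from the report.

```python
from collections import Counter

# Assumption: tab, line feed and carriage return are the only control
# bytes that legitimately appear in the data.
EXPECTED_CONTROL = {0x09, 0x0A, 0x0D}


def count_control_bytes(path, chunk_size=8192):
    """Return a Counter of unexpected control bytes (values below 0x20)."""
    found = Counter()
    with open(path, "rb") as fp:  # binary mode: bytes are read unchanged
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            for byte in chunk:  # iterating over bytes yields ints
                if byte < 0x20 and byte not in EXPECTED_CONTROL:
                    found[byte] += 1
    return found
```

An empty result would not prove the data is clean; it would only rule out
this one kind of damage.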

----------------------------------------------------------------------

Comment By: Andy Monthei (amonthei)
Date: 2007-01-18 07:34

Message:
Logged In: YES 
user_id=1693612
Originator: YES

I am using open() for reading the file, no other features. I have also had
fileinput.input(fileList) compound the problem.  Each file that this has
happened to is a fixed block file of either 6990 or 7700 bytes wide but
this I think is insignificant. When looking at the file in a hex editor
everything looks fine and a small Java program using a buffered reader
will give me the correct line count when Python does not.

I'm sure using something like fp.read(8192) might temporarily solve my
problem, but I will keep working on getting a file I can upload.



----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2007-01-18 01:23

Message:
Logged In: YES 
user_id=89016
Originator: NO

Are you using any of the unicode reading features (i.e. codecs.EncodedFile
etc.) or are you using plain open() for reading the file?

----------------------------------------------------------------------

Comment By: Mark Roberts (mark-roberts)
Date: 2007-01-17 23:12

Message:
Logged In: YES 
user_id=1591633
Originator: NO

I don't know if this helps: I spent the last little while creating and
reading random files that all (seemingly) matched the description you gave
us.  None of these files failed to read properly (i.e., each had the right
number of rows, the line lengths looked right, and there were definitely
no doubled lines).

Perusing the file source code, I found a detailed discussion of fgets vs
fgetc for finding the next line in the file.  Have you tried reading the
file with fp.read(8192) or similar?  Hopefully you're able to reproduce
the bug with scrubbed data (because I couldn't construct random data to do
so).  Good luck.
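
A minimal sketch of the fp.read(8192) check suggested above, written for
current Python 3 rather than the 2.5 in this report. It counts line breaks
two ways, once with "for line in file" and once by reading fixed-size chunks,
so the two totals can be compared. The function names and the use of binary
mode are assumptions for illustration.

```python
def count_newlines_by_iteration(path):
    """Count lines ending in a newline using "for line in file"."""
    count = 0
    with open(path, "rb") as fp:  # binary mode: no newline translation
        for line in fp:
            if line.endswith(b"\n"):
                count += 1
    return count


def count_newlines_by_chunks(path, chunk_size=8192):
    """Count newline bytes by reading the file in fixed-size chunks."""
    count = 0
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            count += chunk.count(b"\n")
    return count
```

If the two counts ever differed for the same file, that file would be a
candidate for a reproducible test case.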

----------------------------------------------------------------------

Comment By: Mark Roberts (mark-roberts)
Date: 2007-01-17 21:24

Message:
Logged In: YES 
user_id=1591633
Originator: NO

What are the min and max widths of the lines?  This problem is of
particular interest to me.

----------------------------------------------------------------------

Comment By: Andy Monthei (amonthei)
Date: 2007-01-17 13:58

Message:
Logged In: YES 
user_id=1693612
Originator: YES

I cannot upload the files that trigger this because of the data that is
in them, but I am working on getting around that.

In my data, line 617391 of a fixed-block file 6990 bytes wide gets read
in together with the line after it.  The line break where the bug happens
is 0d0a (same as the others), so I am wondering if it is a buffer issue
where the line break falls at the edge; however, no other characters are
ever missed. The total file is 888420 lines and this happens in four
spots.

I will hopefully have a file to send soon.
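
A minimal sketch of how merged lines like the one described above could be
located in a fixed-width file, written for current Python 3 rather than the
2.5 in this report. It reports every line whose data is not exactly the
expected width. The function name is hypothetical, and reading in binary
mode is an assumption, so it may not behave the same way as the text-mode
loop in the report.

```python
def find_odd_length_lines(path, width):
    """Yield (line_number, length) for lines whose data is not `width` bytes."""
    with open(path, "rb") as fp:  # binary mode: no newline translation
        for number, line in enumerate(fp, start=1):
            data = line.rstrip(b"\r\n")  # drop trailing CR/LF bytes
            if len(data) != width:
                yield number, len(data)
```

Calling list(find_odd_length_lines(path, 6990)) on one of the files would
give the line numbers to inspect in a hex editor; a pair of merged records
would show up with a length of roughly twice the width.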

----------------------------------------------------------------------

Comment By: Brett Cannon (bcannon)
Date: 2007-01-16 14:33

Message:
Logged In: YES 
user_id=357491
Originator: NO

Do you happen to have a sample you could upload that triggers the bug?

----------------------------------------------------------------------
