Bugs item #1451466, was opened at 2006-03-16 09:21
Message generated for change (Comment added) made by josiahcarlson
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1451466&group_id=5470

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: Python Interpreter Core
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: christen (richardchristen)
Assigned to: Nobody/Anonymous (nobody)
Summary: reading very large files

Initial Comment:
I work on the human genome. I extracted words from chromosomes using a
suffix tree (written in C, compiled for 64-bit, and run on a SUN with
300 GB of RAM, since my suffix tree requires 150 GB of RAM for
chromosome 1, the largest one). This gave some files larger than 5 GB,
for example one with 163763326 lines for chr 4, the one presently being
analyzed.

Using Python 2.4.2 on a 32-bit Windows computer (1.5 GB RAM), I read
this file line by line with either

    for li in file:
        do something

or

    while li!='':
        li=file.readline()

I ran into problems, seemingly around the 4 GB boundary. After the
first problematic line was read, for some lines (not all), li held the
correct content but also the first word of the next line (see below).

As a result, a simple

    file1=open('1')
    file2=open('2','w')
    li=file1.readline()
    while li!='':
        file2.write(li)
        li=file1.readline()

produced a second file of only 163754385 lines. The problem lines were
"seemingly random", i.e. not in a row, with the last line being OK.

The same code on the same file, but on my OS X 64-bit dual-core
machine, went fine, despite the use of the default Python 2.2.3 and
"file Python" showing that it is a Mach-O executable ppc, i.e. a
32-bit app.

Everything was run from the command line.

The first file looks like this:
...
TCAGCCACAGCAGAAAGTGA:\t33240 551212 751185
TCAGCCACAGCAGAAAGTGC:\t131324047
TCAGCCACAGCACTGTGTTA:\t61641912
....

The second file contains lines like this:
TCAGCCACAGCAGAAAGTGC:\t131324047TCAGCCACAGCAGAAGAAGA:
which is 'first line' + 'first word of next line'.

PS1: I had no problem reading the big file with UEdit on the Windows
machine.
Therefore the OS itself is not the problem (also, I transferred the big
file from the Windows machine to the Mac; if the file itself had had
problems, it would have been corrupted on the Mac too).

PS2: I tried Python 2.3.5 on Windows and got the same problem.

PS3: If needed, I can run the same test on a similar file for
chromosome 8, which is slightly below the 4 GB limit (3.99 GB).

PS4: I think I remember having done a similar parsing on a
single-CPU Athlon 64 Linux machine a month ago, with no trouble.

----------------------------------------------------------------------

Comment By: Josiah Carlson (josiahcarlson)
Date: 2006-03-17 16:35

Message:
Logged In: YES
user_id=341410

Sounds like an issue with file objects on certain platforms not being
able to handle offsets of 2**32 or larger. I personally have read and
written files > 4 GB on the Windows platform, but I seem to recall
having issues on 32-bit Linux some time in the past.

----------------------------------------------------------------------
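
One possible way to narrow the problem down, offered only as a rough
sketch and not as something proposed in the thread: count newline bytes
with fixed-size binary reads and compare that number with the count
obtained by iterating line by line, as the report does. If the two
differ on the affected platform, the line-reading path is dropping or
merging line breaks. The function names count_lines_binary and
count_lines_iter are hypothetical, and the code is written for a
current Python rather than the 2.4 interpreter in the report.

```python
def count_lines_binary(path, chunk_size=1 << 20):
    """Count b'\\n' bytes using fixed-size binary reads.

    No line-oriented reading is involved, so this number does not
    depend on how readline() or file iteration find line ends.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            count += chunk.count(b"\n")
    return count


def count_lines_iter(path):
    """Count lines by iterating over the file in text mode,
    mirroring the 'for li in file' loop from the report."""
    count = 0
    with open(path, "r") as f:
        for _ in f:
            count += 1
    return count

# Note: if the last line has no trailing newline, count_lines_iter
# returns one more than count_lines_binary; that alone is not a bug.
```

For a file whose last line ends with a newline, the two functions
should return the same number; a mismatch that appears only on files
larger than 2**32 bytes (4294967296) would be consistent with the
offset problem suggested above.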