christen added the comment:
Hi Guido
It is not the end of the file that goes unread (see also below).
I found out about this about a year ago, when I was parsing very large
files resulting from "blast" on the human genome.
My parser choked after 4 GB, well before the end of the file: one line
was missing and my acc=li[x:y] ended up with an error, because acc was
never filled...
This was kind of strange because it had never happened on my
Linux box.
I opened the file (which I had created myself) with an editor that could
show hex code: the proper line was there and all right.
If I remember correctly, I modified my code to see better what was going on:
in fact the missing line had been concatenated to the previous line
despite the proper existence of the end of line (hex code was OK). See
also below.
I forgot about it because nobody replied to my mails, and I thought it
was possibly related to 32-bit Windows. I recently moved to 64-bit Windows
(Windows has the best drivers for SQL databases) and forgot about the bug
until I ran into it again. I then decided to try Python 3k; it reads files
larger than 4 GB with no trouble but is very slow, both in reading and
writing files.
The following code produces a file either smaller or larger than 4 GB,
depending on which fichout.write line is commented out.
Both files have the same number of lines, but the one larger than 4 GB is
not read completely under Windows (32- or 64-bit).
I have no such problem on Linux or BSD (Mac).
Python 3k on Windows reads both files correctly, but is very, very slow (change
xrange to range; I guess it is presumptuous of me to advise you about that :-).
best
Richard
import sys
print(sys.version_info)
import time
print (time.strftime('%Y-%m-%d %H:%M:%S'))
liste=[]
start = time.time()
# Write phase: one number per line, padded or not depending on the variant
fichout=open('test.txt','w')
for i in xrange(85014961):
    if i%5000000==0 and i>0:
        print (i,time.time()-start)
    fichout.write(str(i)+' '*59+'\n') #big file
    #fichout.write(str(i)+'\n') #small file, same number of lines
fichout.flush()
fichout.close()
print ('total lines written ',i)
print (i,time.time()-start)
print ('*'*50)
# Read phase: iterate over the file in text mode and count the lines seen
fichin=open('test.txt')
start3 = time.time()
for i,li in enumerate(fichin):
    if i%5000000==0 and i>0:
        print (i,time.time()-start3)
fichin.close()
print ('total lines read ',i)
print(time.time()-start)
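
As a cross-check on the "smaller or larger than 4 GB" claim, the size each variant should have can be estimated without writing anything. This is only a sketch: expected_size is a hypothetical helper, not part of the script above, and it assumes every line is str(i), then the padding, then a 1-byte ('\n') or 2-byte ('\r\n', Windows text mode) line ending.

```python
def expected_size(n_lines, pad, eol_len):
    """Bytes needed for lines str(i) + ' ' * pad + EOL, for i in range(n_lines).

    Hypothetical helper for illustration; eol_len is 1 for '\\n' and 2 for '\\r\\n'.
    """
    total_digits = 0
    width = 1   # number of decimal digits currently being counted
    low = 0     # smallest integer not yet counted (0 counts as one digit)
    while low < n_lines:
        high = min(10 ** width, n_lines)      # exclusive upper bound for this width
        total_digits += (high - low) * width
        low = high
        width += 1
    return total_digits + n_lines * (pad + eol_len)


N_LINES = 85014961      # same count as xrange(85014961) in the script
FOUR_GIB = 2 ** 32

for label, pad in (("big file", 59), ("small file", 0)):
    for eol_name, eol_len in (("LF", 1), ("CRLF", 2)):
        size = expected_size(N_LINES, pad, eol_len)
        print(label + " " + eol_name + ": " + str(size) + " bytes, over 4 GiB: " + str(size > FOUR_GIB))
```

Under these assumptions the padded variant is well above 2**32 bytes and the unpadded one well below it, whichever line ending is used.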
> Richard, can you somehow view the end of the file to see what its last
> lines actually are? It should end like this:
>
> 85014951
> 85014952
> 85014953
> 85014954
> 85014955
> 85014956
> 85014957
> 85014958
> 85014959
> 85014960
>
>
Viewed in a text editor, the end of the file reads:
85014944
85014945
85014946
85014947
85014948
85014949
85014950
85014951
85014952
85014953
85014954
85014955
85014956
85014957
85014958
85014959
85014960
Windows, Python 2.5, with the following added to the read loop:
    if i>85014940:
        print i, li.strip()
prints:
(2, 5, 0, 'final', 0)
2007-09-11 07:58:47
(5000000, 2.6720001697540283)
(10000000, 5.375)
(15000000, 8.0320000648498535)
(20000000, 10.703000068664551)
(25000000, 13.375)
(30000000, 16.047000169754028)
(35000000, 18.703000068664551)
(40000000, 21.360000133514404)
(45000000, 24.032000064849854)
(50000000, 26.687999963760376)
(55000000, 29.360000133514404)
(60000000, 32.032000064849854)
(65000000, 34.703000068664551)
(70000000, 37.407000064849854)
(75000000, 40.094000101089478)
(80000000, 42.797000169754028)
(85000000, 45.485000133514404)
85014941 85014951
85014942 85014952
85014943 85014953
85014944 85014954
85014945 85014955
85014946 85014956
85014947 85014957
85014948 85014958
85014949 85014959
85014950 85014960
==> the line index lags the line content by 10, so the missing lines are from within the file
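
Since every line of test.txt carries its own index, the first place where the reader goes wrong can also be located directly. The following is a minimal sketch, assuming the file layout produced by the script above; first_mismatch is a hypothetical helper name.

```python
def first_mismatch(lines):
    """Return (index, line) for the first line whose content is not exactly
    its own index, or None if every line matches.

    Hypothetical helper: assumes each line holds one integer equal to its
    0-based position, optionally followed by padding spaces.
    """
    for i, li in enumerate(lines):
        fields = li.split()
        # Two merged lines show up as two fields; a lost line shifts the numbers
        if len(fields) != 1 or int(fields[0]) != i:
            return i, li
    return None


# Tiny in-memory example: lines "2" and "3" have been merged into one
sample = ["0\n", "1 \n", "2   3\n", "4\n"]
result = first_mismatch(sample)
print(result)
```

On the real file this would be called as first_mismatch(open('test.txt')), and it stops at the first merged line instead of scanning to the end.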
Now add this to the loop: if len(li)>80: print li.strip()
(2, 5, 0, 'final', 0)
2007-09-11 08:08:16
(5000000, 3.1559998989105225)
(10000000, 6.3280000686645508)
(15000000, 9.4839999675750732)
(20000000, 12.655999898910522)
(25000000, 15.843999862670898)
(30000000, 19.016000032424927)
(35000000, 22.187999963760376)
(40000000, 25.358999967575073)
(45000000, 28.530999898910522)
(50000000, 31.703000068664551)
(55000000, 34.858999967575073)
(60000000, 38.030999898910522)
* 62410138
62410139 *
* 62414887
62414888 *
* 62415540
62415541 *
* 62420289
62420290 *
* 62420942
62420943 *
* 62421595
62421596 *
* 62422248
62422249 *
* 62422901
62422902 *
* 62427650
62427651 *
* 62428303
62428304 *
(65000000, 41.233999967575073)
(70000000, 44.437999963760376)
(75000000, 47.625)
(80000000, 50.828000068664551)
(85000000, 54.016000032424927)
('total lines read ', 85014950)
54.0309998989
==> the end of line was not read for 10 lines in the middle of the file!
The file system is NTFS.
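
One way to separate "the data on disk is wrong" from "the text-mode reader is wrong" is to count newline bytes in binary mode and compare that with the text-mode line count. This is a minimal sketch written for a current Python (the with statement is not available by default in 2.5); count_newlines_binary and the 1 MiB chunk size are assumptions made for illustration.

```python
import os
import tempfile


def count_newlines_binary(path, chunk_size=1 << 20):
    """Count b'\\n' bytes by reading the file in fixed-size binary chunks.

    Binary mode bypasses any text-mode newline translation, so the result
    reflects what is actually stored on disk.
    """
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count


# Small self-check on a throwaway file with Windows-style line endings
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'0\r\n1\r\n2\r\n')
demo_count = count_newlines_binary(demo_path)
os.remove(demo_path)
print(demo_count)
```

If the count for test.txt in binary mode is 85014961 while the text-mode loop sees fewer lines, the file itself is intact and the loss happens in the reader, which would match what the hex editor showed.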
best
Richard
__________________________________
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1142>
__________________________________