Gary Herron wrote:
[EMAIL PROTECTED] wrote:
On Jun 2, 2:08 am, "kalakouentin" <[EMAIL PROTECTED]> wrote:

 Do you know a way to actually load my data in a more
"batch-like" way, so that I can avoid the constant line-by-line reading?

If your files will fit in memory, you can just do

text = file.readlines()

and Python will read the entire file into a list of strings named
'text,' where each item in the list corresponds to one 'line' of the
file.
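For instance, a minimal self-contained version of that (the filename below is just a placeholder, not from the original posts):

f = open('data.txt')                  # placeholder filename
text = f.readlines()                  # whole file read at once, one string per line
f.close()
print("%d lines read" % len(text))    # each item still ends with its '\n'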

No, that won't help. That has to do *all* the same work (reading blocks and finding line endings) as the iterator, PLUS it has to allocate and build a list.
Better to just use the iterator.

for line in file:
 ...

Actually, this *can* be much slower. Suppose I want to search a file to see whether a substring is present.

st = "some substring that is not actually in the file"
f = <50 MB log file>

Method 1:

for i in file(f):
    if st in i:
        break

--> 0.472416 seconds

Method 2:

Read whole file:

fh = file(f)
rl = fh.read()
fh.close()

--> 0.098834 seconds

"st in rl" test --> 0.037251 (total: .136 seconds)

Method 3:

mmap the file:

import mmap

fh = file(f)   # reopen the file; fh was closed after Method 2
mm = mmap.mmap(fh.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ)

"st in mm" test --> 3.589938 seconds (<-- see my post the other day)

mm.find(st) --> 0.186895 seconds
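(The posts above don't show the timing harness itself; the sketch below is one way numbers like these could be reproduced with time.time(). The filename is a placeholder, and the access=ACCESS_READ / encode() details are my additions, not the poster's exact code.)

import mmap
import time

st = "some substring that is not actually in the file"
f = "big.log"                          # placeholder for the 50 MB log file

# Method 1: iterate over the file line by line.
t0 = time.time()
fh = open(f)
for line in fh:
    if st in line:
        break
fh.close()
print("iterate:     %.6f s" % (time.time() - t0))

# Method 2: read the whole file, then a plain substring test on the string.
t0 = time.time()
fh = open(f)
rl = fh.read()
fh.close()
print("read():      %.6f s" % (time.time() - t0))

t0 = time.time()
found = st in rl
print("'in' test:   %.6f s" % (time.time() - t0))

# Method 3: mmap the file and search it with find().
fh = open(f, 'rb')
mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
t0 = time.time()
pos = mm.find(st.encode())             # bytes needle; required under Python 3
print("mmap find(): %.6f s" % (time.time() - t0))
mm.close()
fh.close()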

Summary:

If you can afford the memory, it can be more efficient (more than 3 times faster in this example) to read the whole file into memory and process it in one go.

Mmapping the file and processing it at once is roughly as fast (I didn't measure the difference carefully), but it has the advantage that any parts of the file you never touch are not faulted into memory. You could also play more games and mmap chunks at a time to limit memory use, though you'd have to be careful with mappings that don't line up with record boundaries (see the sketch below).
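On the "mmap chunks at a time" point, here is a rough sketch (my own illustration, not code from the post; the function name and window size are made up) of mapping one fixed-size window at a time, with a small overlap so a match spanning a window boundary isn't missed:

import mmap
import os

def chunked_mmap_find(path, needle, window=16 * 1024 * 1024):
    # Offsets passed to mmap must be multiples of ALLOCATIONGRANULARITY,
    # so round the window size to that granularity.
    gran = mmap.ALLOCATIONGRANULARITY
    window = max(window // gran, 1) * gran
    overlap = len(needle) - 1          # so a match straddling a boundary isn't missed
    size = os.path.getsize(path)

    f = open(path, 'rb')
    try:
        pos = 0
        while pos < size:
            length = min(size - pos, window + overlap)
            mm = mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ, offset=pos)
            try:
                hit = mm.find(needle)
                if hit != -1:
                    return pos + hit   # absolute byte offset of the match
            finally:
                mm.close()
            pos += window
        return -1
    finally:
        f.close()

Usage would be something like chunked_mmap_find(f, st.encode()). The window size is rounded because the offset handed to mmap has to be a multiple of mmap.ALLOCATIONGRANULARITY, and the len(needle) - 1 overlap is what keeps a window boundary from cutting a match (or a record) in half.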

Kris