fast text processing

2006-02-21 Thread Alexis Gallagher
(I tried to post this yesterday but I think my ISP ate it. Apologies if 
this is a double-post.)

Is it possible to do very fast string processing in Python? My 
bioinformatics application needs to scan very large ASCII files (80GB+), 
compare adjacent lines, and conditionally do some further processing. I 
believe the disk I/O is the main bottleneck, so for now that's what I'm 
optimizing. What I have now is roughly as follows (on Python 2.3.5).

filehandle = open("data", 'r', buffering=1000)

lastLine = filehandle.readline()

for currentLine in filehandle.readlines():

    lastTokens = lastLine.strip().split(delimiter)
    currentTokens = currentLine.strip().split(delimiter)

    lastGeno = extract(lastTokens[0])
    currentGeno = extract(currentTokens[0])

    # prepare for next iteration
    lastLine = currentLine

    if lastGeno == currentGeno:
        table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))

So on every iteration I'm processing mutable strings -- this seems 
wrong. What's the best way to speed this up? Can I switch to some fast 
byte-oriented immutable string library? Are there optimizing compilers? 
Are there better ways to prep the file handle?

Perhaps this is a job for C, but I am of that soft generation which 
fears memory management. I'd need to learn how to do buffered reading in 
C, how to wrap the C in Python, and how to let the C call back into 
Python to invoke markEquivalent(). It sounds painful. I _have_ done some 
benchmark comparisons of just the underlying line-based file reading 
against a Common Lisp version, but I doubt I'm using the optimal 
construct in either language, so I hesitate to trust my results; in any 
case the interlanguage bridge would make them even harder to interpret.

Much obliged for any help,
Alexis


Re: fast text processing

2006-02-21 Thread Alexis Gallagher
Steve,

First, many thanks!

Steve Holden wrote:
> Alexis Gallagher wrote:
>>
>> filehandle = open("data",'r',buffering=1000)
> 
> This buffer size seems, shall we say, unadventurous? It's likely to slow 
> things down considerably, since the filesystem is probably going to 
> naturally want to use a rather larger value. I'd suggest a 64KB minimum.

Good to know. I should have dug into the docs more deeply. Somehow I 
thought the buffering argument was measured in lines, not bytes.
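
For the record, here's the sort of change I take that to mean (the 
exact size is my guess; presumably anything 64KB or larger is 
reasonable):

# open with a 64KB buffer instead of 1000 bytes
filehandle = open("data", 'r', buffering=64 * 1024)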

>> for currentLine in filehandle.readlines():
>>
> Note that this is going to read the whole file into (virtual) memory 
> before entering the loop. I somehow suspect you'd rather avoid this if 
> you could. I further suspect your testing has been with smaller files 
> than 80GB ;-). You might want to consider
> 

Oops! Thanks again. I thought that readlines() was the generator form, 
based on the docstring comments about the deprecation of xreadlines().
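
For anyone following along, I assume the lazy version looks roughly 
like this (untested sketch; extract(), delimiter, and table are from my 
original post):

filehandle = open("data", 'r', buffering=64 * 1024)
lastLine = filehandle.readline()

# iterating the file object directly yields one line at a time,
# so the whole 80GB file never sits in memory at once
for currentLine in filehandle:
    # ... same loop body as before ...
    lastLine = currentLine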

>> So on every iteration I'm processing mutable strings -- this seems 
>> wrong. What's the best way to speed this up? Can I switch to some fast 
>> byte-oriented immutable string library? Are there optimizing 
>> compilers? Are there better ways to prep the file handle?
>>
> I'm sorry but I am not sure where the mutable strings come in. Python 
> strings are immutable anyway. Well-known for it.

I misspoke. I think I was mixing this up with the issue of 
object-creation overhead for all of the string handling in general. Is 
that a real bottleneck for string processing in Python, or is this a 
hangover from my Java days? I would have thought that dumping the 
standard string-processing libraries in favor of byte manipulation 
would have been one of the biggest wins.

> Of course you leave us in the dark about the nature of 
> table.markEquivalent as well.

markEquivalent() implements union-find (a.k.a. uptrees) to build the 
equivalence classes. Optimising that was going to be my next task.
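
In case it helps, here is roughly the structure I have in mind (a 
minimal union-find sketch with path compression; I haven't posted the 
actual table class, so this is only illustrative):

class EquivalenceTable:
    """Union-find (disjoint sets) over integer ids."""

    def __init__(self):
        self.parent = {}  # id -> parent id; a root maps to itself

    def find(self, x):
        # walk up to the root, then compress the path behind us
        root = self.parent.setdefault(x, x)
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def markEquivalent(self, a, b):
        # merge the equivalence classes containing a and b
        self.parent[self.find(a)] = self.find(b)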

I feel a bit silly for missing the double-processing of everything. 
Thanks for pointing that out. And I will check out the Biopython package.
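
Putting the fixes together, I believe the loop should tokenize each 
line exactly once, something like this (untested sketch; extract(), 
delimiter, and table are as in my original post):

filehandle = open("data", 'r', buffering=64 * 1024)

# parse the first line once, outside the loop
lastTokens = filehandle.readline().strip().split(delimiter)
lastGeno = extract(lastTokens[0])

for currentLine in filehandle:
    currentTokens = currentLine.strip().split(delimiter)
    currentGeno = extract(currentTokens[0])

    if lastGeno == currentGeno:
        table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))

    # carry the parsed form forward instead of re-parsing the line
    lastTokens, lastGeno = currentTokens, currentGeno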

I'm still curious whether optimizing compilers are worth examining. For 
instance, I saw Pyrex and Psyco mentioned in earlier threads. I'm 
guessing that both the tokenizing and the uptree implementation are 
good candidates for one of those tools, once I shake out these 
algorithmic problems.
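
If anyone else finds this thread in the archives: as I understand it, 
Psyco is the cheaper experiment, since it only takes a couple of lines 
(Pyrex requires writing a separate .pyx module and compiling it):

import psyco
psyco.full()  # JIT-specialize everything; psyco.bind(func) targets one function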


alexis