Alexis Gallagher wrote:
> (I tried to post this yesterday but I think my ISP ate it. Apologies if
> this is a double-post.)
>
> Is it possible to do very fast string processing in python? My
> bioinformatics application needs to scan very large ASCII files (80GB+),
> compare adjacent lines, and conditionally do some further processing. I
> believe the disk i/o is the main bottleneck so for now that's what I'm
> optimizing. What I have now is roughly as follows (on python 2.3.5).
>
> filehandle = open("data",'r',buffering=1000)

This buffer size seems, shall we say, unadventurous? It's likely to slow
things down considerably, since the filesystem is probably going to
naturally want to use a rather larger value. I'd suggest a 64k minimum.
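Since you say disk i/o is the bottleneck, it's worth measuring rather
than guessing. A quick timing harness along these lines (the function
name is mine, invented for illustration) would let you compare buffer
sizes empirically:

    import time

    def time_scan(path, bufsize):
        # Just read every line and report the elapsed wall-clock time
        start = time.time()
        f = open(path, 'r', bufsize)
        for line in f:
            pass
        f.close()
        return time.time() - start

    for size in (1000, 8*1024, 64*1024, 1024*1024):
        print size, time_scan("data", size)

Bear in mind the OS cache will favour later runs, so test against
different files (or a cold cache) if you want honest numbers.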
> lastLine = filehandle.readline()

I'd suggest

    lastTokens = filehandle.readline().strip().split(delimiter)

here. You have no need of the line other than to split it into tokens.

> for currentLine in filehandle.readlines():

Note that this is going to read the whole file into (virtual) memory
before entering the loop. I somehow suspect you'd rather avoid this if
you could. I further suspect your testing has been with smaller files
than 80GB ;-). You might want to consider

    for currentLine in filehandle:

as an alternative. This uses the file object's own iterator to produce
the next line each time round the loop.

> lastTokens = lastLine.strip().split(delimiter)

The line above goes away if you adopt the loop initialization suggestion
above. Otherwise you are splitting each line twice, once as the current
line and then again as the last line.

> currentTokens = currentLine.strip().split(delimiter)
>
> lastGeno = extract(lastTokens[0])
> currentGeno = extract(currentTokens[0])

If the extract() operation is stateless (in other words, if it always
produces the same output for a given input) then again you are
unnecessarily repeating yourself here by running extract() on the same
data, once as the current first token and then again as the last first
token (if you see what I mean).

I might also observe that you seem to expect only two tokens per line.
If this is invariably the case then you might want to consider writing
an unpacking assignment instead, such as

    cToken0, cToken1 = currentLine.strip().split(delimiter)

to save the indexing. Not a big deal, though, and it *will* break if you
have more than one delimiter in a line, as the interpreter won't then
know what to do with the third and subsequent elements.

> # prepare for next iteration
> lastLine = currentLine

Of course now you are going to strip and split the same line all over
again when you loop around. You should now just be able to say

    lastTokens = currentTokens

instead.

> if lastGeno == currentGeno:
>     table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))
>
> So on every iteration I'm processing mutable strings -- this seems
> wrong. What's the best way to speed this up? Can I switch to some fast
> byte-oriented immutable string library? Are there optimizing compilers?
> Are there better ways to prep the file handle?

I'm sorry, but I am not sure where the mutable strings come in: Python
strings are immutable anyway. Well-known for it. It might be a slight
problem that you are creating a second terminator-less copy of each
line, but that's probably inevitable.

Of course you leave us in the dark about the nature of
table.markEquivalent() as well. Depending on the efficiency of the
extract() operation you might want to consider short-circuiting the loop
if the two tokens have already been marked as equivalent. That might be
a big win or not, depending on relative efficiency.
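That short-circuit might look something like the fragment below. Note
that isEquivalent() is a name I've invented for illustration, since we
haven't seen table's interface; substitute whatever membership test your
table actually supports:

    if lastGeno == currentGeno:
        a = int(lastTokens[1])
        b = int(currentTokens[1])
        # Only do the marking work for pairs we haven't already recorded
        if not table.isEquivalent(a, b):
            table.markEquivalent(a, b)

Whether this pays off depends on how expensive markEquivalent() is
relative to the lookup, so it's worth timing both ways.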
> Perhaps this is a job for C, but I am of that soft generation which
> fears memory management. I'd need to learn how to do buffered reading
> in C, how to wrap the C in python, and how to let the C call back into
> python to call markEquivalent(). It sounds painful. I _have_ done some
> benchmark comparisons of only the underlying line-based file reading
> against a Common Lisp version, but I doubt I'm using the optimal
> construct in either language so I hesitate to trust my results, and
> anyway the interlanguage bridge will be even more obscure in that case.

Probably the biggest gain will be in simply not reading the whole file
into memory by calling its .readlines() method. Summarising, consider
something more like:

    filehandle = open("data", 'r', buffering=64*1024)
    # You could also try just leaving the buffering spec out
    lastTokens = filehandle.readline().strip().split(delim)
    lastGeno = extract(lastTokens[0])
    for currentLine in filehandle:
        currentTokens = currentLine.strip().split(delim)
        currentGeno = extract(currentTokens[0])
        if lastGeno == currentGeno:
            table.markEquivalent(int(lastTokens[1]),
                                 int(currentTokens[1]))
        lastGeno = currentGeno
        lastTokens = currentTokens

> Much obliged for any help,
> Alexis

Hope this helps. One last thought: the bioPython package will
potentially save you huge amounts of time. The guys who developed and
maintain it have done a lot of genome-slinging, and they appear to know
what they are doing. They may have already solved several problems you
are struggling with.

regards
 Steve
--
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC                      www.holdenweb.com
PyCon TX 2006                  www.python.org/pycon/
--
http://mail.python.org/mailman/listinfo/python-list