(I tried to post this yesterday but I think my ISP ate it. Apologies if this is a double-post.)
Is it possible to do very fast string processing in Python? My bioinformatics application needs to scan very large ASCII files (80 GB+), compare adjacent lines, and conditionally do some further processing. I believe disk I/O is the main bottleneck, so for now that's what I'm optimizing.

What I have now is roughly as follows (on Python 2.3.5):

    filehandle = open("data", 'r', buffering=1000)
    lastLine = filehandle.readline()
    for currentLine in filehandle.readlines():
        lastTokens = lastLine.strip().split(delimiter)
        currentTokens = currentLine.strip().split(delimiter)
        lastGeno = extract(lastTokens[0])
        currentGeno = extract(currentTokens[0])
        # prepare for next iteration
        lastLine = currentLine
        if lastGeno == currentGeno:
            table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))

So on every iteration I'm building and throwing away a pile of temporary string objects -- this seems wrong. What's the best way to speed it up? Can I switch to some fast byte-oriented string library? Are there optimizing compilers? Are there better ways to prep the file handle?

Perhaps this is a job for C, but I am of that soft generation which fears memory management. I'd need to learn how to do buffered reading in C, how to wrap the C in Python, and how to let the C call back into Python to invoke markEquivalent(). It sounds painful.

I _have_ done some benchmark comparisons of just the underlying line-based file reading against a Common Lisp version, but I doubt I'm using the optimal construct in either language, so I hesitate to trust my results -- and anyway the interlanguage bridge would be even more obscure in that case.

Much obliged for any help,

Alexis
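
P.S. In case it helps to see something concrete, here is a stripped-down, runnable sketch of the loop above. The extract() function, the delimiter, and the table object are stand-ins for my real code; I've also swapped readlines() for plain iteration over the file object here, so the sketch doesn't try to hold the whole file in memory (whether that's actually the fastest way to prep the file handle is part of my question).

    # Minimal, self-contained sketch of the adjacent-line comparison loop.
    # extract(), DELIMITER, and EquivalenceTable are placeholders for the
    # real bioinformatics code; only the file-scanning skeleton matters here.

    DELIMITER = "\t"

    def extract(token):
        # stand-in for the real genotype extraction
        return token

    class EquivalenceTable:
        def __init__(self):
            self.pairs = []
        def markEquivalent(self, a, b):
            self.pairs.append((a, b))

    def scan(path, table):
        filehandle = open(path, "r")
        lastLine = filehandle.readline()
        # iterate over the file object directly instead of readlines(),
        # so only one line at a time needs to be held in memory
        for currentLine in filehandle:
            lastTokens = lastLine.strip().split(DELIMITER)
            currentTokens = currentLine.strip().split(DELIMITER)
            lastGeno = extract(lastTokens[0])
            currentGeno = extract(currentTokens[0])
            if lastGeno == currentGeno:
                table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))
            # prepare for next iteration
            lastLine = currentLine
        filehandle.close()

    if __name__ == "__main__":
        table = EquivalenceTable()
        scan("data", table)
        print len(table.pairs), "equivalent pairs"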