Alexis Gallagher wrote:
> Steve,
>
> First, many thanks!
>
> Steve Holden wrote:
>> Alexis Gallagher wrote:
>>>
>>> filehandle = open("data",'r',buffering=1000)
>>
>> This buffer size seems, shall we say, unadventurous? It's likely to
>> slow things down considerably, since the filesystem is probably going
>> to naturally want to use a rather larger value. I'd suggest a 64k minimum.
>
> Good to know. I should have dug into the docs deeper. Somehow I thought
> it listed lines not bytes.
>
>>> for currentLine in filehandle.readlines():
>>>
>> Note that this is going to read the whole file into (virtual) memory
>> before entering the loop. I somehow suspect you'd rather avoid this if
>> you could. I further suspect your testing has been with smaller files
>> than 80GB ;-). You might want to consider
>>
>
> Oops! Thanks again. I thought that readlines() was the generator form,
> based on the docstring comments about the deprecation of xreadlines().
>
>>> So on every iteration I'm processing mutable strings -- this seems
>>> wrong. What's the best way to speed this up? Can I switch to some
>>> fast byte-oriented immutable string library? Are there optimizing
>>> compilers? Are there better ways to prep the file handle?
>>>
>> I'm sorry but I am not sure where the mutable strings come in. Python
>> strings are immutable anyway. Well-known for it.
>
> I misspoke. I think I was mixing this up with the issue of object-creation
> overhead for all of the string handling in general. Is this a bottleneck
> to string processing in Python, or is this a hangover from my Java days?
> I would have thought that dumping the standard string processing
> libraries in favor of byte manipulation would have been one of the
> biggest wins.
>
>> Of course you leave us in the dark about the nature of
>> table.markEquivalent as well.
>
> markEquivalent() implements union-find (a.k.a. up-trees) to generate
> equivalence classes. Optimising that was going to be my next task.
>
> I feel a bit silly for missing the double-processing of everything.
> Thanks for pointing that out. And I will check out the biopython package.
>
> I'm still curious whether optimizing compilers are worth examining. For
> instance, I saw Pyrex and Psyco mentioned in earlier threads. I'm
> guessing that both this tokenizing and the up-tree implementation sound
> like good candidates for one of those tools, once I shake out these
> algorithmic problems.
>
>
> alexis
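Putting the two suggestions above together -- iterating over the file
object lazily with a larger buffer instead of readlines(), and
implementing markEquivalent() as union-find with path compression -- a
minimal sketch might look like the following. The EquivalenceTable name
and the assumption that the first two whitespace-separated fields are
the IDs to merge are placeholders, not Alexis's actual code.

class EquivalenceTable:
    """Union-find over up-trees, with path compression and union by rank."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def find(self, item):
        # Unseen items start as their own singleton class.
        if item not in self.parent:
            self.parent[item] = item
            self.rank[item] = 0
            return item
        # Walk up to the root, then compress the path behind us.
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def markEquivalent(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Union by rank keeps the trees shallow.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1


table = EquivalenceTable()

# 64 KB buffer as suggested; iterating over the file object itself
# yields one line at a time instead of slurping all 80 GB.
with open("data", "r", buffering=64 * 1024) as filehandle:
    for currentLine in filehandle:
        fields = currentLine.split()
        if len(fields) >= 2:           # guessed record layout
            table.markEquivalent(fields[0], fields[1])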
When your problem is I/O bound there is almost nothing that can be done
to speed it up without some sort of refactoring of the input data
itself. Python reads bytes off a hard drive just as fast as any compiled
language. A good test is to copy the file and measure the time: you
can't make your program run any faster than a copy of the file itself
without making hardware changes (e.g. RAID arrays, etc.).

You might also want to take a look at the csv module. Reading lines and
splitting on delimiters is almost always handled well by csv.

-Larry Bates
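For reference, a rough sketch of both of Larry's suggestions, assuming a
tab-delimited file; the "data.copy" destination and the delimiter are
guesses rather than anything stated in the thread.

import csv
import shutil
import time

# I/O baseline: time a straight copy of the file. If the processing
# loop is not much slower than this, the job is I/O bound and
# Python-level tuning will not buy much.
start = time.time()
shutil.copyfile("data", "data.copy")   # hypothetical destination path
print("copy took %.1f seconds" % (time.time() - start))

# Let csv do the line splitting; the tab delimiter is a guess --
# use whatever the 80 GB file actually contains.
with open("data", newline="") as filehandle:
    for fields in csv.reader(filehandle, delimiter="\t"):
        pass  # the tokenized fields would feed markEquivalent() here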