Chris Lasher wrote: > Since the file I'm working with contains tens of thousands of these > records, I believe I need to find a way to hash this file such that I > can retrieve the respective sequence more quickly than I could by > parsing through the file request-by-request. However, I'm very new to > Python and am still very low on the learning curve for programming and > algorithms in general; while I'm certain there are ubiquitous > algorithms for this type of problem, I don't know what they are or > where to look for them. So I turn to the gurus and accost you for help > once again. :-) If you could help me figure out how to code a solution > that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to > keep it in Python only, even though I know interaction with a > relational database would provide the fastest method--the group I'm > trying to write this for does not have access to a RDBMS.)
keeping an index in memory might be reasonable. the following class creates an index file by scanning the FASTA file, and uses the "marshal" module to save it to disk. if the index file already exists, it's used as is. to regenerate the index, just remove the index file, and run the program again. import os, marshal class FASTA: def __init__(self, file): self.file = open(file) self.checkindex() def __getitem__(self, key): try: pos = self.index[key] except KeyError: raise IndexError("no such item") else: f = self.file f.seek(pos) header = f.readline() assert ">" + header + "\n" data = [] while 1: line = f.readline() if not line or line[0] == ">": break data.append(line) return data def checkindex(self): indexfile = self.file.name + ".index" try: self.index = marshal.load(open(indexfile, "rb")) except IOError: print "building index..." index = {} # scan the file f = self.file f.seek(0) while 1: pos = f.tell() line = f.readline() if not line: break if line[0] == ">": # save offset to header line header = line[1:].strip() index[header] = pos # save index to disk f = open(indexfile, "wb") marshal.dump(index, f) f.close() self.index = index db = FASTA("myfastafile.dat") print db["CW127_A02"] ['TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAG... tweak as necessary. </F> -- http://mail.python.org/mailman/listinfo/python-list