On Thu, Jan 13, 2005 at 12:19:49AM +0100, Fredrik Lundh wrote: > Chris Lasher wrote: > > > Since the file I'm working with contains tens of thousands of these > > records, I believe I need to find a way to hash this file such that I > > can retrieve the respective sequence more quickly than I could by > > parsing through the file request-by-request. However, I'm very new to > > Python and am still very low on the learning curve for programming and > > algorithms in general; while I'm certain there are ubiquitous > > algorithms for this type of problem, I don't know what they are or > > where to look for them. So I turn to the gurus and accost you for help > > once again. :-) If you could help me figure out how to code a solution > > that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to > > keep it in Python only, even though I know interaction with a > > relational database would provide the fastest method--the group I'm > > trying to write this for does not have access to a RDBMS.) > > keeping an index in memory might be reasonable. the following class > creates an index file by scanning the FASTA file, and uses the "marshal" > module to save it to disk. if the index file already exists, it's used as is. > to regenerate the index, just remove the index file, and run the program > again.
the problem caught my interest, and the way you used a non-mmaped file in a place where using mmap was pretty much obvious (IMVHO) bothered me enough to overcome my laziness. It didn't overcome it by *much*, mind you, so the following probably only works on python 2.3, on Linux, in Argentina, and with FASTA data that looks like the sample I was able to download. However, having said that, the sample I downloaded was one 46MiB file, and reading it in on my notebook was fast enough that I ripped out the saving/reloading of the index and just reindex it every time. Adding back the persistant index is trivial. [ time passes ] In fact, I just found a several-gigabyte fasta file, and I guess you'd want the index for that one; I put the code back in. And now I should really go to bed, because this is very interesting but won't pay the bills. import os, mmap, marshal from UserDict import DictMixin class FASTA(DictMixin): def __init__(self, filename): self.file = f = open(filename) fno = f.fileno() stat = os.fstat(fno) self.map = mmap.mmap(fno, stat.st_size, access=mmap.ACCESS_COPY) self.checkindex() def __getitem__(self, key): p0, pf = self.index[key] m = self.map return m[p0:pf] def keys(self): return self.index.keys() def __contains__(self, key): return self.index.__contains__(key) def checkindex(self): indexfile = self.file.name + ".idx" if os.path.exists(indexfile): # and os.stat(indexfile).st_mtime > os.stat(self.file.name).st_mtime: self.index = marshal.load(open(indexfile, "rb")) else: print 'generating index...' self.genindex() marshal.dump(self.index, open(indexfile, "wb")) print 'done.' def genindex(self): index = {} m = self.map last = None while 1: pos = m.find('>') if last is not None: index[last] = (index[last], pos) if pos == -1: break m.seek(pos) line = m.readline() pos = m.tell() # this is the bit that probably only works with FASTA # files like I was able to find on the 'net. sep = line.index(' ') if sep == -1: name = line[1:].strip() else: name = line[1:sep].strip() index[name] = pos last = name self.index = index db = FASTA("/home/john/tmp/uniprot_sprot.fasta") print db["104K_THEPA"] -- John Lenton ([EMAIL PROTECTED]) -- Random fortune: "Those who believe in astrology are living in houses with foundations of Silly Putty." - Dennis Rawlins, astronomer
signature.asc
Description: Digital signature
-- http://mail.python.org/mailman/listinfo/python-list