You don't say how this will be used, but here goes:
1) Read the records and put into dictionary with key of sequence (from header) and data being the sequence data. Use shelve to store the dictionary for subsequent runs (if load time is excessive).
2) Take a look at Gadfly (gadfly.sourceforge.net). It provides you with Python SQL-like database and may be better solution if data is basically static and you do lots of processing.
All depends on how you use the data.
Regards, Larry Bates Syscon, Inc.
Chris Lasher wrote:
Hello, I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for describing the sequence that begins with a '>' (right-angle bracket) followed by lines that contain the actual sequence data. Three example FASTA records are below:
CW127_A01
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA GCATTAAACAT
CW127_A02
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
CW127_A03
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA GCATTAAACATTCCGCCTGGG ...
Since the file I'm working with contains tens of thousands of these records, I believe I need to find a way to hash this file such that I can retrieve the respective sequence more quickly than I could by parsing through the file request-by-request. However, I'm very new to Python and am still very low on the learning curve for programming and algorithms in general; while I'm certain there are ubiquitous algorithms for this type of problem, I don't know what they are or where to look for them. So I turn to the gurus and accost you for help once again. :-) If you could help me figure out how to code a solution that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to keep it in Python only, even though I know interaction with a relational database would provide the fastest method--the group I'm trying to write this for does not have access to a RDBMS.) Thanks very much in advance, Chris
-- http://mail.python.org/mailman/listinfo/python-list