You don't say how this will be used, but here goes:

1) Read the records and put into dictionary with key
of sequence (from header) and data being the sequence
data.  Use shelve to store the dictionary for subsequent
runs (if load time is excessive).

2) Take a look at Gadfly (gadfly.sourceforge.net).  It
provides you with Python SQL-like database and may be
better solution if data is basically static and you
do lots of processing.

All depends on how you use the data.

Regards,
Larry Bates
Syscon, Inc.

Chris Lasher wrote:
Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:


CW127_A01

TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA GCATTAAACAT

CW127_A02

TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG

CW127_A03

TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA GCATTAAACATTCCGCCTGGG ...

Since the file I'm working with contains tens of thousands of these
records, I believe I need to find a way to hash this file such that I
can retrieve the respective sequence more quickly than I could by
parsing through the file request-by-request. However, I'm very new to
Python and am still very low on the learning curve for programming and
algorithms in general; while I'm certain there are ubiquitous
algorithms for this type of problem, I don't know what they are or
where to look for them. So I turn to the gurus and accost you for help
once again. :-) If you could help me figure out how to code a solution
that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
keep it in Python only, even though I know interaction with a
relational database would provide the fastest method--the group I'm
trying to write this for does not have access to a RDBMS.)
Thanks very much in advance,
Chris

--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to