Hello, I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for describing the sequence that begins with a '>' (right-angle bracket) followed by lines that contain the actual sequence data. Three example FASTA records are below:
>CW127_A01 TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA GCATTAAACAT >CW127_A02 TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG >CW127_A03 TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA GCATTAAACATTCCGCCTGGG ... Since the file I'm working with contains tens of thousands of these records, I believe I need to find a way to hash this file such that I can retrieve the respective sequence more quickly than I could by parsing through the file request-by-request. However, I'm very new to Python and am still very low on the learning curve for programming and algorithms in general; while I'm certain there are ubiquitous algorithms for this type of problem, I don't know what they are or where to look for them. So I turn to the gurus and accost you for help once again. :-) If you could help me figure out how to code a solution that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to keep it in Python only, even though I know interaction with a relational database would provide the fastest method--the group I'm trying to write this for does not have access to a RDBMS.) Thanks very much in advance, Chris -- http://mail.python.org/mailman/listinfo/python-list