On 12 Jan 2005 14:46:07 -0800, "Chris Lasher" <[EMAIL PROTECTED]> wrote:
>Hello,
>I have a rather large (100+ MB) FASTA file from which I need to
>access records in a random order. The FASTA format is a standard format
>for storing molecular biological sequences. Each record contains a
>header line for describing the sequence that begins with a '>'
>(right-angle bracket) followed by lines that contain the actual
>sequence data. Three example FASTA records are below:
>
Others have probably solved your basic problem, or pointed the way.
I'm just curious.

Given that the information content is 2 bits per character, yet each
character takes up 8 bits of storage, is there a good reason for
storing and/or transmitting the sequences this way? I.e., it is easy
to think up a count-prefixed compressed format packing 4:1 in the
subsequent data bytes (except for the last byte, which may hold fewer
than four 2-bit codes).

I'm wondering how the data is actually used once records are retrieved.
(But I'm too lazy to explore the biopython.org link.)

>>CW127_A01
>TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
>TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
>GCATTAAACAT
>>CW127_A02
>TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
>TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
>GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
>>CW127_A03
>TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
>TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
>GCATTAAACATTCCGCCTGGG
>...

Regards,
Bengt Richter
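
P.S. To make the count-prefixed 4:1 packing idea concrete, here is a rough sketch, not a real or standard format. Everything specific in it is an assumption made for illustration: the names pack and unpack, the 4-byte big-endian count, the code assignment A=0, C=1, G=2, T=3, and the choice to put the first base in the high bits of each byte.

```python
import struct

# Assumed 2-bit code assignment; any fixed mapping would serve.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string as a base count followed by 2-bit codes."""
    # Count prefix: assumed to be a 32-bit big-endian unsigned integer.
    out = bytearray(struct.pack(">I", len(seq)))
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # A short final chunk is left-aligned; the unused low bits stay zero.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data):
    """Invert pack(): read the count, then decode four bases per byte."""
    (count,) = struct.unpack(">I", data[:4])
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # The count prefix says how many codes in the last byte are real.
    return "".join(bases[:count])
```

With this sketch, the 11-base line GCATTAAACAT packs into 7 bytes (4 for the count, 3 for the data), and unpack(pack(s)) returns s. As written it raises KeyError for any character other than upper-case A, C, G or T, so it only applies if the sequences really are limited to those four letters.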