Re: What strategy for random accession of records in massive FASTA file?

Neil Benn Fri, 14 Jan 2005 01:25:36 -0800

Jeff Shannon wrote:

Chris Lasher wrote:
And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.
This 105MB FASTA file is 8.3 MB gzip-ed.
And a 4:1 packed-format file would be ~26MB. It'd be interesting to see how that packed-format file would compress, but I don't care enough to write a script to convert the FASTA file into a packed-format file to experiment with... ;)

Short version, then, is that yes, size concerns (such as they may be) are outweighed by speed and conceptual simplicity (i.e. avoiding a huge mess of bit-masking every time a single base needs to be examined, or a human-(semi-)readable display is needed).

(Plus, if this format might be used for RNA sequences as well as DNA sequences, you've got at least a fifth base to represent, which means you need at least three bits per base, which means only two bases per byte (or else base-encodings split across byte-boundaries).... That gets ugly real fast.)
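
For anyone who wants to see what the bit-masking in question looks like, here is a minimal sketch of a 4:1 packed format. The 2-bit code table and the names `pack`, `base_at` and `bits_per_symbol` are made up for illustration; they do not come from any existing library or file format.

```python
# Hypothetical 2-bit codes for a four-letter alphabet (illustrative assumption).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so every byte is read the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def base_at(packed, index):
    """Return the base at position index: the shift-and-mask step."""
    byte = packed[index // 4]
    shift = 2 * (3 - index % 4)
    return BASES[(byte >> shift) & 0b11]


def bits_per_symbol(alphabet_size):
    """Smallest whole number of bits that distinguishes alphabet_size symbols."""
    return max(1, (alphabet_size - 1).bit_length())
```

Because the last byte is padded, the true sequence length has to be stored somewhere else. With a fifth symbol, `bits_per_symbol(5)` comes out at 3, which is where the two-bases-per-byte figure comes from.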
Jeff Shannon
Technician/Programmer
Credit International

Hello,

Just to clear up a few things on the topic:

If the file denotes DNA sequences, there are five basic identifiers: A, G, C, T and X (where X means 'dunno!').

If the file denotes RNA sequences, you will still only need five basic identifiers; the only difference is that the T is replaced by a U.

One very good way I have found to parse large files of this nature (I've done it for many a use case) is to write a SAX parser for the file. You register a content handler, receive events from the parser and do whatever you like with them. If you write the parser carefully, it streams the file and drops old lines from memory, so you have a scalable solution (rather than keeping everything in memory).
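
As a rough sketch of that event-driven idea, assuming a plain FASTA layout (a `>` header line followed by sequence lines): the class and function names below (`FastaHandler`, `parse_fasta`, `LengthCounter`) are hypothetical and only borrow the handler pattern from SAX; this is not the `xml.sax` API.

```python
import io


class FastaHandler:
    """Base handler; override only the callbacks you care about."""

    def start_record(self, header):
        pass

    def sequence_line(self, line):
        pass

    def end_record(self):
        pass


def parse_fasta(lines, handler):
    """Stream over an iterable of lines, firing events as each line is read."""
    in_record = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if in_record:
                handler.end_record()
            handler.start_record(line[1:])
            in_record = True
        elif in_record:
            handler.sequence_line(line)
    if in_record:
        handler.end_record()


class LengthCounter(FastaHandler):
    """Example handler: keeps each record's length, never the sequence itself."""

    def __init__(self):
        self.lengths = {}
        self._current = None

    def start_record(self, header):
        # Treat the first space-delimited token of the header as the record id.
        self._current = header.split(" ", 1)[0]
        self.lengths[self._current] = 0

    def sequence_line(self, line):
        self.lengths[self._current] += len(line)


# Usage: any iterable of lines works, including an open file object.
demo = io.StringIO(">seq1 first record\nACGT\nAC\n>seq2\nGGG\n")
counter = LengthCounter()
parse_fasta(demo, counter)
```

Only one line is held at a time, so memory use depends on what the handler chooses to keep, not on the size of the file.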

As an aside, I would seriously consider parsing your files and putting this information in a small local DB. It's really not much work, and the 'pure' Python thing is a misnomer: whichever persistence mechanism you use (file, DB, etching it on the floor with a small robot accepting Logo commands, etc.) is unlikely to be pure Python.

The advantage of putting it in a DB will show up later when you have fast and powerful retrieval capability.
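
To give a feel for how little work the DB route is, here is a minimal sketch using SQLite through Python's `sqlite3` module. The choice of SQLite, the table layout and the names `load_fasta` and `fetch` are all assumptions made for the example, and it treats the first space-delimited token of each header as the record id.

```python
import sqlite3


def _insert(conn, header, chunks):
    """Store one record, splitting the header into id and description."""
    ident, _, description = header.partition(" ")
    conn.execute(
        "INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
        (ident, description, "".join(chunks)),
    )


def load_fasta(conn, lines):
    """Read FASTA lines and store one row per record, keyed by id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                _insert(conn, header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        _insert(conn, header, chunks)
    conn.commit()


def fetch(conn, ident):
    """Return the sequence stored under ident, or None if it is absent."""
    row = conn.execute(
        "SELECT sequence FROM records WHERE id = ?", (ident,)
    ).fetchone()
    return row[0] if row else None
```

Passing a file path to `sqlite3.connect` instead of `":memory:"` would persist the table between runs; the primary key on `id` is what makes lookups by accession fast.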

Cheers,

Neil

--

Neil Benn
Senior Automation Engineer
Cenix BioScience
BioInnovations Zentrum
Tatzberg 47
D-01307
Dresden
Germany

Tel : +49 (0)351 4173 154
e-mail : [EMAIL PROTECTED]
Cenix Website : http://www.cenix-bioscience.com
