On Thu, Jan 13, 2005 at 04:41:45PM -0800, Robert Kern wrote:
Jeff Shannon wrote:
(Plus, if this format might be used for RNA sequences as well as DNA
sequences, you've got at least a fifth base to represent, which means
you need at least three bits per base, which means only two bases per
byte (or else base-encodings split across byte-boundaries).... That gets
ugly real fast.)
Not to mention all the IUPAC symbols for incompletely specified bases
(e.g. R = A or G).
http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html
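For illustration only, a rough sketch of how the ambiguity codes could be handled: give each unambiguous base one bit, so every IUPAC nucleotide symbol becomes a 4-bit mask. The names IUPAC_MASK and matches are made up for this example and are not from any library mentioned in this thread.

```python
# One bit per unambiguous base; each ambiguity code is the bitwise OR
# of the bases it can stand for (e.g. R = A or G).
A, C, G, T = 1, 2, 4, 8

# Hypothetical lookup table for this sketch (U and gap symbols omitted).
IUPAC_MASK = {
    "A": A, "C": C, "G": G, "T": T,
    "R": A | G, "Y": C | T, "S": C | G, "W": A | T,
    "K": G | T, "M": A | C,
    "B": C | G | T, "D": A | G | T, "H": A | C | T, "V": A | C | G,
    "N": A | C | G | T,
}


def matches(symbol, base):
    """Return True if the IUPAC symbol allows the given unambiguous base."""
    return bool(IUPAC_MASK[symbol] & IUPAC_MASK[base])


print(matches("R", "A"), matches("R", "C"))
```

With this layout the 15 symbols fit in four bits each, i.e. two per byte, assuming gaps and U do not need codes of their own.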
Or, for those of us working with proteins as well, all the single-letter
codes for amino acids:
http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html
That means lots more bits.
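To put rough numbers on the bit counts discussed in this thread, here is a small sketch. The helper bits_per_symbol is a name invented for this example, and the alphabet sizes assume no gap or stop symbols are included.

```python
def bits_per_symbol(alphabet_size):
    """Smallest number of bits that can distinguish alphabet_size symbols."""
    return max(1, (alphabet_size - 1).bit_length())


# Alphabet sizes are assumptions for this sketch: no gap or stop symbols.
alphabets = [
    ("DNA (ACGT)", 4),
    ("DNA + RNA (ACGTU)", 5),
    ("IUPAC nucleotide codes", 15),
    ("standard amino acids", 20),
]

for name, size in alphabets:
    bits = bits_per_symbol(size)
    # Whole symbols per byte if encodings may not straddle byte boundaries.
    print(f"{name}: {bits} bits/symbol, {8 // bits} per byte")
```

If encodings are not allowed to cross byte boundaries, anything beyond plain ACGT drops from four symbols per byte to two, and amino acids get a byte each.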
I have a database with approx. 3 million proteins in it and would not want to be using
a pure-Python approach :)
Michael