In article <[EMAIL PROTECTED]>, nuttydevil wrote:
> I have many notepad documents that all contain long chunks of genetic
> code. They look something like this:
>
> atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
> tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
> agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
> ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa
>
> Basically, I want to design a program using python that can open and
> read these documents. However, I want them to be read 3 base pairs at a
> time (to analyse them codon by codon) and find the value that each
> codon has a value assigned to it. An example of this is below:
>
> ** If the three base pairs were UUU the value assigned to it (from the
> codon value table) would be 0.296
>
> The program has to read all the sequence three pairs at a time, then I
> want to get all the values for each codon, multiply them together and
> put them to the power of 1 / the length of the sequence in codons
> (which is the length of the whole sequence divided by three).
>
I don't really understand precisely what you're trying to do. First off,
those aren't base pairs, they're bases. Only when you have double-stranded
DNA (or RNA, or some other oddball cases) would they be base pairs.

Second, I don't know what the codon-to-value function is. Is this a
frequency (i.e. n occurrences of codon X out of N total codons)? Or is
the lookup table provided for you?

Anyway, I can help you with most of the preprocessing. For example,

> However, to make things even more complicated, the notebook sequences
> are in lowercase and the codon value table is in uppercase, so the
> sequences need to be converted into uppercase. Also, the Ts in the DNA
> sequences need to be changed to Us (again to match the codon value
> table). And finally, before the DNA sequences are read and analysed I
> need to remove the first 50 codons (i.e. the first 150 letters) and the
> last 20 codons (the last 60 letters) from the DNA sequence. I've also
> been having problems ensuring the program reads ALL the sequence 3
> letters at a time.

So, if the file is called "notepad.txt", I'd do what you described above as:

    with open("notepad.txt") as f:
        lines = f.readlines()                 ## read all lines
    ## strip newlines and join into one string
    ## (in case codon boundaries cross lines)
    seq = "".join(line.strip() for line in lines)
    seq = seq[150:-60]           ## drop first 50 codons and last 20 codons
    seq = seq.upper()            ## match the uppercase codon table
    seq = seq.replace("T", "U")  ## DNA letters -> RNA letters
    print(seq)

    codons = []
    for i in range(0, len(seq), 3):
        codons.append(seq[i:i+3])
    print(codons)

That gets you about 30% of the way there.

Dave
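
For the remaining step (look up each codon's value, multiply them all together, and raise the product to the power 1/number-of-codons), here is a minimal sketch rather than a definitive implementation. It assumes the lookup table is provided as a dictionary: CODON_VALUES is a hypothetical name, only the UUU value of 0.296 comes from the question, and the other two entries are made-up placeholders. The helper names split_codons and geometric_mean are also just illustrative. It sums logarithms instead of multiplying directly, since a long product of values below 1 can underflow to zero.

```python
import math

# Hypothetical table: only UUU = 0.296 comes from the original question;
# the other entries are made-up placeholders for illustration.
CODON_VALUES = {"UUU": 0.296, "AUG": 1.0, "GCU": 0.5}

def split_codons(seq):
    """Split seq into complete 3-letter codons, dropping a trailing partial one."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def geometric_mean(codons, table):
    """Return (product of the codon values) ** (1 / number of codons)."""
    if not codons:
        raise ValueError("no codons to score")
    # Summing logs is equivalent to multiplying the values, but avoids
    # underflow on long sequences. Values must be strictly positive.
    log_sum = sum(math.log(table[c]) for c in codons)
    return math.exp(log_sum / len(codons))

seq = "atggctttt".upper().replace("T", "U")  # lowercase DNA -> uppercase RNA letters
codons = split_codons(seq)
score = geometric_mean(codons, CODON_VALUES)
print(codons, score)
```

A codon that is missing from the table raises a KeyError here, which is probably what you want while testing, since it flags stray characters in the input file.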