Hey everyone,

For the regular expression gurus... I'm trying to write a string matching algorithm for genomic sequences. I'm pulling genes out of a large genomic pattern, with certain start and stop codons on either side. This is simple enough. For example:

    start = AUG
    stop  = AGG

    BBBBBBAUGWWWWWWAGGBBBBBB

So I obviously want to pull out AUGWWWWWWAGG (and all other matches). This works great with my current regular expression.

The problem, however, is that codons come in sets of three bases, so there are actually three different 'frames' I could be using. For example, given

    ABCDEFGHIJ

I could have

    ABC DEF GHI
    BCD EFG HIJ
    CDE FGH IJx

and so on.

So finally, my question: how can I represent this in a regular expression? :) This is what I'd like to do:

    (Find all groups of any three characters)
    (Find a start codon)
    (Find any other codons)
    (Find an end codon)

Is this possible? It seems that I'd want to do something like this:

    (\w\w\w)+(AUG)(\s)(AGG)(\s)*

where \w\w\w matches EXACTLY all sets of three non-whitespace characters, followed by AUG \s AGG, and then anything else. I hope I am making sense. Obviously, however, this will only make sure that ANY set of three characters exists before a start codon.

Is there a way to match exactly, to say something like 'Find all sets of three, then AUG and AGG, etc.'? This way, I could scan for genes, remove the first letter, scan for more genes, remove the first letter again, and scan for more genes. This would hypothetically yield different genes, since the frame would be shifted.

This might be a lot of information... I appreciate any insight.

Thank you!

Blaine
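
For reference, here is a minimal sketch of one way the frame-aware search described above might be expressed with Python's re module. It is only an illustration under some assumptions: the codons AUG and AGG come from the example in the post, the name find_genes is hypothetical, and \w stands in for "any base" because the sample sequence uses placeholder letters. Rather than stripping the first letter and rescanning, it wraps the pattern in a lookahead so every start position is tried, and reports the frame as the start index modulo 3.

```python
import re

# Codons taken from the example in the post; swap in real ones as needed.
START, STOP = "AUG", "AGG"

# The outer lookahead (?=...) consumes nothing, so matches may overlap and
# every start codon position in the sequence is tried.
# (?:\w{3})*? lazily consumes whole codons, so the stop codon that is found
# is the first one lying in the same frame as the start codon.
GENE_RE = re.compile(rf"(?=({START}(?:\w{{3}})*?{STOP}))")


def find_genes(seq):
    """Return (frame, start_index, gene) for every in-frame start/stop pair."""
    return [(m.start() % 3, m.start(), m.group(1))
            for m in GENE_RE.finditer(seq)]


if __name__ == "__main__":
    for frame, pos, gene in find_genes("BBBBBBAUGWWWWWWAGGBBBBBB"):
        print(frame, pos, gene)
```

On the sample sequence from the post this prints "0 6 AUGWWWWWWAGG". A stop codon that is not a whole number of codons away from the start (for example in "AUGWAGGWW") is skipped, which is the frame constraint the question asks about.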