On Apr 27, 8:31 pm, blaine <[EMAIL PROTECTED]> wrote:
> Hey everyone,
> For the regular expression gurus...
>
> I'm trying to write a string matching algorithm for genomic
> sequences. I'm pulling out Genes from a large genomic pattern, with
> certain start and stop codons on either side. This is simple
> enough... for example:
>
> start = AUG stop=AGG
> BBBBBBAUGWWWWWWAGGBBBBBB
>
> So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
> This works great with my current regular expression.
>
> The problem, however, is that codons come in sets of 3 bases. So
> there are actually three different 'frames' I could be using. For
> example:
> ABCDEFGHIJ
> I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.
>
> So finally, my question. How can I represent this in a regular
> expression? :) This is what I'd like to do:
> (Find all groups of any three characters) (Find a start codon) (find
> any other codons) (Find an end codon)
>
> Is this possible? It seems that I'd want to do something like this:
> (\w\w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
> three non-whitespace characters, followed by AUG \s AGG, and then
> anything else. I hope I am making sense. Obviously, however, this
> will make sure that ANY set of three characters exist before a start
> codon. Is there a way to match exactly, to say something like 'Find
> all sets of three, then AUG and AGG, etc.'. This way, I could scan
> for genes, remove the first letter, scan for more genes, remove the
> first letter again, and scan for more genes. This would
> hypothetically yield different genes, since the frame would be
> shifted.
>
> This might be a lot of information... I appreciate any insight. Thank
> you!
> Blaine

Here's one idea (untested):

    s = {}
    # len(genes) - 2 so that the final three-base window is included
    for x in range(len(genes) - 2):
        s[x] = genes[x:x + 3]

This maps every start position in genes to the three bases beginning there. You might like Python's 'string slicing' feature.
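
If you would rather stay with regular expressions, here is a rough sketch of the frame-by-frame scan you describe. Instead of removing the first letter, it starts matching at offsets 0, 1 and 2 and steps three bases at a time. The function name find_genes and its start/stop parameters are made up for illustration, and it assumes the sequence is one string with no whitespace or newlines in it.

```python
import re

def find_genes(seq, start="AUG", stop="AGG"):
    """Return (frame, position, gene) tuples found in each of the 3 frames.

    Hypothetical helper: assumes seq has no whitespace or newlines.
    """
    # Start codon, then as few whole codons as possible, then the stop codon.
    # (?:...) is a non-capturing group of any three characters, so the stop
    # codon can only be recognised on a codon boundary.
    pattern = re.compile(re.escape(start) + r"(?:...)*?" + re.escape(stop))
    hits = []
    for frame in range(3):
        pos = frame
        while pos + 3 <= len(seq):
            # match() with a pos argument only tries to match at that index.
            m = pattern.match(seq, pos)
            if m:
                hits.append((frame, pos, m.group()))
                pos = m.end()  # resume after this gene, still in frame
            else:
                pos += 3       # move on to the next codon in this frame
    return hits

print(find_genes("BBBBBBAUGWWWWWWAGGBBBBBB"))
# prints [(0, 6, 'AUGWWWWWWAGG')]
```

Note that a stop codon sitting out of frame is ignored: for "AUGWAGGWWAGG" the match is the whole string, not "AUGWAGG". Genes nested inside an earlier match in the same frame are skipped, which may or may not be what you want.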