On Apr 28, 6:30 am, Nick Craig-Wood <[EMAIL PROTECTED]> wrote:
> blaine <[EMAIL PROTECTED]> wrote:
> > I'm trying to write a string matching algorithm for genomic
> > sequences. I'm pulling out Genes from a large genomic pattern, with
> > certain start and stop codons on either side. This is simple
> > enough... for example:
> >
> > start = AUG stop=AGG
> > BBBBBBAUGWWWWWWAGGBBBBBB
> >
> > So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
> > This works great with my current regular expression.
> >
> > The problem, however, is that codons come in sets of 3 bases. So
> > there are actually three different 'frames' I could be using. For
> > example:
> > ABCDEFGHIJ
> > I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.
> >
> > So finally, my question. How can I represent this in a regular
> > expression? :) This is what I'd like to do:
> > (Find all groups of any three characters) (Find a start codon) (find
> > any other codons) (Find an end codon)
> >
> > Is this possible? It seems that I'd want to do something like this:
> > (\w\w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
> > three non-whitespace characters, followed by AUG \s AGG, and then
> > anything else.
>
> I'm not sure what the \s are doing in there - there doesn't appear to
> be any whitespace in your examples.
>
> > I hope I am making sense. Obviously, however, this will make sure
> > that ANY set of three characters exist before a start codon. Is
> > there a way to match exactly, to say something like 'Find all sets
> > of three, then AUG and AGG, etc.'.
>
> I think you want
>
>   ^(\w\w\w)*(AUG)((\w\w\w)*?)(AGG)
>
> which will match up 0 or more triples, match AUG match 0 or more triples
> then AGG. The ? makes it a minimum match otherwise you'll match more
> than you expect if there are two AUG...AGG sequences in a given genome.
>
> >>> import re
> >>> m=re.compile(r"^(\w\w\w)*(AUG)((\w\w\w)*?)(AGG)")
> >>> m.search("BBBBBBAUGWWWWWWAGGBBBBBB").groups()
> ('BBB', 'AUG', 'WWWWWW', 'WWW', 'AGG')
> >>> m.search("BBBQBBBAUGWWWWWWAGGBBBBBB")
> >>> m.search("BBBQQBBBAUGWWWWWWAGGBBBBBB")
> >>> m.search("BBBQQBBQBAUGWWWWWWAGGBBBBBB")
> <_sre.SRE_Match object at 0xb7de33e0>
> >>> m.search("BBBQQBBQBAUGWWWWWWAGGBBBBBB").groups()
> ('BQB', 'AUG', 'WWWWWW', 'WWW', 'AGG')
> >>> m.search("BBBQQBBQBAUGWQWWWWWAGGBBBBBB")
> >>> m.search("BBBQQBBQBAUGWWWWQWWAGGBBBBBB")
> >>> m.search("BBBQQBBQBAUGWWQWWQWWAGGBBBBBB")
> >>> m.search("BBBQQBBQBAUGWWQWAWQWWAGGBBBBBB")
> <_sre.SRE_Match object at 0xb7de33e0>
> >>> m.search("BBBQQBBQBAUGWWQWAWQWWAGGBBBBBB").groups()
> ('BQB', 'AUG', 'WWQWAWQWW', 'QWW', 'AGG')
> >>>
>
> > This way, I could scan for genes, remove the first letter, scan for
> > more genes, remove the first letter again, and scan for more genes.
> > This would hypothetically yield different genes, since the frame
> > would be shifted.
>
> Or you could just unconstrain the first match and it will do them all
> at once :-
>
>   (AUG)((\w\w\w)*?)(AGG)
>
> You could run this with re.findall, but beware that this will only
> return non-overlapping matches which may not be what you want.
>
> I'm not sure re's are the best tool for the job, but they should give
> you a quick idea of what the answers might be.
>
> --
> Nick Craig-Wood <[EMAIL PROTECTED]> -- http://www.craig-wood.com/nick

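For the archive, here is a minimal sketch of one way around the
non-overlapping limitation of re.findall mentioned in the quoted text.
It is not from Nick's message: it swaps in a zero-width lookahead so the
scan can restart one character after each hit, and it reports the frame
as the start offset modulo 3. The names GENE and find_genes are made up
for illustration.

```python
import re

# The whole gene pattern sits inside a lookahead, so each match is
# zero-width and the scanner moves on by one character afterwards.
# That lets overlapping AUG...AGG candidates be reported as well.
GENE = re.compile(r"(?=(AUG(?:\w\w\w)*?AGG))")


def find_genes(seq):
    """Return (start, frame, gene) for every AUG...AGG candidate in seq.

    frame is start % 3, i.e. which of the three reading frames the
    start codon falls in, counted from the beginning of seq.
    """
    return [(m.start(), m.start() % 3, m.group(1))
            for m in GENE.finditer(seq)]


if __name__ == "__main__":
    print(find_genes("BBBBBBAUGWWWWWWAGGBBBBBB"))
    print(find_genes("BAUGAUGWWWAGGB"))
```

On the second test string this should report two candidates, one
starting at each AUG, whereas a plain
re.findall(r"AUG(?:\w\w\w)*?AGG", ...) returns only the first, longer
one. As in the quoted pattern, \w stands in for "any base", and the lazy
*? stops at the first in-frame AGG after each start codon.
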
Thank you! Your suggestion was very helpful. Also, thank you for the
package suggestions. BioPython is on my plate to check out, but I needed
a quick fix for this one. The documentation for BioPython seems pretty
thick - I'm not a biologist, so I'm not even sure what kind of packages
I should be looking for.

Thanks!
Blaine