On Mon, Jan 20, 2014 at 2:44 AM, km <srikrishnamo...@gmail.com> wrote:
> I am trying to find sub sequence patterns but constrained by the order in
> which they occur
> For example
>
> >>> p = re.compile('(CAA)+?(TCT)+?(TA)+?')
> >>> p.findall('CAACAACAATCTTCTTCTTCTTATATA')
> [('CAA', 'TCT', 'TA')]
>
> But I instead find only one instance of the CAA/TCT/TA in that order.
> How can I get 3 matches of CAA, followed by four matches of TCT followed by
> 2 matches of TA ?
> Well these patterns (CAA/TCT/TA) can occur any number of times and atleast
> once so I have to use + in the regex.

You want to include the '+' inside the capturing parens so that the
repetitions are part of the captured text, but you still want to group CAA
etc. together so that the '+' applies to the whole motif; for that, you can
use non-capturing groups, (?:...).

I don't see how TA could ever match two, though. It'd match once as-is, or
thrice if you make the repetition greedy (drop the '?' after each '+').

>>> import re
>>> p = re.compile('((?:CAA)+?)((?:TCT)+?)((?:TA)+?)')
>>> p.findall('CAACAACAATCTTCTTCTTCTTATATA')
[('CAACAACAA', 'TCTTCTTCTTCT', 'TA')]

-- Devin
--
https://mail.python.org/mailman/listinfo/python-list
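
For what it's worth, here is a minimal sketch of the greedy variant mentioned above, with the repeat counts derived from the captured lengths. The names seq, runs and counts, and the hard-coded motif lengths (3, 3, 2), are just illustrative assumptions, not anything from the original post.

```python
import re

seq = 'CAACAACAATCTTCTTCTTCTTATATA'

# Greedy repetition inside each capturing group; (?:...) groups the
# motif for '+' without creating an extra capture.
p = re.compile(r'((?:CAA)+)((?:TCT)+)((?:TA)+)')

runs = p.findall(seq)
print(runs)    # [('CAACAACAA', 'TCTTCTTCTTCT', 'TATATA')]

# Repeat count = captured length divided by motif length.
# The motif lengths (3, 3, 2) are hard-coded here for illustration only.
counts = [tuple(len(part) // n for part, n in zip(run, (3, 3, 2)))
          for run in runs]
print(counts)  # [(3, 4, 3)]
```

With the lazy pattern the last count would be 1 instead of 3, since a trailing '+?' stops at the first TA.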