On Fri, Apr 1, 2011 at 2:54 PM, candide <candide@free.invalid> wrote:
> Another question relative to regular expressions.
>
> How to extract all word duplicates in a given text by use of regular
> expression methods ? To make the question concrete, if the text is
>
> ------------------
> Now is better than never.
> Although never is often better than *right* now.
> ------------------
>
> duplicates are :
>
> ------------------------
> better is now than never
> ------------------------
>
> Some code can solve the question, for instance
>
> # ------------------
> import re
>
> regexp=r"\w+"
>
> c=re.compile(regexp, re.IGNORECASE)
>
> text="""
> Now is better than never.
> Although never is often better than *right* now."""
>
> z=[s.lower() for s in c.findall(text)]
>
> for d in set([s for s in z if z.count(s)>1]):
>     print d,
> # ------------------
>
> but I'm in search of "plain" re code.

You could use a look-ahead assertion with a named group:

>>> regexp = r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)'
>>> c = re.compile(regexp, re.IGNORECASE | re.DOTALL)
>>> c.findall(text)

But note that this is computationally expensive, since the look-ahead
rescans the rest of the text for every word. The regex that you posted
is probably more efficient if you use a collections.Counter object
instead of z.count.

Cheers,
Ian
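
P.S. A minimal sketch of the Counter variant mentioned above, assuming
Python 3 (where print is a function) and the same sample text. The names
counts and duplicates are only illustrative, and the result is sorted
just to get a stable order:

```python
import re
from collections import Counter

text = """
Now is better than never.
Although never is often better than *right* now."""

# Count every lowercased word in a single pass, instead of calling
# z.count(s) once per word (which rescans the whole list each time).
counts = Counter(word.lower() for word in re.findall(r"\w+", text))

# Keep only the words seen more than once.
duplicates = sorted(word for word, n in counts.items() if n > 1)
print(" ".join(duplicates))
```

Unlike the look-ahead regex, this reports each duplicated word once,
however many times it occurs.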