2016-06-15 5:28 GMT+02:00 Yubin Ruan <ablacktsh...@gmail.com>:
> Hi everyone,
> I am struggling writing a right regex that match what I want:
>
> Problem Description:
>
> Given a string like this:
>
>     >>>string = "false_head <a>aaa</a> <a>bbb</a> false_tail \
>             true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a>
> true_tail"
>
> I want to match the all the text surrounded by those "<a> </a>",
> but only if those "<a> </a>" locate **in some distance** behind "true_head".
> That is, I expect to result to be like this:
>
>     >>>import re
>     >>>result = re.findall("the_regex",string)
>     >>>print result
>     ["ccc","ddd","eee"]
>
> How can I write a regex to match that?
> I have try to use the **positive lookbehind assertion** in python regex,
> but it does not allowed variable length of lookbehind.
>
> Thanks in advance,
> Ruan
> --
> https://mail.python.org/mailman/listinfo/python-list

Hi,
HTML-like data is generally not well suited to parsing with regular
expressions, as the previous answers explained (especially when comments
and nesting are heavily involved).
However, if this fits your data and use case, you can use variable-length
lookarounds with the much enhanced "regex" library for Python:
https://pypi.python.org/pypi/regex

Your pattern might then simply have the form you most likely intended,
e.g.:

>>> import regex
>>> regex.findall(r"(?<=true_head.*)<a>([^<]+)</a>(?=.*true_tail)",
...     "false_head <a>aaa</a> <a>bbb</a> false_tail "
...     "true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> "
...     "true_tail <a>fff</a> another_false_tail")
['ccc', 'ddd', 'eee']
>>>

If you are accustomed to using regular expressions, I'd certainly
recommend this excellent library (besides unlimited lookarounds, there are
repeated and recursive patterns, many Unicode-related enhancements,
powerful character set operations, even fuzzy matching, and much more).

hth,
   vbr
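
P.S. In case installing a third-party package is not an option, here is a
minimal sketch of a different technique that needs only the standard "re"
module. It does not use variable-length lookbehind at all: it first
isolates the text between "true_head" and "true_tail", then collects the
<a>...</a> contents inside that region. The helper name links_between and
its head/tail parameters are made up for illustration, and the sample
string is the one from the session above.

```python
import re

text = ("false_head <a>aaa</a> <a>bbb</a> false_tail "
        "true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> "
        "true_tail <a>fff</a> another_false_tail")


def links_between(text, head="true_head", tail="true_tail"):
    """Return the <a>...</a> contents found between head and tail.

    Hypothetical helper: a two-step search with the stdlib re module,
    not the lookaround pattern of the third-party regex library.
    """
    results = []
    # Step 1: isolate each head ... tail region (non-greedy; DOTALL lets
    # the region span line breaks).
    region_re = re.compile(
        re.escape(head) + r"(.*?)" + re.escape(tail), re.DOTALL)
    for region in region_re.finditer(text):
        # Step 2: collect the tag contents inside that region only.
        results.extend(re.findall(r"<a>([^<]+)</a>", region.group(1)))
    return results


print(links_between(text))  # ['ccc', 'ddd', 'eee']
```

Note that this treats every "true_head ... true_tail" pair as a separate
region, so on input with several such pairs it may give different results
than the single lookaround pattern; which behaviour is right depends on
your data.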