alister writes:

> On Wed, 15 Jun 2016 15:55:42 +0300, Jussi Piitulainen wrote:
>
>> alister writes:
>>
>>> On Tue, 14 Jun 2016 20:28:24 -0700, Yubin Ruan wrote:
>>>
>>>> Hi everyone,
>>>> I am struggling writing a right regex that match what I want:
>>>>
>>>> Problem Description:
>>>>
>>>> Given a string like this:
>>>>
>>>>     >>> string = "false_head <a>aaa</a> <a>bbb</a> false_tail \
>>>>     true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a>
>>>>     true_tail"
>>>>
>>>> I want to match the all the text surrounded by those "<a> </a>",
>>>> but only if those "<a> </a>" locate **in some distance** behind
>>>> "true_head". That is, I expect to result to be like this:
>>>>
>>>>     >>> import re
>>>>     >>> result = re.findall("the_regex",string)
>>>>     >>> print result
>>>>     ["ccc","ddd","eee"]
>>>>
>>>> How can I write a regex to match that?
>>>> I have try to use the **positive lookbehind assertion** in python
>>>> regex,
>>>> but it does not allowed variable length of lookbehind.
>>>>
>>>> Thanks in advance,
>>>> Ruan
>>>
>>> don't try to use regex to parse html it wont work reliably i am
>>> surprised no one has mentioned beautifulsoup yet, which is probably
>>> what you require.
>>
>> Nothing in the question indicates that the data is HTML.
>
> the <a></a> tags are a prety good indicator though

I can see how they point that way, but to me that alone seemed pretty
weak.

> even if it is not HTML the same advise stands for XML (the quote
> example would be invalid if it was XML)

It's not valid HTML either, for similar reasons. Or is it? I don't even
want to know.

> if it is neither for these formats but still using a similar tag
> structure then I would say that Reg ex is still unsuitable & the OP
> would need to write a full parser for the format if one does not
> already exist

That depends on details that weren't provided.

I work with a data format that mixes element tags with line-oriented
data records, and having a dedicated parser would be more of a hassle.
A couple of very simple regexen are useful in making sure that start
tags have a valid form and extracting attribute-value pairs from them.
I'm not at all experiencing "two problems" here. Some uses of regex are
good.

(And now I may be about to experience the third problem. That makes me
sad.)

Anyway, I think you and another person guessed correctly that the OP is
indeed really considering HTML, and then your suggestion is certainly
helpful.

-- 
https://mail.python.org/mailman/listinfo/python-list
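
To make the "couple of very simple regexen" remark concrete, here is a minimal sketch of what such a pair might look like. The tag shape (a name followed by double-quoted attribute="value" pairs), the names START_TAG, ATTRIBUTE and parse_start_tag, and the sample tag are all assumptions made for illustration; the actual data format is not described in this thread.

```python
import re

# Assumed start-tag shape: <name attr="value" attr2="value2">.
# This is a hypothetical format, not the one mentioned in the message.
START_TAG = re.compile(
    r'<([A-Za-z][\w-]*)'                      # element name
    r'((?:\s+[A-Za-z_][\w-]*="[^"<>]*")*)'    # zero or more attr="value" pairs
    r'\s*>'                                   # optional space, closing bracket
)
ATTRIBUTE = re.compile(r'([A-Za-z_][\w-]*)="([^"<>]*)"')


def parse_start_tag(line):
    """Return (name, attrs) for a well-formed start tag, else None."""
    match = START_TAG.fullmatch(line.strip())
    if match is None:
        return None
    name, attr_text = match.groups()
    # The first regex has already validated the attribute text,
    # so the second one only has to pull the pairs out.
    return name, dict(ATTRIBUTE.findall(attr_text))


print(parse_start_tag('<text id="t1" lang="en">'))
print(parse_start_tag('<text id=t1>'))
```

One regex checks the form of the whole tag and a second extracts the pairs, which stays manageable because start tags sit on lines of their own and never nest. Nested or free-form markup such as real HTML is a different matter, and a parser is the better tool there.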