On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote: > Oh, a further thought... > > On Thursday 05 May 2016 16:46, Stephen Hansen wrote: > > I don't even care about faster: Its overly complicated. Sometimes a > > regular expression really is the clearest way to solve a problem. > > Putting non-ASCII letters aside for the moment, how would you match these > specs as a regular expression?
I don't know, but mostly because I wouldn't even try. The requirements are over-specified. If you look at the OP's data (and based on previous conversation), he's doing web scraping and trying to pull out good data. There's no absolutely perfect way to do that because the system he's scraping isn't meant for data processing. The data isn't cleanly articulated. Instead, he wants a heuristic to pull out what look like section titles. The OP looked at the data and came up with a simple set of rules that identify these section titles: >> Want to keep all elements containing only upper case letters or upper case letters and ampersand (where ampersand is surrounded by spaces) This translates naturally into a simple regular expression: an uppercase string with spaces and &'s. Now, that expression doesn't 100% encode every detail of that rule-- it allows both Q&A and Q & A-- but on my own looking at the data, I suspect its good enough. The titles are clearly separate from the other data scraped by their being upper cased. We just need to expand our allowed character range into spaces and &'s. Nothing in the OP's request demands the kind of rigorous matching that your scenario does. Its a practical problem with a simple, practical answer. -- Stephen Hansen m e @ i x o k a i . i o -- https://mail.python.org/mailman/listinfo/python-list