On 2022-08-22 00:09:01 -0000, Jon Ribbens via Python-list wrote: > On 2022-08-21, Peter J. Holzer <hjp-pyt...@hjp.at> wrote: > > On 2022-08-20 21:51:41 -0000, Jon Ribbens via Python-list wrote: > >> result = re.sub( > >> r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""", > > > > This will fail on: > > <a alt="42 > 23" href="the.answer.html"> > > I've seen *a lot* of bad/broken/weird HTML over the years, and I don't > believe I've ever seen anyone do that. (Wrongly putting an 'alt' > attribute on an 'a' element is very common, on the other hand ;-) )
My bad. I meant title, not alt, of course. The unescaped > is completely standard conforming HTML, however (both HTML 4.01 strict and HTML 5). You almost never have to escape > - in fact I can't think of any case right now - and I generally don't (sometimes I do for symmetry with <, but that's an aesthetic choice, not a technical one). > > The problem can be solved with regular expressions (and given the > > constraints I think I would prefer that to using Beautiful Soup), but > > getting the regexps right is not trivial, at least in the general case. > > I would like to see the regular expression that could fully parse > general HTML... That depends on what you mean by "parse". If you mean "construct a DOM tree", you can't since regular expressions (in the mathematical sense, not what's implemented by some programming languages) by definition describe finite automata, and those don't support recursion. But if you mean "split into a sequence of tags and PCDATA's (and then each tag further into its attributes)", that's absolutely possible, and that's all that is needed here. I don't think I have ever implemented a complete solution (if only because stuff like <![CDATA[...]]> is extremely rare), but I should have some Perl code lying around which worked on a wide variety of HTML. I just have to find it again ... hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | h...@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
signature.asc
Description: PGP signature
-- https://mail.python.org/mailman/listinfo/python-list