On 2022-08-21, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> On 2022-08-20 21:51:41 -0000, Jon Ribbens via Python-list wrote:
>> On 2022-08-20, Stefan Ram <r...@zedat.fu-berlin.de> wrote:
>> > Jon Ribbens <jon+use...@unequivocal.eu> writes:
>> >>... or you could avoid all that faff and just do re.sub()?
>
>> > source = '<a name="b" href="http" accesskey="c"></a>'
>> >
>> > # Use Python to change the source, keeping the order of attributes.
>> >
>> > result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
>> > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )
>
> Depending on the content of the site, this might replace some stuff
> which is not a link.
>
>> You could go a bit harder with the regexp of course, e.g.:
>>
>> result = re.sub(
>>     r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
>
> This will fail on:
> <a alt="42 > 23" href="the.answer.html">

I've seen *a lot* of bad/broken/weird HTML over the years, and I don't
believe I've ever seen anyone do that. (Wrongly putting an 'alt'
attribute on an 'a' element is very common, on the other hand ;-) )

> The problem can be solved with regular expressions (and given the
> constraints I think I would prefer that to using Beautiful Soup), but
> getting the regexps right is not trivial, at least in the general case.

I would like to see the regular expression that could fully parse
general HTML...
-- 
https://mail.python.org/mailman/listinfo/python-list
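
For anyone who wants to try the case discussed above, here is a minimal
sketch. The URLs, the names OLD, NEW, quoted, hardened and rewrite, and
the replacement logic are all assumptions made up for illustration: the
thread only gives the pattern with an OLD placeholder, and the quoted
re.sub() call is cut off before its remaining arguments. The "hardened"
pattern is just one possible way to step over quoted attribute values.
It is not from the thread and it is still nowhere near an HTML parser.

```python
import re

# Hypothetical values: the thread only says OLD, and the quoted re.sub()
# call is cut off before the replacement argument.
OLD = "http://example.com/"
NEW = "https://example.com/"

# The pattern as quoted in the thread, with OLD filled in.
quoted = re.compile(
    r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*""" + re.escape(OLD) + r"""\s*\2"""
)

# One possible hardening (an assumption for illustration): consume whole
# quoted attribute values so a '>' inside them cannot end the tag early.
hardened = re.compile(
    r"""(<\s*a\s+(?:[^>"']|"[^"]*"|'[^']*')*?href\s*=\s*)(['"])\s*"""
    + re.escape(OLD) + r"""\s*\2"""
)

def rewrite(pattern, source):
    # Keep everything up to the opening quote, then put NEW between the
    # same quote character the source used.
    return pattern.sub(
        lambda m: m.group(1) + m.group(2) + NEW + m.group(2), source
    )

plain = '<a name="b" href="http://example.com/" accesskey="c"></a>'
tricky = '<a alt="42 > 23" href="http://example.com/">'

print(rewrite(quoted, plain))     # href rewritten, attribute order kept
print(rewrite(quoted, tricky))    # unchanged: [^>]* stops at the '>' in alt
print(rewrite(hardened, tricky))  # href rewritten despite the '>' in alt
```

Even the hardened version can still match text that is not really an
'a' start tag (inside a comment or a script block, say), which is the
"not trivial in the general case" part.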