On 2022-08-20 21:51:41 -0000, Jon Ribbens via Python-list wrote:
> On 2022-08-20, Stefan Ram <r...@zedat.fu-berlin.de> wrote:
> > Jon Ribbens <jon+use...@unequivocal.eu> writes:
> >>... or you could avoid all that faff and just do re.sub()?
> >
> > source = '<a name="b" href="http" accesskey="c"></a>'
> >
> > # Use Python to change the source, keeping the order of attributes.
> >
> > result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
> > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )

Depending on the content of the site, this might replace some stuff
which is not a link.

> You could go a bit harder with the regexp of course, e.g.:
>
>   result = re.sub(
>       r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",

This will fail on:

    <a alt="42 > 23" href="the.answer.html">

The problem can be solved with regular expressions (and given the
constraints I think I would prefer that to using Beautiful Soup), but
getting the regexps right is not trivial, at least in the general case.
It may become a lot easier if you know that certain conventions were
followed (e.g. that ">" was always written as "&gt;") or it may become
even harder when the files contain errors.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
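
To make the failure case concrete, here is a minimal sketch, not a
general solution. The function names, the replacement expression, and
the target "new.html" are assumptions for illustration only; the
quoted re.sub() call is cut off before its replacement argument, so
that part is guessed. The second pattern steps over quoted attribute
values as whole units, so a ">" inside a value no longer ends the
search. It still assumes balanced quotes and quoted href values, and
it will get confused by comments or broken markup.

```python
import re


def naive_replace(source, old, new):
    # Pattern as quoted in the thread, with OLD filled in via re.escape().
    # [^>]* cannot cross a ">" that sits inside an attribute value.
    pattern = (
        r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*"""
        + re.escape(old)
        + r"""\s*\2"""
    )
    # Assumed replacement: keep the prefix and the original quote character.
    return re.sub(
        pattern,
        lambda m: m.group(1) + m.group(2) + new + m.group(2),
        source,
    )


def quote_aware_replace(source, old, new):
    # Between "<a" and "href", consume either one character that is not
    # ">" or a quote, or a complete quoted value (which may contain ">").
    pattern = (
        r"""(<a\b(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*)(['"])\s*"""
        + re.escape(old)
        + r"""\s*\2"""
    )
    return re.sub(
        pattern,
        lambda m: m.group(1) + m.group(2) + new + m.group(2),
        source,
        flags=re.IGNORECASE,
    )


tricky = '<a alt="42 > 23" href="the.answer.html">'

# The quoted pattern finds nothing, so the text comes back unchanged.
print(naive_replace(tricky, "the.answer.html", "new.html"))

# The quote-aware pattern rewrites only the href value.
print(quote_aware_replace(tricky, "the.answer.html", "new.html"))
```

Both functions leave the attribute order and the original quote style
alone, which was the constraint earlier in the thread.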
-- https://mail.python.org/mailman/listinfo/python-list