Re: Mutating an HTML file with BeautifulSoup

Jon Ribbens via Python-list Mon, 22 Aug 2022 07:58:01 -0700

On 2022-08-21, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> On 2022-08-20 21:51:41 -0000, Jon Ribbens via Python-list wrote:
>> On 2022-08-20, Stefan Ram <r...@zedat.fu-berlin.de> wrote:
>> > Jon Ribbens <jon+use...@unequivocal.eu> writes:
>> >>... or you could avoid all that faff and just do re.sub()?
>
>> > source = '<a name="b" href="http" accesskey="c"></a>'
>> >
>> > # Use Python to change the source, keeping the order of attributes.
>> >
>> > result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
>> > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )
>
> Depending on the content of the site, this might replace some stuff
> which is not a link.
>
>> You could go a bit harder with the regexp of course, e.g.:
>> 
>>   result = re.sub(
>>       r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
>
> This will fail on:
>     <a alt="42 > 23" href="the.answer.html">


I've seen *a lot* of bad/broken/weird HTML over the years, and I don't
believe I've ever seen anyone do that. (Wrongly putting an 'alt'
attribute on an 'a' element is very common, on the other hand ;-) )
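
For anyone who wants to reproduce the case quoted above, here is a minimal
sketch. The pattern is the one from the quote. The replacement string, the
literal NEW, and the helper name rewrite() are assumptions, since the rest of
the original re.sub() call was trimmed from the quote.

```python
import re

# Pattern exactly as quoted; everything after it in the original call was
# trimmed, so the replacement below is an assumed completion.
PATTERN = re.compile(r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""")


def rewrite(source: str) -> str:
    """Replace an href value of OLD with NEW, keeping the original quote style."""
    return PATTERN.sub(r"\1\2NEW\2", source)


plain = '<a name="b" href="OLD" accesskey="c"></a>'
tricky = '<a alt="42 > 23" href="OLD">'

print(rewrite(plain))   # href is rewritten, attribute order is kept
print(rewrite(tricky))  # unchanged: [^>]* stops at the '>' inside alt="..."
```

The second input is left untouched because [^>]* cannot cross the '>' inside
the quoted alt value, so the pattern never reaches the href attribute.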

> The problem can be solved with regular expressions (and given the
> constraints I think I would prefer that to using Beautiful Soup), but
> getting the regexps right is not trivial, at least in the general case.

I would like to see the regular expression that could fully parse
general HTML...
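
For comparison, a tokenizer that understands quoted attribute values copes
with the awkward input. The following is only a sketch using the standard
library's html.parser; the names HrefCollector and collect_hrefs are made up
for illustration. It reads href values but does not rewrite the file, so it
does not by itself meet the thread's requirement of leaving the rest of the
markup untouched.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes removed.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def collect_hrefs(source: str) -> list:
    parser = HrefCollector()
    parser.feed(source)
    parser.close()
    return parser.hrefs


print(collect_hrefs('<a alt="42 > 23" href="the.answer.html">'))
```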
-- 
https://mail.python.org/mailman/listinfo/python-list
