On 22/08/2022 05:30, Chris Angelico wrote:
On Mon, 22 Aug 2022 at 10:04, Buck Evan <buck.2...@gmail.com> wrote:

I've had much success doing round trips through the lxml.html parser.

https://lxml.de/lxmlhtml.html

I ditched bs for lxml long ago and never regretted it.

If you find that you have a bunch of invalid html that lxml inadvertently 
"fixes", I would recommend adding a stutter-step to your project: perform a 
noop roundtrip thru lxml on all files. I'd then analyze any diff by progressively 
excluding changes via `grep -vP`.
Unless I'm mistaken, all such changes should fall into no more than a dozen 
groups.
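The no-op round trip above can be sketched roughly like this (the script and its in-place-rewrite behavior are an assumption about the workflow, not something from the original post):

```python
# Minimal sketch of the "noop roundtrip": parse each file with lxml.html
# and write it straight back out, so a later `diff` against version
# control shows only the parser's own normalizations.
import sys
import lxml.html

def roundtrip(path):
    """Parse an HTML file and return lxml's re-serialization of it."""
    with open(path, "rb") as f:
        tree = lxml.html.fromstring(f.read())
    return lxml.html.tostring(tree)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        out = roundtrip(path)
        with open(path, "wb") as f:
            f.write(out)
```

One useful property to check: the round trip should be idempotent, i.e. running it a second time produces no further diff.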


Will this round-trip mutate every single file and reorder the tag
attributes? Because I really don't want to manually eyeball all those
changes.

Most certainly not. Reordering is a bs4 feature governed by a
formatter. You can easily prevent attributes from being reordered:

>>> import bs4
>>> soup = bs4.BeautifulSoup("""<div beta="1" alpha="2"/>""", "lxml")
>>> soup
<html><body><div alpha="2" beta="1"></div></body></html>
>>> class Formatter(bs4.formatter.HTMLFormatter):
...     def attributes(self, tag):
...         return [] if tag.attrs is None else list(tag.attrs.items())
...

>>> soup.decode(formatter=Formatter())
'<html><body><div beta="1" alpha="2"></div></body></html>'

Blank space is probably removed by the underlying html parser.
It might be possible to make bs4 instantiate the lxml.html.HTMLParser
with remove_blank_text=False, but I didn't try hard enough ;)
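For what it's worth, lxml's own parser does expose that option directly; here is a small standalone sketch (not wired into bs4, and the sample markup is made up) showing the flag on lxml.html.HTMLParser:

```python
import lxml.html

markup = "<div>  <p>hi</p>  </div>"

# Default behavior (remove_blank_text=False): whitespace-only text
# nodes between tags survive the round trip.
keep = lxml.html.fromstring(
    markup, parser=lxml.html.HTMLParser(remove_blank_text=False))
print(lxml.html.tostring(keep))

# With remove_blank_text=True the parser may drop such blank nodes
# (how aggressively depends on the underlying libxml2 version).
drop = lxml.html.fromstring(
    markup, parser=lxml.html.HTMLParser(remove_blank_text=True))
print(lxml.html.tostring(drop))
```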

That said, for my humble html scraping needs I have ditched bs4 in favor
of lxml and its xpath capabilities.
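As a small illustration of that xpath-based scraping style (the HTML snippet here is invented for the example):

```python
import lxml.html

doc = lxml.html.fromstring("""
<html><body>
  <ul>
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>
""")

# xpath() returns lists; @href selects attributes, text() text nodes.
links = doc.xpath("//li/a/@href")
texts = doc.xpath("//li/a/text()")
print(links)   # ['/a', '/b']
print(texts)   # ['First', 'Second']
```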


--
https://mail.python.org/mailman/listinfo/python-list