On 22/08/2022 05:30, Chris Angelico wrote:
> On Mon, 22 Aug 2022 at 10:04, Buck Evan <buck.2...@gmail.com> wrote:
>> I've had much success doing round trips through the lxml.html parser.
>> https://lxml.de/lxmlhtml.html
>>
>> I ditched bs4 for lxml long ago and never regretted it.
>>
>> If you find that you have a bunch of invalid html that lxml inadvertently
>> "fixes", I would recommend adding a stutter-step to your project: perform
>> a no-op roundtrip through lxml on all files. I'd then analyze any diff by
>> progressively excluding changes via `grep -vP`.
>>
>> Unless I'm mistaken, all such changes should fall into no more than a
>> dozen groups.
> Will this round-trip mutate every single file and reorder the tag
> attributes? Because I really don't want to manually eyeball all those
> changes.
Most certainly not. Reordering is a bs4 feature governed by a formatter.
You can easily prevent attributes from being reordered:
>>> import bs4
>>> soup = bs4.BeautifulSoup("""<div beta="1" alpha="2"/>""")
>>> soup
<html><body><div alpha="2" beta="1"></div></body></html>
>>> class Formatter(bs4.formatter.HTMLFormatter):
...     def attributes(self, tag):
...         return [] if tag.attrs is None else list(tag.attrs.items())
...
>>> soup.decode(formatter=Formatter())
'<html><body><div beta="1" alpha="2"></div></body></html>'
Blank space is probably removed by the underlying html parser.
It might be possible to make bs4 instantiate the lxml.html.HTMLParser
with remove_blank_text=False, but I didn't try hard enough ;)
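For what it's worth, the flag does exist on lxml's parser; a quick standalone check, independent of bs4 (lxml assumed installed), shows its effect:

```python
# Compare lxml's HTML parser with and without remove_blank_text to see
# whether whitespace-only text nodes survive parsing.
from lxml import etree

doc = "<html><body>  <div>x</div>  </body></html>"
keep = etree.tostring(etree.fromstring(doc, etree.HTMLParser()))
drop = etree.tostring(etree.fromstring(doc,
                      etree.HTMLParser(remove_blank_text=True)))
print(keep)  # whitespace between tags preserved (the default)
print(drop)  # whitespace-only text nodes discarded by the parser
```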
That said, for my humble html scraping needs I have ditched bs4 in favor
of lxml and its xpath capabilities.
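As an illustration of that style (the page and queries here are invented for the example):

```python
# Scraping with lxml + xpath instead of bs4 tree navigation.
from lxml import html

page = html.fromstring(
    '<html><body>'
    '<a href="/home" class="nav">Home</a>'
    '<a href="/docs">Docs</a>'
    '</body></html>'
)
# Text of every link, then the hrefs of links lacking a class attribute.
print(page.xpath('//a/text()'))              # ['Home', 'Docs']
print(page.xpath('//a[not(@class)]/@href'))  # ['/docs']
```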
--
https://mail.python.org/mailman/listinfo/python-list