I've had much success doing round trips through the lxml.html parser. https://lxml.de/lxmlhtml.html
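
A rough sketch of such a round trip, together with a small helper for checking what it changed. The function names `roundtrip`, `changed_lines`, and `exclude` are made up for illustration. I'm assuming `lxml.html.document_fromstring` and `lxml.html.tostring` are the right entry points for whole documents, so adjust if your files are fragments.

```python
import difflib
import re


def roundtrip(html_text):
    """Parse and re-serialize without touching anything (the no-op pass)."""
    # Imported lazily so the diff helpers below also work without lxml.
    import lxml.html

    doc = lxml.html.document_fromstring(html_text)
    return lxml.html.tostring(doc, encoding="unicode")


def changed_lines(before, after):
    """Return only the added/removed lines of a unified diff."""
    diff = list(difflib.unified_diff(before.splitlines(), after.splitlines(),
                                     lineterm="", n=0))
    # The first two entries are the '---' / '+++' file headers;
    # hunk headers start with '@@' and are dropped by the prefix test.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


def exclude(lines, pattern):
    """Drop every line matching pattern, like one pass of `grep -vP`."""
    rx = re.compile(pattern)
    return [line for line in lines if not rx.search(line)]
```

For one file's text `src`, you would start from `remaining = changed_lines(src, roundtrip(src))` and then call `remaining = exclude(remaining, some_pattern)` once per kind of change you recognize, until nothing unexplained is left.
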
I ditched BeautifulSoup for lxml long ago and never regretted it. If you
find that you have a bunch of invalid HTML that lxml inadvertently
"fixes", I would recommend adding a stutter-step to your project: perform
a no-op round trip through lxml on all files first, before making any
real changes. I'd then analyze the resulting diff by progressively
excluding each kind of change I recognize via `grep -vP`. Unless I'm
mistaken, all such changes should fall into no more than a dozen groups.

On Fri, Aug 19, 2022, 1:34 PM Chris Angelico <ros...@gmail.com> wrote:

> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
> >>> print(soup)
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>
> >>>
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).
>
> ChrisA
> --
> https://mail.python.org/mailman/listinfo/python-list