> On 21 Aug 2022, at 09:12, Chris Angelico <ros...@gmail.com> wrote:
>
> On Sun, 21 Aug 2022 at 17:26, Barry <ba...@barrys-emacs.org> wrote:
>>
>>>> On 19 Aug 2022, at 22:04, Chris Angelico <ros...@gmail.com> wrote:
>>>
>>> On Sat, 20 Aug 2022 at 05:12, Barry <ba...@barrys-emacs.org> wrote:
>>>>
>>>>>> On 19 Aug 2022, at 19:33, Chris Angelico <ros...@gmail.com> wrote:
>>>>>
>>>>> What's the best way to precisely reconstruct an HTML file after
>>>>> parsing it with BeautifulSoup?
>>>>
>>>> I recall that bs4 parses into an object tree and loses the detail of
>>>> the input.
>>>> I recently ported from very old bs to bs4 and hit the same issue.
>>>> So no, it will not output the same as went in.
>>>>
>>>> If you can trust the input to be parsed as xml, meaning all the rules of
>>>> closing tags have been followed, then I think you can parse and unparse
>>>> thru xml to do what you want.
>>>>
>>>
>>> Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
>>> well. Thanks for trying, anyhow.
>>>
>>> So I'm left with a few options:
>>>
>>> 1) Give up on validation, give up on verification, and just run this
>>> thing on the production site with my fingers crossed
>>
>> Can you build a beta site with the original intact?
>
> In a naive way, a full copy would be quite a few gigabytes. I could
> cut that down a good bit by taking only HTML files and the things they
> reference, but then we run into the same problem of broken links,
> which is what we're here to solve in the first place.
>
> But I would certainly not want to run two copies of the site and then
> manually compare.
>
>> Also wonder if using selenium to walk the site may work as a verification
>> step?
>> I cannot recall if you can get an image of the browser window to do image
>> compares with, to look for rendering differences.
>
> Image recognition won't necessarily even be valid; some of the changes
> will have visual consequences (eg a broken image reference now
> becoming correct), and as soon as that happens, the whole document can
> reflow.
>
>> From my one task using bs4 I did not see it produce any bad results.
>> In my case the problems were in the code that built on bs1 using bad
>> assumptions.
>
> Did that get run on perfect HTML, or on messy real-world stuff that
> uses quirks mode?

A small number of messy HTML pages.

Barry

>
> ChrisA
--
https://mail.python.org/mailman/listinfo/python-list