Re: Mutating an HTML file with BeautifulSoup

Barry Sun, 21 Aug 2022 07:51:41 -0700


> On 21 Aug 2022, at 09:12, Chris Angelico <[email protected]> wrote:
> 
> On Sun, 21 Aug 2022 at 17:26, Barry <[email protected]> wrote:
>> 
>> 
>> 
>>>> On 19 Aug 2022, at 22:04, Chris Angelico <[email protected]> wrote:
>>> 
>>> On Sat, 20 Aug 2022 at 05:12, Barry <[email protected]> wrote:
>>>> 
>>>> 
>>>> 
>>>>>> On 19 Aug 2022, at 19:33, Chris Angelico <[email protected]> wrote:
>>>>> 
>>>>> What's the best way to precisely reconstruct an HTML file after
>>>>> parsing it with BeautifulSoup?
>>>> 
>>>> As I recall, bs4 parses into an object tree and loses the detail of the
>>>> input.
>>>> I recently ported from a very old bs to bs4 and hit the same issue.
>>>> So no, it will not output the same as went in.
>>>> 
>>>> If you can trust the input to parse as XML, meaning all the rules for
>>>> closing tags have been followed, then I think you can parse and unparse
>>>> through XML to do what you want.
>>>> 
>>> 
>>> 
>>> Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
>>> well. Thanks for trying, anyhow.
>>> 
>>> So I'm left with a few options:
>>> 
>>> 1) Give up on validation, give up on verification, and just run this
>>> thing on the production site with my fingers crossed
>> 
>> Can you build a beta site with the original intact?
> 
> In a naive way, a full copy would be quite a few gigabytes. I could
> cut that down a good bit by taking only HTML files and the things they
> reference, but then we run into the same problem of broken links,
> which is what we're here to solve in the first place.
> 
> But I would certainly not want to run two copies of the site and then
> manually compare.
> 
>> I also wonder if using Selenium to walk the site might work as a
>> verification step.
>> I cannot recall whether you can capture an image of the browser window to
>> compare against and look for rendering differences.
> 
> Image recognition won't necessarily even be valid; some of the changes
> will have visual consequences (eg a broken image reference now
> becoming correct), and as soon as that happens, the whole document can
> reflow.
> 
>> From my one task using bs4 I did not see it produce any bad results.
>> In my case the problems were in the code that built on bs1 using bad
>> assumptions.
> 
> Did that get run on perfect HTML, or on messy real-world stuff that
> uses quirks mode?


A small number of messy HTML pages.
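
For reference, a minimal sketch of the round-trip loss discussed earlier in the thread: parse a page with bs4, write it straight back out without touching the tree, and diff the result against the input. The helper names (`roundtrip`, `diff_lines`), the sample markup, and the choice of the `html.parser` backend are all assumptions for illustration; other backends (lxml, html5lib) will probably normalise differently.

```python
import difflib


def roundtrip(html: str) -> str:
    """Parse with bs4 and serialise straight back, without touching the tree."""
    # Imported lazily so the diff helper below still works without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")  # backend choice is an assumption
    return str(soup)


def diff_lines(before: str, after: str) -> list[str]:
    """Return unified-diff lines between two versions of a document."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="original",
            tofile="roundtrip",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical messy HTML 4 fragment: uppercase tags, unquoted attribute,
    # unclosed <P>.
    messy = "<P ALIGN=center>Hello<BR>world\n"
    for line in diff_lines(messy, roundtrip(messy)):
        print(line)
```

Any diff this prints for an untouched tree is noise that would also show up when diffing a real mutation, which is why the output cannot simply be compared byte for byte with the original.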

Barry

> 
> ChrisA
> -- 
> https://mail.python.org/mailman/listinfo/python-list
