On Sun, 21 Aug 2022 at 03:27, Stefan Ram <r...@zedat.fu-berlin.de> wrote:
>
> 2qdxy4rzwzuui...@potatochowder.com writes:
> >textual representations.  That way, the following two elements are the
> >same (and similar with a collection of sub-elements in a different order
> >in another document):
>
>   The /elements/ differ. They have the /same/ infoset.

That's the bit that's hard to prove.

>   The OP could edit the files with regexps to create a new version.

To you and Jon, who also suggested this: how would that be beneficial?
With Beautiful Soup, I have the line number and position within the
line where the tag starts; what does a regex give me that I don't have
that way?

>   Soup := BeautifulSoup.
>
>   Then have Soup read both the new version and the old version.
>
>   Then have Soup also edit the old version read in, the same way as
>   the regexps did and verify that now the old version edited by
>   Soup and the new version created using regexps agree.
>
>   Or just use Soup as a tool to show the diffs for visual inspection
>   by having Soup read both the original version and the version edited
>   with regexps. Now both are normalized by Soup and Soup can show the
>   diffs (such a diff feature might not be a part of Soup, but it should
>   not be too much effort to write one using Soup).
>

But as mentioned, the entire problem *is* the normalization, as I have
no proof that it has had no impact on the rendering of the page.
Comparing two normalized versions is no better than my original option
1, whereby I simply ignore the normalization and write out the
reconstructed content.

It's easy if you know for certain that the page is well-formed. Much
harder if you do not - or, as in some cases, if you know the page is
badly-formed.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
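
For illustration only, here is a rough sketch of the "use the tag's position, skip the normalization" idea discussed above. It is not the code from this thread: it swaps in the standard library's html.parser in place of Beautiful Soup so that it runs without extra packages, and the names TagLocator, rewrite_start_tags and the sample markup are all made up for the example. The parser is used only to find where each start tag sits in the raw text; the edit is then spliced into the original string, so every character outside those spans is left exactly as it was, well-formed or not.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Record the raw-source span of every start tag with a given name.

    Hypothetical helper: only the parser's position reporting is used,
    never its (normalized) view of the document.
    """

    def __init__(self, source, tag_name):
        super().__init__(convert_charrefs=True)
        self.tag_name = tag_name
        # Offset at which each line begins. HTMLParser counts lines by "\n"
        # only, so the same rule is used here.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.spans = []  # list of (start, end) offsets into source
        self.feed(source)
        self.close()

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, whatever the source used.
        if tag == self.tag_name:
            line, col = self.getpos()  # 1-based line, 0-based column
            start = self.line_starts[line - 1] + col
            # get_starttag_text() is the tag exactly as written in the source.
            self.spans.append((start, start + len(self.get_starttag_text())))


def rewrite_start_tags(source, tag_name, edit):
    """Apply edit() to the raw text of each matching start tag, nothing else."""
    out = source
    # Work backwards so earlier offsets stay valid after each splice.
    for start, end in reversed(TagLocator(source, tag_name).spans):
        out = out[:start] + edit(out[start:end]) + out[end:]
    return out


# Deliberately sloppy markup: unclosed <P>, unquoted attribute, mixed case.
messy = "<P>Unclosed para\n<IMG SRC=a.png alt='x' >\n<p>second <img src=\"b.png\">\n"
fixed = rewrite_start_tags(messy, "img", lambda raw: raw.replace(".png", ".webp"))
print(fixed)
```

Because the output is built from slices of the original text, a plain textual diff between the input and the result shows only the edited tags, which sidesteps the question of whether a parse-and-reserialize round trip changed anything else. It does still rely on the parser reporting tag positions correctly for the badly-formed cases, which would need checking against the real pages.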