On 2022-10-24 13:29:13 +1100, Chris Angelico wrote:
> Parsing ancient HTML files is something Beautiful Soup is normally
> great at. But I've run into a small problem, caused by this sort of
> sloppy HTML:
>
> from bs4 import BeautifulSoup
> # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
> blob = b"""
> <OL>
> <LI>'THERE sinks the nebulous star we call the Sun,
> <LI>If that hypothesis of theirs be sound,'
[...]
> <LI>Stirring a sudden transport rose and fell.
> </OL>
> """
> soup = BeautifulSoup(blob, "html.parser")
> print(soup)
>
>
> On this small snippet, it works acceptably, but puts a large number of
> </li> tags immediately before the </ol>.

Ron has already noted that the lxml and html5lib parsers do the right
thing, so just for the record:

The HTML fragment above is well-formed. It contains a number of li
elements at the same level, directly below the ol element, not lots of
nested li elements. The end tag of the li element is optional (except in
XHTML), and li elements don't nest.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
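
A minimal sketch of the "end tag is optional, li elements don't nest"
rule, using only the standard library's html.parser.HTMLParser. The class
name ListItemCollector is made up for illustration, nested lists are
deliberately ignored, and the lines elided from the quoted snippet are
left out. It is not how lxml or html5lib work internally; it only shows
that a new <li> (or the closing </ol>) can be treated as the end of the
li that is currently open, which yields sibling items:

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text of each <li> as a sibling item.

    Hypothetical helper: a new <li>, or the end of the enclosing list,
    is treated as the implied end of the li that is currently open.
    Nested lists are not handled.
    """

    def __init__(self):
        super().__init__()
        self.items = []       # text of each completed li element
        self._current = None  # text chunks of the open li, or None

    def _close_item(self):
        # Finish the open li, if there is one.
        if self._current is not None:
            self.items.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "li":
            self._close_item()  # the previous li ends where this one starts
            self._current = []

    def handle_endtag(self, tag):
        if tag in ("li", "ol", "ul"):
            self._close_item()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


blob = """
<OL>
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
<LI>Stirring a sudden transport rose and fell.
</OL>
"""

collector = ListItemCollector()
collector.feed(blob)
collector.close()
for number, text in enumerate(collector.items, start=1):
    print(number, text)
```

Each line of verse ends up as its own entry in collector.items, with no
item containing the text of the ones that follow it.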
-- https://mail.python.org/mailman/listinfo/python-list