On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> Ron has already noted that the lxml and html5 parser do the right thing,
> so just for the record:
>
> The HTML fragment above is well-formed and contains a number of li
> elements at the same level directly below the ol element, not lots of
> nested li elements. The end tag of the li element is optional (except in
> XHTML) and li elements don't nest.

That's correct. However, parsing it with html.parser and then reconstituting it as shown in the example code results in all the </li> tags ending up right before the </ol>, which indicates that the <li> tags were parsed as deeply nested rather than as siblings. To get a successful parse out of this, I need something that sees them as siblings, which html5lib seems to do fine. Whether it has other issues, I don't know, but I guess I'll find out... It's currently running on the live site and taking several hours (due to network delays and the server being slow, so I don't really want to parallelize and overload the thing).

ChrisA
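
To illustrate the sibling-versus-nested difference, here is a minimal sketch using only the standard library's html.parser module, which reports tag events and does not build a tree itself. The class LiDepthTracker, the function li_nesting, and the sample FRAGMENT are hypothetical names made up for this illustration. The implied-end-tag handling is a simplified assumption, not what html5lib or lxml actually do internally.

```python
from html.parser import HTMLParser


class LiDepthTracker(HTMLParser):
    """Hypothetical helper: records how many <li> ancestors each <li> has."""

    def __init__(self, imply_li_end=True):
        super().__init__()
        self.imply_li_end = imply_li_end
        self.stack = []   # currently open elements (void elements not special-cased)
        self.depths = []  # number of open <li> ancestors for each <li> seen

    def handle_starttag(self, tag, attrs):
        if tag == "li" and self.imply_li_end:
            # Simplified rule: a new <li> closes an open <li> in the same list.
            # Stop searching at the enclosing list so nested lists still work.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i] in ("ol", "ul"):
                    break
                if self.stack[i] == "li":
                    del self.stack[i:]  # implied </li>
                    break
        if tag == "li":
            self.depths.append(self.stack.count("li"))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close everything up to and including the nearest matching open tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i] == tag:
                del self.stack[i:]
                break


def li_nesting(markup, imply_li_end):
    """Return the <li> nesting depth of every <li> in the markup."""
    tracker = LiDepthTracker(imply_li_end=imply_li_end)
    tracker.feed(markup)
    tracker.close()
    return tracker.depths


# Sample fragment with the optional </li> end tags omitted (an assumption;
# the fragment from the original thread is not reproduced here).
FRAGMENT = "<ol><li>one<li>two<li>three</ol>"

print(li_nesting(FRAGMENT, imply_li_end=True))
print(li_nesting(FRAGMENT, imply_li_end=False))
```

With the implied end tags applied, every li in the sample reports depth 0, meaning they are siblings. With imply_li_end=False, each li is treated as a child of the previous one, which is the shape that puts all the </li> tags right before the </ol> on output.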