Parsing ancient HTML files is something Beautiful Soup is normally great at. But I've run into a small problem, caused by this sort of sloppy HTML:
from bs4 import BeautifulSoup

# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
blob = b"""
<OL>
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
<LI>Said Ida;' let us down and rest:' and we
<LI>Down from the lean and wrinkled precipices,
<LI>By every coppice-feather'd chasm and cleft,
<LI>Dropt thro' the ambrosial gloom to where below
<LI>No bigger than a glow-worm shone the tent
<LI>Lamp-lit from the inner. Once she lean'd on me,
<LI>Descending; once or twice she lent her hand,
<LI>And blissful palpitations in the blood,
<LI>Stirring a sudden transport rose and fell.
</OL>
"""
soup = BeautifulSoup(blob, "html.parser")
print(soup)

On this small snippet, it works acceptably, but puts a large number of
</li> tags immediately before the </ol>. On the original file (see
link if you want to try it), this blows right through the default
recursion limit, due to the crazy number of "nested" list items.

Is there a way to tell BS4 on parse that these <li> elements end at
the next <li>, rather than waiting for the final </ol>? This would
make tidier output, and also eliminate most of the recursion levels.

ChrisA
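
One possible workaround for the question above, sketched with the standard library only: insert the missing </li> tags before the markup is handed to the parser, so that each <li> is already closed when the next one starts. The helper name close_list_items is hypothetical and is not part of Beautiful Soup. The sketch assumes the input is already decoded text, and it uses a regular expression rather than a real tokenizer, so list tags inside comments or scripts would be rewritten too.

```python
import re

# Matches only the tags this sketch cares about:
# <li>, </li>, <ol>, </ol>, <ul>, </ul> (any case, with or without attributes).
_TAG = re.compile(r"<(/?)(li|ol|ul)\b[^>]*>", re.IGNORECASE)


def close_list_items(html: str) -> str:
    """Hypothetical helper: add the implied </li> before the next <li>
    and before the end of the enclosing list."""
    out = []
    stack = []  # one flag per open list: is an <li> currently open in it?
    pos = 0
    for match in _TAG.finditer(html):
        # Copy the text between the previous tag and this one unchanged.
        out.append(html[pos:match.start()])
        pos = match.end()
        closing = bool(match.group(1))
        name = match.group(2).lower()
        if name == "li":
            if closing:
                if stack:
                    stack[-1] = False  # item was closed explicitly
            elif stack:
                if stack[-1]:
                    out.append("</li>")  # close the previous item first
                stack[-1] = True
        elif closing:
            # End of a list: close its last item if it is still open.
            if stack and stack.pop():
                out.append("</li>")
        else:
            stack.append(False)  # a new list starts with no open item
        out.append(match.group(0))
    out.append(html[pos:])
    return "".join(out)


print(close_list_items("<OL>\n<LI>one\n<LI>two\n</OL>"))
```

With the snippet from the post, the result could then be parsed as before, for example BeautifulSoup(close_list_items(blob.decode("latin-1")), "html.parser"), where the latin-1 encoding is an assumption about the original file. Since every item is closed before the next one opens, the items should end up as siblings rather than nested. It may also be worth trying Beautiful Soup's "html5lib" parser, if installed, since it follows the browser rules under which a new <li> closes the previous one.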