On 24/10/2022 at 4:29, Chris Angelico wrote:
Parsing ancient HTML files is something Beautiful Soup is normally
great at. But I've run into a small problem, caused by this sort of
sloppy HTML:
from bs4 import BeautifulSoup
# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
blob = b"""
<OL>
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
<LI>Said Ida;' let us down and rest:' and we
<LI>Down from the lean and wrinkled precipices,
<LI>By every coppice-feather'd chasm and cleft,
<LI>Dropt thro' the ambrosial gloom to where below
<LI>No bigger than a glow-worm shone the tent
<LI>Lamp-lit from the inner. Once she lean'd on me,
<LI>Descending; once or twice she lent her hand,
<LI>And blissful palpitations in the blood,
<LI>Stirring a sudden transport rose and fell.
</OL>
"""
soup = BeautifulSoup(blob, "html.parser")
print(soup)
On this small snippet, it works acceptably, but puts a large number of
</li> tags immediately before the </ol>. On the original file (see
link if you want to try it), this blows right through the default
recursion limit, due to the crazy number of "nested" list items.
Is there a way to tell BS4 on parse that these <li> elements end at
the next <li>, rather than waiting for the final </ol>? This would
make tidier output, and also eliminate most of the recursion levels.
Using html5lib (install package html5lib) instead of html.parser seems
to do the trick: it inserts </li> right before the next <li>, and one
before the closing </ol>. On my system the same happens when I don't
specify a parser, but IIRC that's a bit fragile because other systems
can choose a different parser if you don't explicitly specify one.
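
For what it's worth, here is a rough sketch of how one might compare the
two parsers on a shortened version of the snippet. It assumes both
beautifulsoup4 and html5lib are installed; max_li_depth is just a helper
name I made up for this illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4 html5lib

# Shortened version of the snippet above, still with unclosed <LI> tags.
blob = b"""
<OL>
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
<LI>Said Ida;' let us down and rest:' and we
</OL>
"""


def max_li_depth(soup):
    """Hypothetical helper: deepest nesting of <li> inside other <li> elements."""
    return max(
        (len(li.find_parents("li")) for li in soup.find_all("li")),
        default=0,
    )


# Name the parser explicitly so the result does not depend on what
# happens to be installed on the machine running the script.
nested = BeautifulSoup(blob, "html.parser")
flat = BeautifulSoup(blob, "html5lib")

print("html.parser <li> nesting depth:", max_li_depth(nested))
print("html5lib <li> nesting depth:", max_li_depth(flat))
print(flat.ol)
```

Note that html5lib builds a complete document (it adds html, head and
body elements around the fragment), so printing flat.ol rather than the
whole soup keeps the output comparable with the original.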
--
"I love science, and it pains me to think that so many are terrified
of the subject or feel that choosing science means you cannot also
choose compassion, or the arts, or be awed by nature. Science is not
meant to cure us of mystery, but to reinvent and reinvigorate it."
-- Robert Sapolsky