(Oops, accidentally only sent to Chris instead of to the list)
On 24/10/2022 at 10:02, Chris Angelico wrote:
On Mon, 24 Oct 2022 at 18:43, Roel Schroeven <r...@roelschroeven.net>
wrote:
> Using html5lib (install package html5lib) instead of html.parser seems
> to do the trick: it inserts </li> right before the next <li>, and one
> before the closing </ol> . On my system the same happens when I don't
> specify a parser, but IIRC that's a bit fragile because other systems
> can choose different parsers if you don't explicitly specify one.
>
Ah, cool. Thanks. I'm not entirely sure of the various advantages and
disadvantages of the different parsers; is there a tabulation
anywhere, or at least a list of recommendations on choosing a suitable
parser?
There's a bit of information here:
https://beautiful-soup-4.readthedocs.io/en/latest/#installing-a-parser
Not much but maybe it can be helpful.
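To make the difference concrete, here is a minimal sketch that feeds the same unclosed-<li> markup to two parsers and counts how many <li> elements end up as direct children of the <ol>. The SAMPLE string and the direct_li_count() helper are made up for illustration; it assumes the beautifulsoup4 and html5lib packages are installed.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4 html5lib

# Hypothetical sample: list items without closing tags.
SAMPLE = "<ol><li>one<li>two</ol>"


def direct_li_count(markup: str, parser: str) -> int:
    """Count <li> elements that are direct children of the first <ol>."""
    # Name the parser explicitly so the result does not depend on
    # whichever parser happens to be installed on the machine.
    soup = BeautifulSoup(markup, parser)
    return len(soup.ol.find_all("li", recursive=False))


for parser in ("html.parser", "html5lib"):
    print(parser, direct_li_count(SAMPLE, parser))
```

As far as I can tell, html.parser nests the second <li> inside the first, so it reports one direct child, while html5lib closes each item and reports two. Adding "lxml" to the tuple should work the same way if lxml is installed.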
I'm dealing with a HUGE mess of different coding standards, all the
way from 1990s-level stuff (images for indentation, tables for
formatting, and <FONT FACE="Wingdings">) up through HTML4 (a good few
of the pages have at least some <meta> tags and declare their
encodings, mostly ISO-8859-1 or similar), to fairly modern HTML5.
There's even a couple of pages that use frames - yes, the old style
with a <noframes> block in case the browser can't handle it. I went
with html.parser on the expectation that it'd give the best "across
all standards" results, but I'll give html5lib a try and see if it
does better.
Would rather not try to use different parsers for different files, but
if necessary, I'll figure something out.
(For reference, this is roughly 9000 HTML files that have to be
parsed. Doing things by hand is basically not an option.)
I'd give lxml a try too. You could also try preprocessing the HTML with
html-tidy (https://www.html-tidy.org/); that might actually do a pretty
good job of getting rid of all kinds of historical inconsistencies.
Either way, checking whether any solution works for thousands of input
files will always be a pain, I'm afraid.
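One way to take some of the pain out of it might be a small harness that runs a check over every file and only reports the ones that fail. This is just a sketch using the standard library; find_failures() and the check callable are hypothetical names, and what counts as "works" is something you'd have to define yourself.

```python
from pathlib import Path
from typing import Callable, Iterable


def find_failures(
    root: Path,
    check: Callable[[bytes], bool],
    patterns: Iterable[str] = ("*.html", "*.htm"),
) -> list[tuple[Path, str]]:
    """Run check() on every matching file under root and collect the failures."""
    failures = []
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            try:
                # Read raw bytes so the check (or the parser inside it)
                # can deal with each file's declared encoding itself.
                ok = check(path.read_bytes())
            except Exception as exc:
                # A crash in the check is also worth recording.
                failures.append((path, f"{type(exc).__name__}: {exc}"))
            else:
                if not ok:
                    failures.append((path, "check returned False"))
    return failures
```

The check could, for instance, parse the bytes with the parser under test and return False when an <li> still has another <li> nested inside it. That way only the failing files out of the 9000 need a look by hand.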
--
"I've come up with a set of rules that describe our reactions to technologies:
1. Anything that is in the world when you’re born is normal and ordinary and is
just a natural part of the way the world works.
2. Anything that's invented between when you’re fifteen and thirty-five is new
and exciting and revolutionary and you can probably get a career in it.
3. Anything invented after you're thirty-five is against the natural order of
things."
-- Douglas Adams, The Salmon of Doubt
--
https://mail.python.org/mailman/listinfo/python-list