Hello,

On Tue, Dec 7, 2021 at 20:08, Chris Angelico (ros...@gmail.com) wrote:
> On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton
> <juliushamilton...@gmail.com> wrote:
> >
> > Hey,
> >
> > Could anyone please comment on the purest way simply to strip HTML tags
> > from the internal text they surround?
> >
> > I know Beautiful Soup is a convenient tool, but I’m interested to know what
> > the most minimal way to do it would be.
>
> That's definitely the best and most general way, and would still be my
> first thought most of the time.
>
> > People say you usually don’t use Regex for a second order language like
> > HTML, so I was thinking about using xpath or lxml, which seem like very
> > pure, universal tools for the job.
> >
> > I did find an example for doing this with the re module, though.
> >
> > Would it be fair to say that to just strip the tags, Regex is fine, but you
> > need to build a tree-like object if you want the ability to select which
> > nodes to keep and which to discard?
>
> Obligatory reference:
>
> https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
>
> > Can xpath / lxml do that?
> >
> > What are the chief differences between xpath / lxml and Beautiful Soup?
>
> I've never directly used lxml, mainly because bs4 offers all the same
> advantages and more, with about the same costs. However, if you're
> looking for a no-external-deps option, Python *does* include an HTML
> parser in the standard library:

But isn't bs4 only for SOAP content?
Can bs4 or lxml cope with HTML code that does not comply with XML, such as
the following fragment?

<p>A
<p>B
<hr>

BR,
Roland

> https://docs.python.org/3/library/html.parser.html
>
> If your purpose is extremely simple (like "strip tags, search for
> text"), then it should be easy enough to whip up something using that
> module. No external deps, not a lot of code, pretty straight-forward.
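
For reference, the "strip tags" case described above might look roughly like the sketch below. It uses only html.parser from the standard library; the names TagStripper and strip_tags are made up for illustration and are not part of any library. The parser does not require well-formed XML, so unclosed <p> tags and a bare <hr> do not raise an error.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keep the text nodes, discard every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of markup with all tags removed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()  # flush any text still held in the parser's buffer
    return "".join(parser.chunks)


if __name__ == "__main__":
    # Unclosed <p> tags and a bare <hr> are accepted without complaint.
    print(repr(strip_tags("<p>A <p>B <hr>")))
```

Note that handle_data also receives the contents of <script> and <style> elements, so a real tag stripper would probably want to skip those.
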
> On the other hand, if you're trying to do an "HTML to text"
> conversion, you'd probably need to be aware of which tags are
> block-level and which are inline content, so that (for instance)
> "<div>Hello</div> <div>world</div>" would come out as two separate
> paragraphs of text, whereas the same thing with <b> tags would become
> just "Hello world". But for the most part, handle_data will probably
> do everything you need.
>
> ChrisA
> --
> https://mail.python.org/mailman/listinfo/python-list
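
The block-level versus inline distinction mentioned above could be sketched as follows, again with only html.parser. This is an illustration under assumptions, not a complete converter: BLOCK_TAGS is a small hand-picked subset chosen for the example, and BlockAwareText and html_to_text are hypothetical names.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative subset of block-level tags, not a full list.
BLOCK_TAGS = {"p", "div", "hr", "li", "h1", "h2", "h3", "blockquote", "pre"}


class BlockAwareText(HTMLParser):
    """Hypothetical helper: text nodes plus a break at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(markup):
    """Return text with block-level elements separated by blank lines."""
    parser = BlockAwareText()
    parser.feed(markup)
    parser.close()
    raw_lines = "".join(parser.chunks).split("\n")
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in raw_lines]
    return "\n\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<div>Hello</div> <div>world</div>"))
    print(html_to_text("<b>Hello</b> <b>world</b>"))
```

With this sketch the two <div> elements come out as separate paragraphs, while the <b> version stays on one line as "Hello world".
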