On Wed, Dec 8, 2021 at 7:55 AM Roland Mueller
<roland.em0...@googlemail.com> wrote:
>
> Hello,
>
> On Tue, Dec 7, 2021 at 20:08, Chris Angelico (ros...@gmail.com) wrote:
>>
>> On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton
>> <juliushamilton...@gmail.com> wrote:
>> >
>> > Hey,
>> >
>> > Could anyone please comment on the purest way simply to strip HTML tags
>> > from the internal text they surround?
>> >
>> > I know Beautiful Soup is a convenient tool, but I’m interested to know what
>> > the most minimal way to do it would be.
>>
>> That's definitely the best and most general way, and would still be my
>> first thought most of the time.
>>
>> > People say you usually don’t use Regex for a second order language like
>> > HTML, so I was thinking about using xpath or lxml, which seem like very
>> > pure, universal tools for the job.
>> >
>> > I did find an example for doing this with the re module, though.
>> >
>> > Would it be fair to say that to just strip the tags, Regex is fine, but you
>> > need to build a tree-like object if you want the ability to select which
>> > nodes to keep and which to discard?
>>
>> Obligatory reference:
>>
>> https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
>>
>> > Can xpath / lxml do that?
>> >
>> > What are the chief differences between xpath / lxml and Beautiful Soup?
>> >
>>
>> I've never directly used lxml, mainly because bs4 offers all the same
>> advantages and more, with about the same costs. However, if you're
>> looking for a no-external-deps option, Python *does* include an HTML
>> parser in the standard library:
>>
>
> But isn't bs4 only for SOAP content?
> Can bs4 or lxml cope with HTML code that does not comply with XML as the
> following fragment?
>
> <p>A
> <p>B
> <hr>
>
> BR,
> Roland
>
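For the no-external-deps route mentioned in the quoted text, here is a
minimal sketch using the standard library's html.parser on the fragment
above, unclosed <p> tags and all. It only collects text and throws the tags
away; it builds no tree, so it cannot pick which nodes to keep. The names
TagStripper and strip_tags are made up for illustration and are not part of
any library.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keep text nodes, discard every tag."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; entities such as &amp;
        # arrive already decoded because convert_charrefs is True.
        self.parts.append(data)


def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.parts)


# The non-XML fragment from the question: no closing </p>, bare <hr>.
fragment = "<p>A\n<p>B\n<hr>\n"
print(strip_tags(fragment).split())  # prints ['A', 'B']
```

Note that this also keeps the contents of <script> and <style> elements as
text, so real pages would likely need extra handling for those.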
Check out the bs4 docs for some of the things you can do with it :)

ChrisA