Two things. First, you can download the page as a string and delete everything between the tag delimiters (each '<' and its matching '>'). Second, it might be worth looking at Udacity CS101, as that course is all about building a search engine.

On Sat, 5 May 2018 at 22:27, Simon Connah <scopensou...@gmail.com> wrote:
> Hi,
>
> I'm writing a very simple web scraper. It'll download a page from a
> website and then store the result in a database of some sort. The
> problem is that this will obviously include a whole heap of HTML,
> JavaScript and maybe even some CSS. None of which is useful to me.
>
> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> HTML.
>
> The title is obviously easy but the main body of text could contain
> all sorts of HTML and I'm interested to know how I might go about
> removing the bits that are not needed but still keep the meaning of
> the document intact.
>
> Does anyone have any suggestions on this front at all?
>
> Thanks for any help.
>
> Simon.
> _______________________________________________
> Tutor maillist - Tutor@python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
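
The first suggestion in the reply (download the page as a string and drop the markup) could look roughly like the sketch below. It assumes only the Python standard library. The names TextExtractor, extract_text and download are made up for illustration and do not come from the thread. It also skips the contents of <script> and <style>, since simply deleting the tags would leave the JavaScript and CSS behind as text. It does not try to work out which part of the page is the "main body"; it only removes the markup.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}  # tags whose contents are not readable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.parts.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def download(url):
    """Fetch a page and decode it to a string (hypothetical helper)."""
    with urlopen(url) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


if __name__ == "__main__":
    page = (
        "<html><head><title>Demo</title>"
        "<style>p { color: red; }</style></head>"
        "<body><h1>Hello</h1><p>Some <b>bold</b> text.</p>"
        "<script>var x = 1;</script></body></html>"
    )
    print(extract_text(page))
```

For a real page you would call extract_text(download(url)) with a URL of your choosing. Navigation menus, footers and similar page furniture will still be in the output, so picking out just the article text needs extra work beyond this.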