Thanks for the replies, everyone. Beautiful Soup looks like a good option. My primary goal is to extract the main body text, the title and the meta description from a web page, then run them through one of the cloud natural language processing services to find out some information I'm interested in. I'd like to do this for quite a few websites.
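
To make the goal concrete, here is a rough sketch of the extraction step. It deliberately uses only the standard library's html.parser instead of Beautiful Soup, so it runs without installing anything; Beautiful Soup would likely make this shorter and more forgiving of messy markup. The names PageTextExtractor and extract_page_parts and the sample page are made up for illustration, and treating "everything inside <body> except script/style" as the main text is an assumption that real pages will often break.

```python
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collects the <title>, the meta description and the visible body text."""

    # Tags whose contents are code or styling, not readable text.
    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.description = None
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_body and not self._skip_depth:
            self.body_parts.append(data)


def extract_page_parts(html):
    """Return the title, meta description and body text of an HTML string."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        # Collapse runs of whitespace into single spaces.
        "title": " ".join("".join(parser.title_parts).split()),
        "description": parser.description,
        # Text chunks are joined with spaces, so a word split across
        # inline tags (e.g. un<b>believ</b>able) would gain extra spaces.
        "body_text": " ".join(" ".join(parser.body_parts).split()),
    }


# Hypothetical sample page, standing in for a downloaded one.
sample = """<html><head><title> Example  page </title>
<meta name="description" content="A short summary.">
<style>p { color: red; }</style></head>
<body><h1>Hello</h1><p>Main <b>body</b> text.</p>
<script>var x = 1;</script></body></html>"""

parts = extract_page_parts(sample)
print(parts)
```

The resulting dictionary is what I would then pass on to the language processing service. Downloading the page itself (for example with urllib.request.urlopen) and separating the "main" text from navigation and footers are left out of this sketch.
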
This is all for a little project I have in mind. I'm not even sure if it'll work but it'll be fun to try. I might have to do some custom work on top of what Beautiful Soup offers though as I need to get very specific data in a certain format.

On 5 May 2018 at 22:43, boB Stepp <robertvst...@gmail.com> wrote:
> On Sat, May 5, 2018 at 12:59 PM, Simon Connah <scopensou...@gmail.com> wrote:
>
>> I was wondering if there was a way in which I could download a web
>> page and then just extract the main body of text without all of the
>> HTML.
>
> I do not have any experience with this, but I like to collect books.
> One of them [1] says on page 245:
>
> "Beautiful Soup is a module for extracting information from an HTML
> page (and is much better for this purpose than regular expressions)."
>
> I believe this topic has come up before on this list as well as the
> main Python list. You may want to check it out. It can be installed
> with pip.
>
> [1] "Automate the Boring Stuff with Python -- Practical Programming
> for Total Beginners" by Al Sweigart.
>
> HTH!
> --
> boB
> _______________________________________________
> Tutor maillist - Tutor@python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor