Re: [Tutor] Extract main text from HTML document

Simon Connah Sun, 06 May 2018 11:30:56 -0700

Thanks for the replies, everyone. Beautiful Soup looks like a good option.

My primary goal is to extract the main body text, the title and the
meta description from a web page and run them through one of the cloud
natural language processing services to get some information I'm
interested in. I'd like to do this for quite a few websites.

This is all for a little project I have in mind. I'm not even sure if
it'll work, but it'll be fun to try. I might have to do some custom
work on top of what Beautiful Soup offers, though, as I need to get very
specific data in a certain format.
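
To make the goal concrete, here is a rough sketch of pulling out the
title, the meta description and the visible body text. It deliberately
uses only the standard library's html.parser rather than Beautiful
Soup, so it runs without installing anything. The names PageSummary,
extract and sample are made up for illustration, and "body text" here
just means everything inside <body> except script/style/noscript,
which is an assumption and not a real main-content detector.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Hypothetical helper: collects title, meta description, body text."""

    # Tags whose contents are never shown to the reader (assumed list).
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.chunks = []        # pieces of visible body text, in order
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0    # > 0 while inside a SKIP tag

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "description":
                self.description = (attrs.get("content") or "").strip()
        elif tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())   # collapse runs of whitespace
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._in_body and not self._skip_depth:
            self.chunks.append(text)


def extract(html):
    """Return a dict with the three fields for one HTML document."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "description": parser.description,
        "text": " ".join(parser.chunks),
    }


sample = """<html><head><title>Example page</title>
<meta name="description" content="A short demo page.">
</head><body><h1>Hello</h1><script>var x = 1;</script>
<p>Main   body
text.</p></body></html>"""

if __name__ == "__main__":
    print(extract(sample))
```

Real pages will need more than this (navigation, footers and adverts
all live inside <body> too), which is where Beautiful Soup or some
custom filtering on top of it would come in.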

On 5 May 2018 at 22:43, boB Stepp <[email protected]> wrote:
> On Sat, May 5, 2018 at 12:59 PM, Simon Connah <[email protected]> wrote:
>
>> I was wondering if there was a way in which I could download a web
>> page and then just extract the main body of text without all of the
>> HTML.
>
> I do not have any experience with this, but I like to collect books.
> One of them [1] says on page 245:
>
> "Beautiful Soup is a module for extracting information from an HTML
> page (and is much better for this purpose than regular expressions)."
>
> I believe this topic has come up before on this list as well as the
> main Python list.  You may want to check it out.  It can be installed
> with pip.
>
> [1] "Automate the Boring Stuff with Python -- Practical Programming
> for Total Beginners" by Al Sweigart.
>
> HTH!
> --
> boB
> _______________________________________________
> Tutor maillist  -  [email protected]
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor